How to design logs, metrics, traces, and events starting from the operational questions teams need to answer, not from tool defaults.
Question-driven instrumentation starts from what operators need to know, not from what a telemetry platform happens to collect easily. This is one of the clearest differences between mature observability and accidental observability. Mature systems ask: which questions appear during failures, latency regressions, release reviews, and customer escalations, and what evidence must exist so those questions can be answered quickly?
Teams that skip this step often instrument by signal type instead: logs wherever logging is convenient, metrics wherever a framework exports them by default, traces wherever auto-instrumentation reaches. That may produce activity, but it does not guarantee diagnosability. A better approach maps each operational question to the evidence needed to answer it.
Useful starting questions are concrete:

- Which dependency caused checkout latency?
- Which customers are affected?
- Did the async retry path succeed later?

These questions are strong because they force clarity about scope, causality, and decision-making.
```mermaid
flowchart LR
    A["Operational question"] --> B["Choose evidence"]
    B --> C["Logs"]
    B --> D["Metrics"]
    B --> E["Traces"]
    B --> F["Events"]
    C --> G["Dashboards, triage, and incident response"]
    D --> G
    E --> G
    F --> G
```
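The diagram's flow can be made concrete: when a single operation runs, it should leave behind every piece of evidence its questions require. Below is a minimal sketch using only the standard library; the function names, metric store, and field names are all hypothetical, and tracing is omitted for brevity (a real system would use a metrics and tracing client):

```python
import json
import logging
import time
from collections import defaultdict

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

# Hypothetical in-process stores; real systems would use a metrics client
# and an event pipeline instead of module-level collections.
LATENCY_BUCKETS: dict[str, list[float]] = defaultdict(list)
EVENTS: list[dict] = []


def call_payment_gateway() -> None:
    """Stand-in for a real dependency call."""
    time.sleep(0.01)


def checkout(order_id: str) -> None:
    start = time.monotonic()
    call_payment_gateway()
    elapsed = time.monotonic() - start

    # Metric: answers "is checkout latency regressing, and where?"
    LATENCY_BUCKETS["checkout/payment_gateway"].append(elapsed)

    # Structured log: answers "what exactly happened for this order?"
    log.info(json.dumps({
        "route": "checkout",
        "order_id": order_id,
        "dependency": "payment_gateway",
        "elapsed_s": round(elapsed, 4),
    }))

    # Event: answers "did this workflow reach the paid state?"
    EVENTS.append({"type": "order_paid", "order_id": order_id})
```

The point of the sketch is that one code path deliberately emits a bundle of evidence, each piece tied to a question an operator will actually ask.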
A common mistake is trying to map each question to exactly one telemetry family. Real questions usually need more than one: a metric to detect the regression, a trace to localize it, and a log to explain it.
This is why instrumentation design should think in evidence bundles rather than in one-signal answers.
```yaml
question_map:
  - question: "Which dependency caused checkout latency?"
    evidence:
      - latency_histogram_by_route
      - dependency_trace_spans
      - structured_timeout_logs
  - question: "Which customers are affected?"
    evidence:
      - route_error_rate_by_region
      - tenant_aware_logs
      - workflow_state_events
  - question: "Did the async retry path succeed later?"
    evidence:
      - queue_age_metric
      - retry_scheduled_event
      - worker_completion_log
```
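A map like this is only useful while it stays honest. One lightweight way to enforce that is a lint check run in review: every question must list at least one piece of evidence, and evidence names must follow one convention. The sketch below assumes the map has been loaded into a Python structure mirroring the YAML; the structure and function names are illustrative:

```python
import re

# Mirrors the YAML question map as a Python structure for checking.
QUESTION_MAP = [
    {"question": "Which dependency caused checkout latency?",
     "evidence": ["latency_histogram_by_route",
                  "dependency_trace_spans",
                  "structured_timeout_logs"]},
    {"question": "Which customers are affected?",
     "evidence": ["route_error_rate_by_region",
                  "tenant_aware_logs",
                  "workflow_state_events"]},
    {"question": "Did the async retry path succeed later?",
     "evidence": ["queue_age_metric",
                  "retry_scheduled_event",
                  "worker_completion_log"]},
]

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")


def lint_question_map(entries: list[dict]) -> list[str]:
    """Return a list of problems: empty bundles or non-conforming names."""
    problems = []
    for entry in entries:
        if not entry.get("evidence"):
            problems.append(f"no evidence listed for: {entry['question']!r}")
        for name in entry.get("evidence", []):
            if not SNAKE_CASE.match(name):
                problems.append(f"non-conforming evidence name: {name!r}")
    return problems
```

Running `lint_question_map(QUESTION_MAP)` on a clean map returns an empty list; a question with no evidence, or an evidence name that does not match any registered signal convention, surfaces as a review-time problem rather than an incident-time surprise.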
What to notice: each question maps to a bundle that spans more than one signal family, and each evidence item is specific enough to implement and to verify during an incident.
Question-first design also helps prevent vanity dashboards and low-value logs. If a metric, field, or trace attribute cannot be tied to a meaningful operational question, it should face a high bar before being added. This does not mean every signal needs a unique question written in a document. It means signals should exist because they help people reason, not because they are easy to emit.
This is especially important in large systems, where every extra label, record, or span attribute has cost and complexity.
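That cost compounds multiplicatively: a metric's time-series count is roughly the product of the cardinalities of its labels, so one added label can multiply storage and query cost by its value count. A back-of-envelope helper, with purely illustrative numbers:

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Approximate a metric's time-series count as the product of
    per-label value counts (the worst case, where all combinations occur)."""
    return prod(label_cardinalities.values())


# Illustrative numbers: adding a tenant label to a per-route latency metric.
before = series_count({"route": 40, "region": 6, "status": 5})
after = series_count({"route": 40, "region": 6, "status": 5, "tenant": 500})
# 'after' is 500x 'before': one label turned 1,200 series into 600,000.
```

The arithmetic is crude (real label combinations are sparser than the full cross product), but it makes the review question concrete: which operational question justifies multiplying this metric's footprint?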
The best observability questions are not trivia questions. They are decision questions: should this release roll back, should this incident escalate, should this customer be notified?
If a signal does not improve one of those choices, it may not deserve long-term collection.
If a team says “we have a lot of telemetry” but cannot list which operational questions that telemetry is meant to answer, what risk is it taking? The honest answer is that it is collecting evidence without a diagnosis model. That usually leads to noisy dashboards, missing context, and expensive data that does not support better decisions.