How head sampling, tail sampling, and retention policies change what trace evidence survives when incidents happen.
Sampling and retention decide which traces exist when responders need evidence most. Full-fidelity tracing can become expensive very quickly, especially in high-throughput systems. Most platforms therefore sample, retain selectively, or store different levels of detail for different time horizons. Those decisions are not only budget decisions. They shape what kinds of incidents can still be investigated later.
Head sampling makes the decision near the beginning of a request, before the full outcome is known. Tail sampling waits longer and can prefer slow, failed, or otherwise interesting traces. Retention adds another layer: even if a trace is captured, the detailed version may not remain available for long. Teams therefore need to design their sampling policy around the incidents and service objectives they expect to investigate, not only around raw ingest volume.
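The contrast can be sketched in code. This is a minimal illustration, not any vendor's API: the function names, the trace-ID hashing scheme, and the thresholds are all assumed for the example.

```python
import random
from dataclasses import dataclass

@dataclass
class Trace:
    trace_id: int
    status: str        # "ok" or "error" (assumed label set)
    duration_ms: float

def head_sample(trace_id: int, rate: float = 0.05) -> bool:
    """Decided at request start, before the outcome is known.

    A deterministic function of the trace ID keeps all spans of one
    trace consistent, but errors and slow requests get no special
    treatment -- the decision is blind to how the request ends.
    """
    return (trace_id % 10_000) < rate * 10_000

def tail_sample(trace: Trace) -> bool:
    """Decided after the trace completes, so it can prefer
    slow, failed, or otherwise interesting traces."""
    if trace.status == "error":
        return True                    # keep every failure
    if trace.duration_ms > 1000:
        return True                    # keep every slow request
    return random.random() < 0.02      # thin sample of normal successes
```

The price of tail sampling is that every span must be buffered until the trace completes, which is the "more processing complexity" branch in the diagram below.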
```mermaid
flowchart TD
A["Incoming requests"] --> B{"Sampling decision"}
B --> C["Head sampling"]
B --> D["Tail sampling"]
C --> E["Lower cost, less outcome awareness"]
D --> F["Better selection, more processing complexity"]
E --> G["Retention policy"]
F --> G
G --> H["What evidence is available later"]
```
A healthy policy often mixes several ideas:
```yaml
trace_sampling:
  head_sample_rate: 0.05
  tail_policies:
    - name: keep_errors
      match: "status=error"
      sample_rate: 1.0
    - name: keep_slow_requests
      match: "duration_ms > 1000"
      sample_rate: 1.0
    - name: keep_normal_successes
      match: "status=ok"
      sample_rate: 0.02
retention:
  hot_days: 3
  searchable_days: 14
  archive_days: 30
```
What to notice:

- Errors and slow requests (`duration_ms > 1000`) are kept at 100%, so the traces most likely to matter in an incident always survive.
- Normal successes are kept at only 2%, enough to establish baselines without dominating ingest cost.
- Retention is tiered: full detail stays hot for 3 days, remains searchable for 14, and is archived for 30.
Weak sampling policies often fail in one of two ways: they sample uniformly at a low rate and drop the rare failures and latency outliers that matter most, or they capture the right traces but retain the detailed versions for such a short window that the evidence expires before anyone investigates.
The strongest policies are explicit about what they optimize for: failed requests, tail latency, rare workflows, or high-value customer journeys.
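One way to make that explicitness concrete is a first-match-wins evaluator over the tail policies in the config above. This is a hypothetical sketch: the `decide` function, the dict-shaped trace, and the policy ordering are assumptions for illustration.

```python
import random

# Mirrors the tail_policies from the config: order encodes priority,
# so keep_errors wins over keep_normal_successes for a failed request.
TAIL_POLICIES = [
    ("keep_errors",           lambda t: t["status"] == "error",  1.0),
    ("keep_slow_requests",    lambda t: t["duration_ms"] > 1000, 1.0),
    ("keep_normal_successes", lambda t: t["status"] == "ok",     0.02),
]

def decide(trace: dict) -> tuple[str, bool]:
    """Return the first matching policy name and the keep decision."""
    for name, match, rate in TAIL_POLICIES:
        if match(trace):
            return name, random.random() < rate
    return "no_match", False
```

Writing policies this way forces the team to state, in reviewable form, exactly which traces they are optimizing to keep.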
If a platform uses simple low-rate head sampling for all traffic and later misses most rare slow requests, what was the core design weakness?
The stronger answer is that the sampling policy was blind to outcome quality. It controlled cost, but it did not preserve the rare traces most useful for diagnosing slow failures.
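The scale of the loss is easy to estimate. With assumed numbers (1M requests/day, rare slow failures at 0.1% of traffic), a 5% uniform head sample keeps only a few dozen of the interesting traces:

```python
# Back-of-envelope comparison; all figures are illustrative assumptions.
requests_per_day = 1_000_000
rare_slow_share = 0.001          # rare slow failures: 0.1% of traffic
head_rate = 0.05                 # uniform head sampling rate

rare_slow = requests_per_day * rare_slow_share   # ~1000 interesting traces/day
kept_by_head = rare_slow * head_rate             # uniform head sample keeps ~50
kept_by_tail = rare_slow * 1.0                   # outcome-aware rate 1.0 keeps all
print(kept_by_head, kept_by_tail)                # prints: 50.0 1000.0
```

About 95% of the rare slow traces never exist in storage, which is why the weakness is in the policy design rather than in any single incident response.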