Sampling, Cost, and Practical Trace Retention

How head sampling, tail sampling, and retention policies change what trace evidence survives when incidents happen.

Sampling and retention decide which traces exist when responders need evidence most. Full-fidelity tracing can become expensive very quickly, especially in high-throughput systems. Most platforms therefore sample, retain selectively, or store different levels of detail for different time horizons. Those decisions are not purely budgetary: they determine which kinds of incidents can still be investigated later.

Head sampling makes the decision near the beginning of a request, before the full outcome is known. Tail sampling waits longer and can prefer slow, failed, or otherwise interesting traces. Retention adds another layer: even if a trace is captured, the detailed version may not remain available for long. Teams therefore need to design their sampling policy around the incidents and service objectives they expect to investigate, not only around raw ingest volume.

    flowchart TD
        A["Incoming requests"] --> B{"Sampling decision"}
        B --> C["Head sampling"]
        B --> D["Tail sampling"]
        C --> E["Lower cost, less outcome awareness"]
        D --> F["Better selection, more processing complexity"]
        E --> G["Retention policy"]
        F --> G
        G --> H["What evidence is available later"]
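The timing difference between the two approaches can be sketched in code. This is an illustrative sketch, not any specific tracing SDK: the function names and the trace dictionary shape are assumptions, and the thresholds mirror the kind of policy discussed below.

```python
import hashlib
import random

def head_sample(trace_id: str, rate: float = 0.05) -> bool:
    """Decide at request start, before the outcome is known.
    Hashing the trace ID keeps the decision consistent across
    every service that sees the same trace."""
    digest = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16)
    return digest / 0xFFFFFFFF < rate

def tail_sample(trace: dict) -> bool:
    """Decide after the trace completes, when outcome data exists."""
    if trace["status"] == "error":
        return True                    # always keep failures
    if trace["duration_ms"] > 1000:
        return True                    # always keep very slow requests
    return random.random() < 0.02      # routine successes at a low rate
```

Head sampling cannot express the first two branches of `tail_sample`, because at decision time the status and duration do not exist yet.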

Sampling Changes What You Can Prove

A healthy policy often mixes several ideas:

  • always keep error traces
  • strongly prefer very slow traces
  • sample routine successful traffic at a low rate
  • retain a short hot window with full detail
  • move older trace data to cheaper, reduced-detail storage if the platform supports it

    trace_sampling:
      head_sample_rate: 0.05
      tail_policies:
        - name: keep_errors
          match: "status=error"
          sample_rate: 1.0
        - name: keep_slow_requests
          match: "duration_ms > 1000"
          sample_rate: 1.0
        - name: keep_normal_successes
          match: "status=ok"
          sample_rate: 0.02

    retention:
      hot_days: 3
      searchable_days: 14
      archive_days: 30
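One way a collector might evaluate tail policies like these against a completed trace is first-match-wins, in declaration order. This is a minimal sketch under that assumption; the matcher lambdas stand in for the `match` expressions in the config and are not a real query language.

```python
import random

# Mirrors the policy list above; matchers are illustrative stand-ins
# for the config's match expressions.
TAIL_POLICIES = [
    ("keep_errors",          lambda t: t["status"] == "error",   1.0),
    ("keep_slow_requests",   lambda t: t["duration_ms"] > 1000,  1.0),
    ("keep_normal_successes", lambda t: t["status"] == "ok",     0.02),
]

def decide(trace: dict) -> bool:
    """Apply the first matching policy's sample rate; drop if none match."""
    for _name, matches, rate in TAIL_POLICIES:
        if matches(trace):
            return random.random() < rate
    return False
```

Ordering matters: a slow error trace hits `keep_errors` first, so the later, lower-rate policies never see it.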

What to notice:

  • the policy is biased toward traces with operational value
  • not all traffic is treated equally
  • retention is tiered because investigation value changes over time
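Tiered retention reduces to an age lookup. The tier names follow the retention config above; the function itself is a hypothetical sketch, not a platform API.

```python
def retention_tier(age_days: int,
                   hot_days: int = 3,
                   searchable_days: int = 14,
                   archive_days: int = 30) -> str:
    """Map a trace's age to the most detailed tier that still holds it."""
    if age_days <= hot_days:
        return "hot"          # full detail, fast queries
    if age_days <= searchable_days:
        return "searchable"   # reduced detail, still queryable
    if age_days <= archive_days:
        return "archive"      # cheap storage, slow to restore
    return "expired"
```

The practical consequence: an investigation opened on day 10 works with less detail than one opened on day 2, which is exactly the trade the policy is making explicit.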

Cost Control Should Not Blind The System

Weak sampling policies often fail in one of two ways:

  • they keep too much and make the platform expensive and noisy
  • they keep too little of the important cases, so incidents cannot be reconstructed

The strongest policies are explicit about what they optimize for: failed requests, tail latency, rare workflows, or high-value customer journeys.

Design Review Question

If a platform uses simple low-rate head sampling for all traffic and later misses most rare slow requests, what was the core design weakness?

The stronger answer is that the sampling policy was blind to outcome quality. It controlled cost, but it did not preserve the rare traces most useful for diagnosing slow failures.

Revised on Thursday, April 23, 2026