Golden Signals and High-Value Service Metrics

How latency, traffic, errors, and saturation create a compact service-health model and where that model still needs local adaptation.

Golden signals are a compact way to describe service health through latency, traffic, errors, and saturation. The idea is valuable because it gives teams a small default set of questions to ask before every system invents its own health model. When a service is degraded, responders usually need to know some version of the same things: are requests slower, has load changed, are failures increasing, and is the system running out of some constrained resource?

That compact model is useful, but it should not be treated as magic vocabulary. A background worker, a streaming pipeline, and an interactive API do not express “traffic” or “latency” in identical ways. The principle is stable; the concrete metric set still needs to reflect how the system creates user value.

    flowchart LR
        A["User request or workload"] --> B["Traffic"]
        A --> C["Latency"]
        A --> D["Errors"]
        A --> E["Saturation"]
        B --> F["Service health view"]
        C --> F
        D --> F
        E --> F

The Golden Signals Are A Starting Model

The practical value of the model is prioritization:

  • latency shows whether work is completing within acceptable time
  • traffic shows demand, request volume, or work intake
  • errors show failed or degraded outcomes
  • saturation shows whether a constrained resource is approaching exhaustion

Used together, these signals tell a better story than any single chart. Rising latency without rising traffic suggests a different problem than rising latency during a sudden demand spike. Rising errors with flat saturation suggests a different failure mode than high error rates during resource exhaustion.
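The signal combinations above can be sketched as a small triage helper. This is an illustrative sketch, not a real runbook: the boolean inputs and hypothesis strings are invented for the example.

```python
# Sketch of signal-combination triage. The hypothesis strings are
# illustrative examples, not production diagnoses.

def triage(latency_up: bool, traffic_up: bool,
           errors_up: bool, saturated: bool) -> str:
    """Map a combination of golden-signal movements to a first hypothesis."""
    if latency_up and traffic_up:
        return "demand spike: latency rising under load"
    if latency_up and not traffic_up:
        return "internal slowdown: latency rising without added demand"
    if errors_up and saturated:
        return "resource exhaustion: failures under saturation"
    if errors_up:
        return "logic or dependency failure: errors with headroom left"
    return "no obvious degradation"

print(triage(latency_up=True, traffic_up=False,
             errors_up=False, saturated=False))
```

The point is not the specific rules but the shape: each branch reads two or more signals together, which is what makes the combined view more diagnostic than any single chart.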

    service_health:
      latency:
        metric: http_request_duration_seconds
        focus: ["p50", "p95", "p99"]
      traffic:
        metric: http_requests_total
        derived_view: requests_per_second
      errors:
        metric: http_requests_total
        filter: 'status=~"5.."'
        derived_view: error_rate
      saturation:
        metric: worker_pool_in_use_ratio
        threshold: 0.85
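The derived views named in the config are computed from counter samples: traffic is the rate of increase of the total counter, and error rate is the 5xx rate divided by the total rate. A minimal sketch, with made-up counter values and a 60-second sampling window:

```python
# Derive requests_per_second and error_rate from two samples of a
# monotonically increasing request counter. The counter values and
# the 60-second window are illustrative.

def per_second_rate(earlier: float, later: float, window_s: float) -> float:
    """Rate of increase of a counter across a sampling window."""
    return (later - earlier) / window_s

total_rps = per_second_rate(earlier=12_000, later=12_600, window_s=60)  # all requests
error_rps = per_second_rate(earlier=150, later=162, window_s=60)        # 5xx only
error_rate = error_rps / total_rps

print(f"traffic: {total_rps:.1f} req/s, error rate: {error_rate:.1%}")
```

Expressing errors as a ratio of traffic, rather than a raw count, keeps the signal meaningful when demand itself is moving.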

High-Value Metrics Depend On Service Shape

A good service-health panel often includes the golden signals plus a few service-specific metrics:

  • queue age for async consumers
  • lag for streaming systems
  • cache hit ratio for latency-sensitive read paths
  • backlog depth for workers
  • throttling rate for dependency-bound services

The mistake is not extending the model. The mistake is either treating the generic four as enough for every workload, or exploding the dashboard with dozens of equally important charts so that nothing is actually important.
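Two of the service-specific metrics above can be sketched in a few lines; the counter and offset values here are invented for the example.

```python
# Illustrative service-specific signals alongside the golden four.
# All input values are made up for the example.

def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of reads served from cache on a latency-sensitive path."""
    total = hits + misses
    return hits / total if total else 0.0

def consumer_lag(producer_offset: int, consumer_offset: int) -> int:
    """How far a streaming consumer trails the head of its partition."""
    return producer_offset - consumer_offset

print(cache_hit_ratio(hits=970, misses=30))
print(consumer_lag(producer_offset=10_500, consumer_offset=10_420))
```

Each extension earns its chart by answering a question the generic four cannot: a falling hit ratio explains latency before latency itself moves, and growing lag shows demand outpacing processing even while error rates stay flat.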

Design Review Question

If a queue consumer team tracks only CPU and memory, but never monitors backlog growth or message age, what part of service health is missing?

The stronger answer is workload-specific traffic and latency meaning. The team is watching machine resources, but not the flow and freshness of the work the service exists to process.
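What the team is missing can be made concrete: backlog depth and oldest-message age are the queue consumer's versions of traffic and latency. A minimal sketch, with hypothetical enqueue timestamps:

```python
# Sketch of workload-level queue health: backlog depth and oldest-message
# age. The timestamps are illustrative, not from a real broker.

def queue_health(enqueue_times: list[float], now: float) -> dict:
    """Depth and age of pending work, independent of CPU or memory."""
    if not enqueue_times:
        return {"backlog_depth": 0, "oldest_age_s": 0.0}
    return {
        "backlog_depth": len(enqueue_times),
        "oldest_age_s": now - min(enqueue_times),
    }

now = 1_000.0
pending = [880.0, 940.0, 990.0]  # enqueue timestamps of unprocessed messages
print(queue_health(pending, now))
```

A consumer can run with healthy CPU and memory while the oldest message quietly ages past its usefulness; these two numbers surface that failure mode directly.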

Revised on Thursday, April 23, 2026