What to Measure in a Cache

March 26, 2026

Cache metrics that explain correctness, load reduction, latency, and incident risk rather than only hit rate.

Cache observability starts by measuring more than hit rate. Hit rate is useful, but it is only one projection of behavior. A cache can have a good hit rate and still serve stale answers, overload the origin during miss bursts, hide hot-key concentration, or evict the wrong data under pressure.

The purpose of measurement is not to admire the cache. It is to answer operational questions:

Is the cache reducing origin work where it matters?
Is freshness staying within the promised bounds?
Are hot keys, evictions, or invalidation failures creating risk?
Is the cache helping latency at the user-visible layers, not just internally?

    flowchart TD
        A["Requests"] --> B["Hit / miss metrics"]
        A --> C["Latency metrics"]
        D["Invalidation pipeline"] --> E["Purge success / lag"]
        F["Cache storage"] --> G["Memory / cardinality / eviction"]
        B --> H["Operational picture"]
        C --> H
        E --> H
        G --> H

Why It Matters

The wrong metrics create false confidence. A cache dashboard that shows only hit rate may hide:

stale serves rising during incident conditions
one hot key dominating load
invalidation events lagging far behind writes
eviction churn causing preventable backend misses

Good measurement ties the cache back to product behavior, origin protection, and freshness promises.

Core Metrics

A useful cache dashboard usually covers several categories of metric:

request outcomes: hit rate, miss rate, error rate, stale-serve rate
latency: cache lookup time, origin fallback time, end-to-end response time
storage pressure: memory usage, key count, item size distribution, eviction rate
invalidation health: purge success, purge lag, event backlog, replay counts
workload shape: hot-key concentration, request skew, top miss families

    cache_metrics:
      request:
        - hit_rate
        - miss_rate
        - stale_serve_rate
      latency:
        - cache_lookup_p95_ms
        - origin_fallback_p95_ms
        - end_to_end_p95_ms
      storage:
        - memory_used_bytes
        - eviction_rate
        - key_cardinality
      invalidation:
        - purge_success_rate
        - purge_lag_seconds
        - event_backlog
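As a rough illustration, the request-outcome rates above could be derived from raw counters collected over one measurement window. This is a minimal sketch, not a prescribed implementation: the RequestCounters type and its field names are hypothetical, and it assumes stale serves are counted as a subset of hits.

```python
from dataclasses import dataclass


@dataclass
class RequestCounters:
    """Raw counts for one measurement window (hypothetical field names)."""
    hits: int
    misses: int
    errors: int
    stale_serves: int  # assumption: hits that returned an entry past its freshness lifetime


def outcome_rates(c: RequestCounters) -> dict[str, float]:
    """Convert raw counts into rates over all lookups in the window."""
    total = c.hits + c.misses + c.errors
    if total == 0:
        # No traffic in the window: report zeros instead of dividing by zero.
        return {"hit_rate": 0.0, "miss_rate": 0.0,
                "error_rate": 0.0, "stale_serve_rate": 0.0}
    return {
        "hit_rate": c.hits / total,
        "miss_rate": c.misses / total,
        "error_rate": c.errors / total,
        # Stale serves are a subset of hits here, so this overlaps hit_rate.
        "stale_serve_rate": c.stale_serves / total,
    }


rates = outcome_rates(RequestCounters(hits=900, misses=90, errors=10, stale_serves=45))
print(rates)
```

Reporting stale_serve_rate next to hit_rate makes it visible when a healthy-looking hit rate is partly made up of stale answers.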

What To Notice

Not all misses are equally important. A miss on a low-volume cold key may not matter. A miss on a hot homepage key during peak load matters a great deal. The same is true for stale serves. A bounded stale serve during stale-while-revalidate may be intentional and healthy. A stale serve after a sensitive write may be a bug.

That is why effective observability usually breaks metrics down by:

cache layer
key family
tenant or product surface where safe
region or node
freshness-sensitive versus latency-sensitive endpoints
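The effect of these breakdowns can be sketched with a few lines of aggregation. The sample data, layer names, and key families below are invented for illustration; the point is only that a single global hit rate can look acceptable while one layer or key family is doing badly.

```python
from collections import defaultdict

# Hypothetical samples: (cache_layer, key_family, hits, misses)
samples = [
    ("edge", "homepage", 9_500, 500),
    ("edge", "search", 4_000, 1_000),
    ("app", "homepage", 400, 100),
    ("app", "user_profile", 50, 450),
]


def hit_rate_by(rows, key_fn):
    """Aggregate hits and misses per group, then compute each group's hit rate."""
    totals = defaultdict(lambda: [0, 0])  # group -> [hits, misses]
    for layer, family, hits, misses in rows:
        bucket = totals[key_fn(layer, family)]
        bucket[0] += hits
        bucket[1] += misses
    return {group: h / (h + m) for group, (h, m) in totals.items() if h + m > 0}


global_rate = hit_rate_by(samples, lambda layer, family: "all")["all"]
by_layer = hit_rate_by(samples, lambda layer, family: layer)
by_family = hit_rate_by(samples, lambda layer, family: (layer, family))

print(f"global: {global_rate:.3f}")
for group, rate in sorted(by_family.items(), key=lambda item: item[1]):
    print(group, f"{rate:.3f}")
```

With this invented data the global hit rate is about 0.87, while the app layer sits at 0.45 and its user_profile family at 0.10, which the global number alone would hide.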

Example

This sample dashboard model groups cache metrics around the actions they support rather than around numbers that merely look good.

    dashboards:
      cache_health:
        panels:
          - hit_rate_by_layer
          - origin_fallback_qps
          - stale_serve_rate_by_endpoint
          - top_20_hot_keys
          - eviction_rate_by_node
          - invalidation_lag_seconds

What to notice:

origin fallback volume matters as much as cache hit rate
stale serving should be measured intentionally, not treated as invisible behavior
per-layer views stop one healthy cache tier from hiding problems in another
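A hot-key panel such as top_20_hot_keys is often paired with a concentration figure: the share of all requests going to the top N keys. The sketch below assumes a sample of requested keys is available, for example from a sampled access log; the function name and the sample data are hypothetical.

```python
from collections import Counter


def top_key_share(request_keys, n=1):
    """Return the fraction of requests that went to the n most-requested keys."""
    counts = Counter(request_keys)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(n))
    return top / total


# Invented sample: one key dominates the traffic.
keys = ["home"] * 70 + ["product:1"] * 20 + ["product:2"] * 10

share_top1 = top_key_share(keys, n=1)
share_top2 = top_key_share(keys, n=2)
print(f"top-1 share: {share_top1:.2f}, top-2 share: {share_top2:.2f}")
```

A rising top-1 share is an early signal that a single eviction, expiry, or invalidation of that key could translate into a burst of origin load.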

Common Mistakes

tracking only hit rate and ignoring origin load
not separating expected stale serves from freshness incidents
measuring globally while missing regional or hot-key hotspots
failing to tie invalidation lag and purge health to user-visible impact
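One way to avoid the second mistake is to label each stale serve at the point where it is recorded. The sketch below is an assumption-heavy illustration: the StaleServe record, its fields, and the rule that a stale serve is "expected" only inside a configured stale-while-revalidate window are hypothetical, and a real system may need additional conditions, such as whether a sensitive write preceded the read.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class StaleServe:
    """One stale response (hypothetical record shape)."""
    endpoint: str
    seconds_past_expiry: float
    swr_window_seconds: float  # 0 means the endpoint never permits stale responses


def classify(event: StaleServe) -> str:
    """Label a stale serve as bounded-and-intended or as a freshness incident."""
    within_window = (
        event.swr_window_seconds > 0
        and event.seconds_past_expiry <= event.swr_window_seconds
    )
    return "expected" if within_window else "freshness_incident"


events = [
    StaleServe("/home", seconds_past_expiry=4, swr_window_seconds=30),
    StaleServe("/home", seconds_past_expiry=95, swr_window_seconds=30),
    StaleServe("/account/balance", seconds_past_expiry=2, swr_window_seconds=0),
]

summary = Counter(classify(e) for e in events)
print(dict(summary))
```

Emitting the two labels as separate counters lets a dashboard alert on freshness incidents without firing on intentional stale-while-revalidate traffic.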

Design Review Question

How would you tell whether a cache incident is a freshness problem, a capacity problem, or an invalidation problem?

A strong answer is that the team needs correlated signals: stale-serve behavior, miss bursts, origin fallback volume, eviction churn, and invalidation lag. A single metric rarely reveals the cause. The diagnosis comes from how those signals move together.
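The idea of reading signals together can be sketched as a small triage helper that compares current values with a baseline. Everything specific here is an assumption made for illustration: the Signals fields, the 2x "elevated" threshold, and the mapping from signal combinations to suspected causes are one plausible heuristic, not an established diagnostic rule.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """A snapshot of correlated cache signals (hypothetical field names)."""
    stale_serve_rate: float
    miss_rate: float
    eviction_rate_per_s: float
    invalidation_lag_s: float
    origin_fallback_qps: float


def triage(current: Signals, baseline: Signals, factor: float = 2.0) -> list[str]:
    """Suggest likely problem classes from which signals rose together."""
    def elevated(name: str) -> bool:
        # Assumption: "elevated" means more than `factor` times the baseline.
        return getattr(current, name) > factor * getattr(baseline, name)

    suspects = []
    # Stale serves rising together with invalidation lag points at the pipeline.
    if elevated("stale_serve_rate") and elevated("invalidation_lag_s"):
        suspects.append("invalidation")
    # Eviction churn plus a miss burst plus origin load points at capacity.
    if (elevated("eviction_rate_per_s") and elevated("miss_rate")
            and elevated("origin_fallback_qps")):
        suspects.append("capacity")
    # Stale serves rising while invalidation looks healthy points at freshness policy.
    if elevated("stale_serve_rate") and not elevated("invalidation_lag_s"):
        suspects.append("freshness")
    return suspects or ["inconclusive"]


baseline = Signals(0.01, 0.10, 5.0, 2.0, 100.0)
capacity_case = triage(Signals(0.01, 0.35, 40.0, 2.0, 400.0), baseline)
invalidation_case = triage(Signals(0.08, 0.10, 5.0, 30.0, 100.0), baseline)
freshness_case = triage(Signals(0.08, 0.10, 5.0, 2.0, 100.0), baseline)
print(capacity_case, invalidation_case, freshness_case)
```

A helper like this is a starting point for an on-call runbook, not a replacement for looking at the underlying graphs.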

Revised on Wednesday, June 3, 2026
