Cache metrics that explain correctness, load reduction, latency, and incident risk rather than only hit rate.
Cache observability starts by measuring more than hit rate. Hit rate is useful, but it is only one projection of behavior. A cache can have a good hit rate and still serve stale answers, overload the origin during miss bursts, hide hot-key concentration, or evict the wrong data under pressure.
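A small illustration of why an aggregate hit rate can mislead. This is a sketch over a made-up trace, not a real workload: the event shape `(timestamp, is_hit)`, the 100 requests per second, and the half-second miss burst are all assumptions chosen to make the arithmetic easy to follow.

```python
from collections import defaultdict

def hit_rate(events):
    """Fraction of (timestamp, is_hit) events that were cache hits."""
    hits = sum(1 for _, is_hit in events if is_hit)
    return hits / len(events)

def misses_per_second(events):
    """Count misses in each one-second bucket."""
    buckets = defaultdict(int)
    for timestamp, is_hit in events:
        if not is_hit:
            buckets[int(timestamp)] += 1
    return dict(buckets)

# Hypothetical trace: 10 seconds at 100 requests per second.
# Every request hits, except the first 50 requests of second 5
# (for example, a hot key expiring and all callers missing at once).
events = []
for second in range(10):
    for i in range(100):
        is_hit = not (second == 5 and i < 50)
        events.append((second + i / 100, is_hit))

overall = hit_rate(events)
per_second = misses_per_second(events)
peak_second, peak_misses = max(per_second.items(), key=lambda kv: kv[1])

print(f"overall hit rate: {overall:.2%}")                      # 95.00%
print(f"peak misses: {peak_misses} in second {peak_second}")   # 50 in second 5
```

The aggregate looks healthy at 95%, yet the origin absorbed 50 extra requests inside a single second. A per-window miss count surfaces the burst that the average smooths away.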
The purpose of measurement is not to admire the cache. It is to answer operational questions:
flowchart TD
A["Requests"] --> B["Hit / miss metrics"]
A --> C["Latency metrics"]
D["Invalidation pipeline"] --> E["Purge success / lag"]
F["Cache storage"] --> G["Memory / cardinality / eviction"]
B --> H["Operational picture"]
C --> H
E --> H
G --> H
The wrong metrics create false confidence. A cache dashboard that shows only hit rate may hide:
Good measurement ties the cache back to product behavior, origin protection, and freshness promises.
Most serious cache dashboards include several categories:
cache_metrics:
  request:
    - hit_rate
    - miss_rate
    - stale_serve_rate
  latency:
    - cache_lookup_p95_ms
    - origin_fallback_p95_ms
    - end_to_end_p95_ms
  storage:
    - memory_used_bytes
    - eviction_rate
    - key_cardinality
  invalidation:
    - purge_success_rate
    - purge_lag_seconds
    - event_backlog
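As a rough sketch of how the request and latency entries above might be derived from raw request records, the example below computes three rates and one p95. The `RequestRecord` shape, the three mutually exclusive outcomes, and the nearest-rank percentile are assumptions made for illustration; a production system would more likely rely on its metrics library's counters and histograms.

```python
import math
from dataclasses import dataclass

@dataclass
class RequestRecord:
    outcome: str       # assumed to be one of "hit", "miss", "stale"
    latency_ms: float  # end-to-end latency of the request

def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

def rate(records, outcome):
    """Fraction of records with the given outcome."""
    return sum(1 for r in records if r.outcome == outcome) / len(records)

def summarize(records):
    return {
        "hit_rate": rate(records, "hit"),
        "miss_rate": rate(records, "miss"),
        "stale_serve_rate": rate(records, "stale"),
        "end_to_end_p95_ms": percentile([r.latency_ms for r in records], 95),
    }

# Hypothetical sample: 16 fast hits, 1 stale serve, 3 slow misses.
records = (
    [RequestRecord("hit", 2.0) for _ in range(16)]
    + [RequestRecord("stale", 3.0)]
    + [RequestRecord("miss", ms) for ms in (40.0, 50.0, 60.0)]
)

summary = summarize(records)
print(summary)
```

Here the hit rate is 80%, but the p95 latency is 50 ms because the slowest requests are all misses. Reading the two numbers together says more than either one alone.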
Not all misses are equally important. A miss on a low-volume cold key may not matter. A miss on a hot homepage key during peak load matters a great deal. The same is true for stale serves. A bounded stale serve during stale-while-revalidate may be intentional and healthy. A stale serve after a sensitive write may be a bug.
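One way to make that distinction measurable is to attribute misses to keys and to separate bounded stale serves from the rest. The sketch below is illustrative only: the log tuple shape, the key names, and the 30-second stale-while-revalidate budget are assumptions, and using staleness age as the test is a simplification (detecting a stale read after a sensitive write would need write timestamps as well).

```python
from collections import Counter

SWR_WINDOW_S = 30  # assumed stale-while-revalidate budget, in seconds

def classify_stale(stale_age_s, swr_window_s=SWR_WINDOW_S):
    """Label a stale serve by whether it stayed inside the allowed window."""
    return "bounded" if stale_age_s <= swr_window_s else "unbounded"

def breakdown(log):
    """Split a log of (key, outcome, stale_age_s) into per-key misses
    and stale serves grouped by kind."""
    misses_by_key = Counter()
    stale_by_kind = Counter()
    for key, outcome, stale_age_s in log:
        if outcome == "miss":
            misses_by_key[key] += 1
        elif outcome == "stale":
            stale_by_kind[classify_stale(stale_age_s)] += 1
    return misses_by_key, stale_by_kind

# Hypothetical log: stale_age_s is None unless the outcome is "stale".
log = [
    ("home", "miss", None),
    ("home", "miss", None),
    ("home", "miss", None),
    ("user:42:settings", "miss", None),
    ("home", "stale", 5),
    ("home", "stale", 120),
    ("home", "hit", None),
]

misses_by_key, stale_by_kind = breakdown(log)
print(misses_by_key.most_common(1))  # [('home', 3)]
print(dict(stale_by_kind))           # {'bounded': 1, 'unbounded': 1}
```

A flat miss count of 4 and a flat stale count of 2 would hide both facts that matter here: most misses land on one hot key, and one of the stale serves exceeded its budget.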
That is why effective observability usually breaks metrics down by:
This sample dashboard model shows how teams often group cache metrics for action rather than for vanity.
dashboards:
  cache_health:
    panels:
      - hit_rate_by_layer
      - origin_fallback_qps
      - stale_serve_rate_by_endpoint
      - top_20_hot_keys
      - eviction_rate_by_node
      - invalidation_lag_seconds
What to notice:
How would you tell whether a cache incident is a freshness problem, a capacity problem, or an invalidation problem?
The stronger answer is that the team needs correlated signals: stale-serve behavior, miss bursts, origin fallback volume, eviction churn, and invalidation lag. A single metric rarely reveals the cause. The diagnosis comes from how those signals move together.
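The idea of reading signals together can be sketched as a small rule-based triage function. Everything specific here is hypothetical: the `Signals` fields, the "more than twice the baseline" test, and the mapping from signal combinations to causes are assumptions meant to show the shape of the reasoning, not a validated diagnostic.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    stale_serve_rate: float     # fraction of responses served stale
    eviction_rate: float        # evictions per second
    origin_fallback_qps: float  # requests per second reaching the origin
    invalidation_lag_s: float   # seconds between a write and its purge

def diagnose(current, baseline, factor=2.0):
    """Guess the incident type from which signals rose together.

    A signal counts as elevated when it exceeds `factor` times its
    baseline value (an assumed threshold, chosen for illustration).
    """
    def elevated(name):
        return getattr(current, name) > factor * getattr(baseline, name)

    if elevated("invalidation_lag_s") and elevated("stale_serve_rate"):
        return "invalidation"  # purges are late and stale data is visible
    if elevated("eviction_rate") and elevated("origin_fallback_qps"):
        return "capacity"      # churn is pushing traffic to the origin
    if elevated("stale_serve_rate"):
        return "freshness"     # stale serves without purge lag
    return "inconclusive"

baseline = Signals(0.01, 10.0, 100.0, 1.0)

print(diagnose(Signals(0.08, 11.0, 110.0, 45.0), baseline))  # invalidation
print(diagnose(Signals(0.01, 400.0, 900.0, 1.0), baseline))  # capacity
print(diagnose(Signals(0.06, 10.0, 100.0, 1.0), baseline))   # freshness
```

Each branch requires at least one signal in context: stale serves alone point one way, while stale serves together with purge lag point another. That mirrors the point above that a single metric rarely reveals the cause.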