Runbooks, Replay, and Recovery

March 26, 2026

Preparing cache incident runbooks, replaying invalidations, and recovering safely from cold starts or partial failure.

Cache runbooks and recovery plans matter because cache incidents rarely end when the faulty node restarts. Recovery changes traffic shape, reintroduces cold misses, and can accidentally restore stale state if replay and refill logic are poorly sequenced. A cache that returns quickly after failure is not necessarily a cache that recovers safely.

Operational maturity means the team has already decided what to do for the most likely bad days:

broad purge gone wrong
invalidation stream lag or outage
cache cluster restart or cold region failover
stale values reappearing during refill
need for temporary cache bypass

    flowchart TD
        A["Cache incident detected"] --> B["Classify failure mode"]
        B --> C["Protect origin capacity"]
        B --> D["Decide stale vs bypass policy"]
        C --> E["Replay invalidations or warm critical keys"]
        D --> E
        E --> F["Verify freshness and hit recovery"]

Why It Matters

Without runbooks, teams improvise under pressure. Improvisation usually defaults to extreme actions: flush everything, disable all caching, or rebuild blindly. Those actions can be necessary, but only if the team understands how they affect origin load, user-visible staleness, and replay order.

Recovery planning should answer:

What can safely be served stale during origin distress?
Which key families must be warmed first after a cold start?
How are invalidations replayed or reconciled after lag?
When is cache bypass safer than incomplete recovery?
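One way to make the first question concrete is a stale-if-error read path: serve a fresh entry when one exists, try the origin otherwise, and fall back to an expired entry only while it is still inside an agreed grace window. The following is a minimal sketch under stated assumptions, not a reference implementation. The names `get_with_stale_if_error`, `OriginUnavailable`, `ttl`, and `stale_grace` are hypothetical, and the cache is modelled as a plain dictionary of `(value, stored_at)` pairs.

```python
import time


class OriginUnavailable(Exception):
    """Hypothetical error raised by the fetch function when the origin is in distress."""


def get_with_stale_if_error(cache, key, fetch, ttl, stale_grace, now=None):
    """Return (value, status), where status is 'fresh', 'refreshed', or 'stale'.

    cache maps key -> (value, stored_at); this layout is an assumption for the sketch.
    """
    now = time.monotonic() if now is None else now
    hit = cache.get(key)
    if hit is not None and now - hit[1] <= ttl:
        return hit[0], "fresh"
    try:
        value = fetch(key)
    except OriginUnavailable:
        # Serve the expired entry only while it is inside the grace window.
        if hit is not None and now - hit[1] <= ttl + stale_grace:
            return hit[0], "stale"
        raise
    cache[key] = (value, now)
    return value, "refreshed"
```

The grace window is the policy decision the runbook should record per key family: public content might tolerate minutes of staleness, while pricing or entitlement data might tolerate none.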

What a Useful Runbook Contains

A cache runbook should usually include:

failure mode identification cues
immediate origin-protection steps
stale-serving or bypass rules
invalidation replay or reconciliation steps
warmup order for critical keys and regions
validation checks before declaring recovery complete

    runbook:
      incident: invalidation-stream-lag
      immediate_actions:
        - cap_origin_concurrency
        - enable_stale_if_error_for_public_content
        - alert_on_purge_backlog_growth
      recovery:
        - replay_events_from_sequence: 9013
        - warm_keys:
            - homepage:top-products
            - category:laptops
            - pricing:public-plans
      verify:
        - purge_backlog_zero
        - origin_qps_within_baseline
        - freshness_probe_passed
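The `verify` entries in the runbook above can be treated as a gate that must pass before the incident is closed. The sketch below shows one possible shape for that gate. The function name `recovery_complete`, the layout of the `metrics` dictionary, and the 10% tolerance around baseline are all assumptions made for illustration; the runbook itself does not define them.

```python
def recovery_complete(metrics, baseline_qps, tolerance=0.10):
    """Evaluate the runbook's verify checks and report which ones failed.

    metrics is a hypothetical snapshot with the keys 'purge_backlog',
    'origin_qps', and 'freshness_probe_passed'.
    """
    checks = {
        "purge_backlog_zero": metrics["purge_backlog"] == 0,
        "origin_qps_within_baseline": metrics["origin_qps"] <= baseline_qps * (1 + tolerance),
        "freshness_probe_passed": bool(metrics["freshness_probe_passed"]),
    }
    failed = [name for name, ok in checks.items() if not ok]
    return not failed, failed
```

Returning the names of the failed checks, rather than a bare boolean, keeps the decision to close the incident tied to cache behavior instead of node health.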

Replay and Recovery

Replay is especially dangerous when ordering matters. If old invalidations or stale refill jobs run after newer truth has already been established, the recovery path can reintroduce bad state. That is why replay often needs:

sequence numbers or versions
bounded time windows
key-family scoping
validation probes after replay, not only before it
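The first three safeguards can be sketched together in a few lines. This is an illustrative sketch, assuming every invalidation event carries a monotonically increasing sequence number and every cache entry records the sequence number of the latest source change it reflects. The types `Invalidation` and `Entry` and the function `replay` are hypothetical names, and the cache is modelled as a plain dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invalidation:
    seq: int  # assumed monotonically increasing per source change
    key: str


@dataclass
class Entry:
    value: str
    seq: int  # latest source change this cached value reflects (assumption)


def replay(cache, events, from_seq, key_prefix):
    """Replay invalidations in sequence order and return the evicted keys."""
    evicted = []
    for ev in sorted(events, key=lambda e: e.seq):
        if ev.seq < from_seq or not ev.key.startswith(key_prefix):
            continue  # outside the bounded window or the scoped key family
        entry = cache.get(ev.key)
        if entry is not None and entry.seq >= ev.seq:
            continue  # entry already reflects this change; keep the newer truth
        if cache.pop(ev.key, None) is not None:
            evicted.append(ev.key)
    return evicted
```

The sequence comparison is what prevents an old invalidation from evicting a value that was refilled after newer truth had been established. A freshness probe should still run after the replay finishes.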

Warmup should also be selective. Warming the full cache is often wasteful or too expensive. The priority is usually the small set of hot keys and business-critical surfaces that shape user experience and origin stability.
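Selective warmup can be sketched as a loop over a priority-ordered key list with a hard budget on origin loads. The names `warm_critical_keys`, `load_from_origin`, and `budget` are hypothetical, and the sketch assumes the priority list is maintained ahead of time, for example in the runbook.

```python
def warm_critical_keys(cache, load_from_origin, prioritized_keys, budget):
    """Warm at most `budget` missing keys, highest priority first.

    Returns the keys that were actually loaded from the origin.
    """
    warmed = []
    for key in prioritized_keys:
        if len(warmed) >= budget:
            break  # stop before warmup itself becomes an origin load problem
        if key in cache:
            continue  # already present; do not spend budget on it
        cache[key] = load_from_origin(key)
        warmed.append(key)
    return warmed
```

The budget is the part that protects the origin: it caps how many loads a single warmup pass can issue, regardless of how long the priority list grows.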

Common Mistakes

assuming a restarted cache is “fixed” before verifying freshness and origin pressure
replaying old invalidations or refill jobs without ordering safeguards
warming everything instead of warming the critical working set
declaring recovery complete based on node health rather than cache behavior

Design Review Question

What makes a cache recovery plan safe rather than merely fast?

The stronger answer is that a safe recovery plan protects the origin, preserves ordering during replay, warms the highest-value keys first, and verifies both freshness and load behavior before the incident is considered over. Fast recovery without those checks can reintroduce stale data or cause a secondary overload incident.

Revised on Wednesday, June 3, 2026
