Cache Coherence and Coordination

Propagation lag, invalidation coordination, and the limits of coherence across distributed cache nodes.

Distributed cache coherence is the problem of keeping several cache nodes, layers, or replicas meaningfully aligned after source data changes. In a distributed system, not every node observes invalidation or refill at the same time. A write in one place may race with reads elsewhere. That means coherence is almost always partial, delayed, or bounded rather than perfect.

The important design question is not “can we keep every cache identical at all times?” That is rarely practical. The real question is what divergence is acceptable, how updates propagate, and what behavior the system chooses while nodes are temporarily out of sync.

    flowchart TD
        A["Source write"] --> B["Invalidation event"]
        B --> C["Cache node A"]
        B --> D["Cache node B"]
        B --> E["Cache node C"]
        C --> F["Reads see new state sooner"]
        D --> G["Reads lag briefly"]
        E --> H["Refresh delayed or missed"]

Why It Matters

Once several nodes serve the same logical data, coherence lag becomes part of correctness. Users may receive different answers depending on which node they hit. Operators may see bugs that reproduce only under specific traffic distribution. The cache stops being one object and becomes a fleet behavior problem.

Coherence work matters most when:

  • the same key is served by several nodes or regions
  • writes must become visible quickly to many readers
  • invalidation travels asynchronously
  • node-local refill logic can race with distributed events

Coordination Strategies

Distributed systems usually use one or more of these approaches:

  • broadcast invalidation events to all interested nodes
  • use versioned keys so nodes converge by reading newer generations
  • route hot or mutable keys through a smaller coordination surface
  • accept bounded incoherence and rely on TTL plus version checks as a safety net

Pure synchronous coherence is rare because it couples cache availability to the coordination path. Many systems instead choose “fast convergence with bounded drift.”
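The versioned-key strategy above can be sketched in a few lines. This is a hypothetical illustration, not a specific library's API: each node keeps a (version, value) pair per key and applies an update only if its version is newer, so nodes converge regardless of delivery order.

```python
import threading

class VersionedCacheNode:
    """Sketch of a node that converges by keeping only the newest version.

    Hypothetical illustration; method names like `apply_update` are
    assumptions of this sketch, not a real cache API.
    """

    def __init__(self):
        self._entries = {}           # key -> (version, value)
        self._lock = threading.Lock()

    def apply_update(self, key, version, value):
        """Accept an update only if its version is newer than what we hold.

        Duplicate or out-of-order deliveries are dropped, so nodes end up
        in the same state no matter what order events arrive in.
        """
        with self._lock:
            current = self._entries.get(key)
            if current is not None and current[0] >= version:
                return False         # stale or duplicate: ignore
            self._entries[key] = (version, value)
            return True

    def get(self, key):
        entry = self._entries.get(key)
        return entry[1] if entry else None

# Two nodes receive the same updates in different orders yet converge.
a, b = VersionedCacheNode(), VersionedCacheNode()
a.apply_update("product:42", 17, "old")
a.apply_update("product:42", 18, "new")
b.apply_update("product:42", 18, "new")
b.apply_update("product:42", 17, "old")   # late delivery, rejected
```

Note that no node ever needs to coordinate with another node directly; ordering metadata alone is enough for convergence.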

Example

This coordination model shows how a write can carry a monotonic version into the invalidation event it triggers.

    cache_invalidation_event:
      key_family: product:42
      new_version: 18
      action: invalidate_or_reject_older_writes
      propagation:
        transport: event-bus
        delivery: at-least-once
What to notice:

  • version information helps nodes reason about ordering even if delivery is delayed
  • the event does not need to make all nodes identical instantly to still be useful
  • the propagation contract is part of the cache design, not just transport plumbing
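Because delivery is at-least-once, a consuming node must handle duplicates and reordering. A minimal sketch of an idempotent consumer, assuming the event shape above (a key family plus a monotonic `new_version`; the method names here are hypothetical):

```python
class InvalidatingNode:
    """Sketch of a node consuming at-least-once invalidation events."""

    def __init__(self):
        self._values = {}      # key family -> cached value
        self._versions = {}    # key family -> highest version seen

    def handle_invalidation(self, key_family, new_version):
        """Idempotent: duplicate or reordered deliveries are harmless."""
        seen = self._versions.get(key_family, -1)
        if new_version <= seen:
            return                               # duplicate or stale event
        self._versions[key_family] = new_version
        self._values.pop(key_family, None)       # evict; next read refills

    def refill(self, key_family, version, value):
        """Reject refills older than the newest invalidation we know of."""
        if version < self._versions.get(key_family, -1):
            return False
        self._values[key_family] = value
        self._versions[key_family] = version
        return True

node = InvalidatingNode()
node.refill("product:42", 17, "old")
node.handle_invalidation("product:42", 18)
node.handle_invalidation("product:42", 18)   # duplicate delivery: no-op
```

After the invalidation, a refill carrying version 17 is rejected while one carrying version 18 is accepted, which is exactly the `invalidate_or_reject_older_writes` behavior named in the event.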

Trade-Offs

Coherence costs grow with distribution.

  • Faster propagation reduces drift but increases coordination coupling.
  • Node-local autonomy reduces coupling but increases temporary divergence.
  • Broadcast invalidation is simple conceptually but may become noisy at scale.
  • Versioned convergence is robust but requires ordering metadata and dead-entry cleanup.

The right strategy depends on how visible and expensive incoherence is for the workload.
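The TTL-plus-version safety net mentioned earlier is what turns "temporary divergence" into "bounded drift": even if an invalidation is lost, staleness is capped by the TTL. A minimal sketch, where `fetch_fn` stands in for a read against the source of truth (an assumption of this sketch, not a real cache API):

```python
import time

class BoundedDriftCache:
    """Sketch: TTL caps staleness even when an invalidation event is missed."""

    def __init__(self, fetch_fn, ttl_seconds=30.0, clock=time.monotonic):
        self._fetch = fetch_fn       # hypothetical: returns (version, value)
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}           # key -> (version, value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry and self._clock() < entry[2]:
            return entry[1]                       # fresh enough: serve locally
        version, value = self._fetch(key)         # expired or miss: re-read
        self._entries[key] = (version, value, self._clock() + self._ttl)
        return value

# Simulated clock so the example is deterministic.
now = [0.0]
source = {"k": (1, "v1")}
cache = BoundedDriftCache(lambda k: source[k], ttl_seconds=10.0,
                          clock=lambda: now[0])
first = cache.get("k")     # miss: reads from source
source["k"] = (2, "v2")    # source write; invalidation event is lost
stale = cache.get("k")     # within TTL: bounded staleness
now[0] = 11.0
fresh = cache.get("k")     # TTL expired: forced re-read converges
```

The TTL here is the drift bound: a shorter TTL tightens the guarantee but sends more traffic to the source, which is the same coupling trade-off described above.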

Common Mistakes

  • treating asynchronous invalidation as if it were immediate and reliable everywhere
  • assuming local locks solve distributed coherence problems
  • omitting version metadata from invalidation events
  • designing for global exactness when the workload only needs bounded freshness

Design Review Question

When is “fast convergence with bounded drift” a better goal than exact distributed coherence?

The stronger answer is that fast convergence is often the right goal when reads are numerous, writes are frequent enough to make strict coordination expensive, and short-lived divergence is acceptable. Exact coherence is worth more only when stale cross-node answers would be materially harmful.

Revised on Thursday, April 23, 2026