Sagas

A practical lesson on sagas as long-running distributed business processes built from local transactions plus coordinated recovery paths.

Sagas model a long-running distributed business process as a sequence of local transactions plus explicit failure recovery logic. Instead of trying to make several services participate in one distributed atomic transaction, each service commits its own local work. If a later step fails, the system runs compensation or correction logic for the earlier completed steps.

This is one of the most important mindset shifts in event-driven architecture. The system stops pretending it can roll back everything mechanically. It accepts that business progress across services happens in stages, and that recovery often means semantic correction, not technical undo.

    flowchart LR
        A["Reserve inventory"] --> B["Authorize payment"]
        B --> C["Create shipment"]
        C --> D["Workflow complete"]
        B -. failure .-> E["Release inventory"]
        C -. failure .-> F["Void or refund payment"]

What to notice:

  • each step is a local committed action
  • failure later in the process does not erase earlier commits automatically
  • the saga needs explicit forward progress and recovery paths

Sagas Are About Process Recovery, Not Just Control Style

Teams sometimes confuse the saga pattern with choreography or orchestration. A saga can use either control style. The core idea is not whether a coordinator exists; it is that the workflow is built from local transactions and that recovery is explicit.

This matters because the architectural problem is not only “who sends the next message?” It is “what happens after two successful steps if step three cannot complete?” Sagas answer that by modeling business recovery paths directly.
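To illustrate that a recovery path does not require a coordinator, here is a minimal choreographed sketch. It assumes an in-memory `Bus` (a hypothetical stand-in for a real message broker) in which each handler returns the events it emits. The event names `order_placed`, `payment_failed`, and `inventory_released` are invented for illustration.

```python
from collections import defaultdict, deque


class Bus:
    """Toy in-memory event bus; handlers return the events they emit."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.history = []

    def on(self, event, handler):
        self.handlers[event].append(handler)

    def publish(self, event):
        queue = deque([event])
        while queue:
            current = queue.popleft()
            self.history.append(current)
            for handler in self.handlers[current]:
                queue.extend(handler())


def wire_order_saga(bus, payment_ok=True):
    # Each service reacts to the previous event; no central coordinator.
    bus.on("order_placed", lambda: ["inventory_reserved"])
    bus.on(
        "inventory_reserved",
        lambda: ["payment_authorized"] if payment_ok else ["payment_failed"],
    )
    bus.on("payment_authorized", lambda: ["shipment_created"])
    # Recovery is still explicit: the inventory service listens for the
    # failure event and releases its earlier reservation.
    bus.on("payment_failed", lambda: ["inventory_released"])
```

With `payment_ok=False`, publishing `order_placed` produces the history `order_placed`, `inventory_reserved`, `payment_failed`, `inventory_released`. The compensation is part of the design even though no component owns the whole workflow.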

Why Sagas Exist

Sagas are useful when:

  • the process spans multiple services
  • each service owns its own data and local transaction boundary
  • one all-or-nothing distributed commit is impractical or undesirable
  • business completion depends on several steps over time

Typical examples are order fulfillment, flows that combine payment, inventory, and shipping, booking systems, and multi-service approval or provisioning workflows.

Local Transactions Need Durable Workflow Meaning

Each saga step should represent a real local business commitment. That means:

  • the local work is durable
  • the emitted event or reply reflects a meaningful state change
  • the next step knows what prior commitments exist

    saga:
      name: order-fulfillment
      steps:
        - name: reserve-inventory
          onSuccess: inventory_reserved
        - name: authorize-payment
          onSuccess: payment_authorized
        - name: create-shipment
          onSuccess: shipment_created

This kind of model is useful because it names each step and its outcome in business terms, rather than describing the workflow only as a chain of technical calls and retries.
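For the next step to know what prior commitments exist, those commitments have to be recorded somewhere durable. A minimal sketch, assuming a single SQLite table as the saga log: the table name `saga_log`, the helper functions, and the payload fields are all hypothetical, and a real service might instead keep this record in its own database next to the business data.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open the (hypothetical) saga log; use a file path for real durability."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS saga_log ("
        " saga_id TEXT, seq INTEGER, step TEXT, event TEXT, payload TEXT,"
        " PRIMARY KEY (saga_id, seq))"
    )
    return db


def record_commit(db, saga_id, step, event, payload):
    """Append one committed step and the event it emitted."""
    with db:  # commits on success, rolls back if an exception is raised
        (next_seq,) = db.execute(
            "SELECT COALESCE(MAX(seq), 0) + 1 FROM saga_log WHERE saga_id = ?",
            (saga_id,),
        ).fetchone()
        db.execute(
            "INSERT INTO saga_log VALUES (?, ?, ?, ?, ?)",
            (saga_id, next_seq, step, event, json.dumps(payload)),
        )


def prior_commitments(db, saga_id):
    """Return the committed steps for one saga, oldest first."""
    rows = db.execute(
        "SELECT step, event, payload FROM saga_log WHERE saga_id = ? ORDER BY seq",
        (saga_id,),
    )
    return [(step, event, json.loads(payload)) for step, event, payload in rows]
```

Before creating a shipment, a service could call `prior_commitments(db, "order-42")` and see that inventory was reserved and payment authorized. That is the same information a compensation path would need later.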

Sagas Need More Than Happy-Path Steps

A saga definition that only lists forward steps is incomplete. Real saga design also asks:

  • what failure types are expected at each step
  • which earlier steps need compensation
  • whether compensation is immediate or deferred
  • whether some failures should pause for human review
  • whether the workflow can end in a partial-but-acceptable state

This is why saga modeling is not just a way to string services together. It is a way to state what the business process actually guarantees when any step can fail.
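Those questions can be captured as data instead of being left implicit in handler code. The sketch below is one possible shape, with assumptions throughout: the failure types (`timeout`, `declined`, `address_invalid`, `carrier_delay`), the `Recovery` options, and the choice to pause for review on any unlisted failure are illustrative, not a prescribed policy.

```python
from enum import Enum


class Recovery(Enum):
    RETRY = "retry"
    COMPENSATE = "compensate"
    PAUSE_FOR_REVIEW = "pause_for_review"
    ACCEPT_PARTIAL = "accept_partial"


# Hypothetical failure table: (failed step, failure type) -> decision.
FAILURE_POLICY = {
    ("authorize-payment", "timeout"): Recovery.RETRY,
    ("authorize-payment", "declined"): Recovery.COMPENSATE,
    ("create-shipment", "address_invalid"): Recovery.PAUSE_FOR_REVIEW,
    ("create-shipment", "carrier_delay"): Recovery.ACCEPT_PARTIAL,
}

# Hypothetical compensation map: completed step -> correcting action.
COMPENSATIONS = {
    "reserve-inventory": "release-inventory",
    "authorize-payment": "void-or-refund-payment",
}


def plan_recovery(failed_step, failure_type, completed_steps):
    """Return the decision and the compensations to run, newest-first."""
    # Unknown failures default to human review rather than automatic reversal.
    decision = FAILURE_POLICY.get(
        (failed_step, failure_type), Recovery.PAUSE_FOR_REVIEW
    )
    if decision is Recovery.COMPENSATE:
        actions = [
            COMPENSATIONS[s] for s in reversed(completed_steps) if s in COMPENSATIONS
        ]
    else:
        actions = []
    return decision, actions
```

For example, `plan_recovery("authorize-payment", "declined", ["reserve-inventory"])` yields the compensate decision with `release-inventory` as the only action. An invalid shipping address pauses the workflow without reversing anything.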

Observability Matters

Because sagas run over time, often across services and asynchronous boundaries, operators need visibility into:

  • current workflow state
  • completed steps
  • pending or failed steps
  • compensation status
  • correlation across emitted events and replies

Without that, teams may know that “something failed in order processing” but not which step committed and which correction path is now active.
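A rough sketch of how such a view could be derived: fold the events for each saga, grouped by a correlation identifier, into a per-workflow status record. The event shape (`saga_id`, `kind`, `step`) and the event kinds are assumptions made for this example, not a standard schema.

```python
from dataclasses import dataclass, field

# Hypothetical terminal event kinds and the state each one implies.
TERMINAL_STATES = {"saga_completed": "completed", "saga_compensated": "compensated"}


@dataclass
class SagaView:
    saga_id: str
    state: str = "running"
    completed: list = field(default_factory=list)
    failed: list = field(default_factory=list)
    compensated: list = field(default_factory=list)

    def awaiting_compensation(self) -> list:
        """Completed steps not yet corrected while recovery is in progress."""
        # Assumes every completed step needs a correction; a real view would
        # skip steps that have no compensation defined.
        if self.state != "compensating":
            return []
        return [s for s in self.completed if s not in self.compensated]


def build_views(events: list) -> dict:
    """Group events by saga_id (the correlation key) into status views."""
    views = {}
    for event in events:
        view = views.setdefault(event["saga_id"], SagaView(event["saga_id"]))
        kind, step = event["kind"], event.get("step")
        if kind == "step_completed":
            view.completed.append(step)
        elif kind == "step_failed":
            view.failed.append(step)
            view.state = "compensating"
        elif kind == "step_compensated":
            view.compensated.append(step)
        elif kind in TERMINAL_STATES:
            view.state = TERMINAL_STATES[kind]
    return views
```

An operator looking at such a view can tell which step failed, which commits still stand, and which corrections are outstanding, instead of only knowing that order processing failed somewhere.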

Common Mistakes

  • describing only the happy path and calling it a saga
  • assuming orchestration alone makes a workflow a saga
  • using sagas where a simpler local transaction plus event is enough
  • failing to define compensation or correction behavior per completed step
  • forgetting that some failures require pause-and-review rather than automatic reversal

Design Review Question

A team models an order flow as a saga but only documents reserve inventory, charge card, and create shipment. There is no failure table, compensation map, or operator state view. What is the strongest critique?

The strongest critique is that the team has documented a happy path, not a full saga. A saga is defined as much by its recovery behavior as by its forward steps. Without explicit failure and compensation design, the system still lacks a trustworthy distributed process model.

Revised on Thursday, April 23, 2026