A practical lesson on sagas as long-running distributed business processes built from local transactions plus coordinated recovery paths.
Sagas model a long-running distributed business process as a sequence of local transactions plus explicit failure recovery logic. Instead of trying to make several services participate in one distributed atomic transaction, each service commits its own local work. If a later step fails, the system runs compensation or correction logic for the earlier completed steps.
This is one of the most important mindset shifts in event-driven architecture. The system stops pretending it can roll back everything mechanically. It accepts that business progress across services happens in stages, and that recovery often means semantic correction, not technical undo.
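The difference between semantic correction and technical undo can be sketched with a small append-only ledger. This is a minimal illustration, not a real payment API: the `PaymentLedger` class, its method names, and the order ID are all hypothetical. The point is that the refund is recorded as a new fact that offsets the charge, rather than deleting a charge that has already committed.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentLedger:
    """Hypothetical append-only ledger: committed entries are never deleted."""
    entries: list[tuple[str, str, int]] = field(default_factory=list)

    def charge(self, order_id: str, cents: int) -> None:
        # Forward step: a local transaction that commits immediately.
        self.entries.append(("charge", order_id, cents))

    def refund(self, order_id: str) -> None:
        # Semantic correction: add an offsetting entry instead of erasing
        # the charge. Refunds only what is still outstanding, so calling
        # this twice does not refund twice.
        charged = sum(c for kind, oid, c in self.entries
                      if kind == "charge" and oid == order_id)
        refunded = sum(c for kind, oid, c in self.entries
                       if kind == "refund" and oid == order_id)
        outstanding = charged - refunded
        if outstanding > 0:
            self.entries.append(("refund", order_id, outstanding))

    def balance(self, order_id: str) -> int:
        # Net amount still charged to the order.
        return sum(c if kind == "charge" else -c
                   for kind, oid, c in self.entries if oid == order_id)


ledger = PaymentLedger()
ledger.charge("order-1", 2500)
ledger.refund("order-1")
ledger.refund("order-1")  # repeated correction is a no-op
```

After these calls the ledger holds two entries, the charge and one refund, and the balance for the order is zero. The history of what happened stays visible, which is usually what the business needs.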
flowchart LR
A["Reserve inventory"] --> B["Authorize payment"]
B --> C["Create shipment"]
C --> D["Workflow complete"]
B -. failure .-> E["Release inventory"]
C -. failure .-> F["Void or refund payment"]
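What to notice: each forward step commits locally before the next one begins, and the dashed edges are recovery paths. If payment authorization fails, the reserved inventory is released. If shipment creation fails, the payment is voided or refunded.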
Teams sometimes confuse saga with choreography or orchestration. A saga can use either control style. The core idea is not whether there is a coordinator. It is that the workflow is built from local transactions and that recovery is explicit.
This matters because the architectural problem is not only “who sends the next message?” It is “what happens after two successful steps if step three cannot complete?” Sagas answer that by modeling business recovery paths directly.
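The "two successful steps, then step three fails" case can be sketched as a small in-process runner. This is an illustration under stated assumptions, not a production saga engine: `run_saga`, the `Step` tuple shape, and the in-memory `state` dictionary are hypothetical, and the sketch uses a single coordinating function only for brevity. It also assumes compensations always succeed, which real systems cannot take for granted.

```python
from typing import Callable

# (step name, forward action, compensation to run once the step has committed)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate committed steps in reverse."""
    log: list[str] = []
    committed: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # The failed step never committed, so only earlier steps
            # are compensated, newest first.
            for done_name, _, compensate in reversed(committed):
                compensate()
                log.append(f"compensated:{done_name}")
            return log
        committed.append(step)
        log.append(f"committed:{name}")
    log.append("complete")
    return log


# Hypothetical in-memory stand-ins for three separate services.
state = {"inventory": "available", "payment": "none"}


def reserve_inventory() -> None:
    state["inventory"] = "reserved"


def release_inventory() -> None:
    state["inventory"] = "released"


def authorize_payment() -> None:
    state["payment"] = "authorized"


def void_payment() -> None:
    state["payment"] = "voided"


def create_shipment() -> None:
    raise RuntimeError("carrier unavailable")  # simulated failure


def no_compensation() -> None:
    pass


result = run_saga([
    ("reserve-inventory", reserve_inventory, release_inventory),
    ("authorize-payment", authorize_payment, void_payment),
    ("create-shipment", create_shipment, no_compensation),
])
```

Here `result` records two committed steps, the failure of `create-shipment`, and then compensation of the payment followed by the inventory. Neither earlier commit is rolled back mechanically. Each one is corrected by its own explicit recovery action.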
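Sagas are useful when a business process spans several services, each service commits its own local work, and a single distributed atomic transaction is not practical. This often applies to order fulfillment, combined payment, inventory, and shipping flows, booking systems, and many multi-service approval or provisioning workflows.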
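Each saga step should represent a real local business commitment: work that one service commits in its own local transaction and that can be named by its business outcome. A forward-only definition might look like this: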
saga:
  name: order-fulfillment
  steps:
    - name: reserve-inventory
      onSuccess: inventory_reserved
    - name: authorize-payment
      onSuccess: payment_authorized
    - name: create-shipment
      onSuccess: shipment_created
This kind of model is useful because it names the workflow in business terms rather than only in technical retries.
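A saga definition that only lists forward steps is incomplete. Real saga design also asks what can fail at each step, which already committed steps then need compensation or correction, and whether that recovery is a technical undo or a semantic correction.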
This is why saga modeling is not just a way to string services together. It is a way to state business process truth under uncertainty.
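One way to make that completeness check concrete is to model each step with its failure outcome and compensation, then flag what is missing. The sketch below is illustrative only: `StepDef`, `missing_recovery`, the failure event names, and the compensation names are assumptions introduced here, not part of any saga framework or of the definition shown earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StepDef:
    name: str
    on_success: str
    on_failure: Optional[str] = None    # hypothetical failure event name
    compensation: Optional[str] = None  # hypothetical compensating action


def missing_recovery(steps: list[StepDef]) -> list[str]:
    """Return human-readable gaps in the recovery design."""
    gaps: list[str] = []
    for index, step in enumerate(steps):
        if step.on_failure is None:
            gaps.append(f"{step.name}: no failure outcome named")
        # Every step except the last can be followed by a later failure,
        # so it needs a compensation or correction path.
        is_last = index == len(steps) - 1
        if not is_last and step.compensation is None:
            gaps.append(f"{step.name}: no compensation defined")
    return gaps


happy_path_only = [
    StepDef("reserve-inventory", "inventory_reserved"),
    StepDef("authorize-payment", "payment_authorized"),
    StepDef("create-shipment", "shipment_created"),
]

with_recovery = [
    StepDef("reserve-inventory", "inventory_reserved",
            on_failure="inventory_unavailable",
            compensation="release-inventory"),
    StepDef("authorize-payment", "payment_authorized",
            on_failure="payment_declined",
            compensation="void-or-refund-payment"),
    StepDef("create-shipment", "shipment_created",
            on_failure="shipment_failed"),
]

gaps_before = missing_recovery(happy_path_only)
gaps_after = missing_recovery(with_recovery)
```

For the forward-only list, `gaps_before` contains five entries: a missing failure outcome for each of the three steps and a missing compensation for the first two. `gaps_after` is empty. A check like this could run in review or CI, although whether a named compensation is actually adequate remains a business judgment.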
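Because sagas run over time, often across services and asynchronous boundaries, operators need visibility into the state of each saga instance: which steps have committed, where it failed, and which correction path is currently running.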
Without that, teams may know that “something failed in order processing” but not which step committed and which correction path is now active.
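A per-instance state record is one way to provide that visibility. The following is a minimal sketch with hypothetical names (`SagaInstance`, `StepStatus`, `summary`, the saga ID). A real system would persist this state and update it from step events rather than direct method calls.

```python
from dataclasses import dataclass, field
from enum import Enum


class StepStatus(Enum):
    PENDING = "pending"
    COMMITTED = "committed"
    FAILED = "failed"
    COMPENSATED = "compensated"


@dataclass
class SagaInstance:
    saga_id: str
    step_names: list[str]
    status: dict[str, StepStatus] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every step starts as pending until an outcome is recorded.
        for name in self.step_names:
            self.status.setdefault(name, StepStatus.PENDING)

    def mark(self, step: str, status: StepStatus) -> None:
        self.status[step] = status

    def summary(self) -> dict[str, object]:
        """Operator view: what committed, what failed, what is still owed."""
        def with_status(wanted: StepStatus) -> list[str]:
            return [n for n in self.step_names if self.status[n] is wanted]

        failed = with_status(StepStatus.FAILED)
        # Once any step has failed, every step that is still committed
        # is waiting for its compensation, newest first.
        awaiting = with_status(StepStatus.COMMITTED) if failed else []
        return {
            "saga_id": self.saga_id,
            "committed": with_status(StepStatus.COMMITTED),
            "failed": failed,
            "compensated": with_status(StepStatus.COMPENSATED),
            "awaiting_compensation": list(reversed(awaiting)),
        }


saga = SagaInstance(
    "order-42",
    ["reserve-inventory", "authorize-payment", "create-shipment"],
)
saga.mark("reserve-inventory", StepStatus.COMMITTED)
saga.mark("authorize-payment", StepStatus.COMMITTED)
saga.mark("create-shipment", StepStatus.FAILED)
saga.mark("authorize-payment", StepStatus.COMPENSATED)
view = saga.summary()
```

At this point `view` shows that shipment creation failed, the payment has been compensated, and the inventory reservation is still committed and awaiting release. That is the level of detail an operator needs instead of a generic "order processing failed" alert.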
A team models an order flow as a saga but only documents reserve inventory, charge card, and create shipment. There is no failure table, compensation map, or operator state view. What is the strongest critique?
The strongest critique is that the team has documented a happy path, not a full saga. A saga is defined as much by its recovery behavior as by its forward steps. Without explicit failure and compensation design, the system still lacks a trustworthy distributed process model.