Saga and Compensation Patterns

How multi-step serverless workflows can handle partial failure through compensation, reversals, and explicit recovery logic.

Saga and compensation patterns deal with a hard truth of distributed serverless workflows: many multi-step business processes cannot rely on one global transaction. Instead, each step commits locally, and if a later step fails, the system executes compensating actions that try to bring the overall business outcome back to a safe state.

This is not the same as a database rollback. A compensation step is a new action with business meaning: refund a payment, release inventory, send a correction event, or mark a request as canceled. That difference matters because some effects cannot be perfectly undone. Compensation is a recovery design, not a time machine.

    flowchart LR
        A["Reserve inventory"] --> B["Charge payment"]
        B --> C["Create shipment"]
        C --> D["Complete order"]
        C -->|fails| E["Refund payment"]
        E --> F["Release inventory"]
        F --> G["Mark order failed"]

What to notice:

  • each forward step has business meaning and may commit independently
  • compensation steps are explicit and ordered, not implied rollback magic
  • failure handling is part of the workflow design, not an afterthought
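
The recovery path in the diagram can be derived mechanically from the forward steps. The following is a minimal sketch rather than any workflow engine's API: the `Step` type, the `orderSteps` list, and the `compensationPlan` function are hypothetical names chosen for illustration, and the step names simply mirror the diagram.

```typescript
// Hypothetical model of the order flow; names mirror the diagram above.
type Step = { name: string; compensation: string | null };

const orderSteps: Step[] = [
  { name: "reserve-inventory", compensation: "release-inventory" },
  { name: "charge-payment", compensation: "refund-payment" },
  { name: "create-shipment", compensation: "cancel-shipment-if-possible" },
  { name: "complete-order", compensation: null }, // assumed to need no compensation
];

// Returns the compensations to run when the step at `failedIndex` fails.
// Only steps that completed are compensated, newest first.
function compensationPlan(steps: Step[], failedIndex: number): string[] {
  const plan: string[] = [];
  for (const step of steps.slice(0, failedIndex).reverse()) {
    if (step.compensation !== null) plan.push(step.compensation);
  }
  return plan;
}

// Shipment creation (index 2) fails: refund-payment, then release-inventory.
console.log(compensationPlan(orderSteps, 2));
```

The failed step itself is not compensated here, on the assumption that a step which throws has committed nothing. If a step can fail after a partial commit, that assumption no longer holds and the step needs its own cleanup.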

When a Saga Is the Right Tool

Saga-style design is useful when:

  • several services own their own data and side effects
  • a process spans multiple local transactions
  • the business can define compensating actions for partial failure
  • eventual consistency is acceptable

A saga is a weaker fit when the whole process fits in one small local transaction inside a single bounded service or database. Teams should not introduce saga complexity where a simpler consistency boundary already exists.

Compensation Must Be Business-Meaningful

A good compensation step answers a business question: what should happen now that the forward path failed after some effects already occurred? Examples include:

  • refund payment
  • release reserved stock
  • cancel pending provisioning
  • emit a correction event to downstream systems

The anti-pattern is to assume every forward action has a perfect reverse. Some actions are irreversible or only partially reversible. Email may already have been sent. An external partner may already have consumed a webhook. Compensation has to reflect that reality.
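
One way to keep irreversibility visible is to record, per step, how far its compensation can actually go. This is a sketch under assumptions: the `Reversibility` labels, the `StepEffect` type, and the `residualEffects` function are hypothetical, and the residual descriptions are illustrative rather than drawn from a real system.

```typescript
// Hypothetical classification; the labels are illustrative, not a standard.
type Reversibility = "full" | "partial" | "none";

type StepEffect = {
  step: string;
  reversibility: Reversibility;
  // What remains true in the world even after compensation has run.
  residual?: string;
};

const stepEffects: StepEffect[] = [
  { step: "reserve-stock", reversibility: "full" },
  {
    step: "charge-payment",
    reversibility: "partial",
    residual: "charge and refund both appear on the customer's statement",
  },
  {
    step: "send-email",
    reversibility: "none",
    residual: "email already delivered; only a follow-up correction is possible",
  },
  {
    step: "notify-partner-webhook",
    reversibility: "none",
    residual: "partner already consumed the event; emit a correction event",
  },
];

// Lists the effects that compensation cannot fully erase, so that
// the team can decide explicitly whether each one is acceptable.
function residualEffects(all: StepEffect[]): string[] {
  return all
    .filter((e) => e.reversibility !== "full")
    .map((e) => `${e.step}: ${e.residual ?? "unspecified residual effect"}`);
}

console.log(residualEffects(stepEffects));
```

Making the residual effect a field forces the question "what is still true after we compensate?" to be answered at design time instead of during an incident.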

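A saga definition can pair each forward step with its compensation declaratively. The schema here is illustrative, not a specific engine's format:

    saga:
      steps:
        - name: reserve-inventory
          compensate: release-inventory
        - name: charge-payment
          compensate: refund-payment
        - name: create-shipment
          compensate: cancel-shipment-if-possible
      on_failure:
        mode: reverse-completed-steps

The same pairing in a minimal TypeScript executor:

    type SagaStep = {
      name: string;
      run: () => Promise<void>;        // forward action; commits locally
      compensate: () => Promise<void>; // business-level reversal of run()
    };

    export async function executeSaga(steps: SagaStep[]) {
      // Steps whose forward action has already committed.
      const completed: SagaStep[] = [];

      try {
        for (const step of steps) {
          await step.run();
          completed.push(step);
        }
      } catch (error) {
        // Compensate newest-first. Copy before reversing so that
        // `completed` itself is not mutated.
        for (const step of [...completed].reverse()) {
          await step.compensate();
        }
        // Surface the original failure once compensation has finished.
        throw error;
      }
    }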

What this demonstrates:

  • forward and compensating actions are paired intentionally
  • compensation usually happens in reverse completion order
  • recovery logic is explicit instead of hidden in generic retry behavior
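
One gap in a simple loop like the one above is that a compensating action can itself fail, in which case the remaining compensations never run. The variant below is a sketch under assumptions, not a definitive implementation: `runSagaCollectingFailures` and the `SagaOutcome` shape are hypothetical names. It keeps compensating after a failure and reports which steps could not be compensated, so they can be handed to a retry queue or an operator.

```typescript
type SagaStep = {
  name: string;
  run: () => Promise<void>;
  compensate: () => Promise<void>;
};

// Hypothetical result shape: callers decide how to follow up.
type SagaOutcome =
  | { status: "completed" }
  | { status: "compensated"; failedStep: string; cause: unknown }
  | {
      status: "compensation-incomplete";
      failedStep: string;
      cause: unknown;
      uncompensated: string[];
    };

async function runSagaCollectingFailures(steps: SagaStep[]): Promise<SagaOutcome> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.run();
      completed.push(step);
    } catch (cause) {
      const uncompensated: string[] = [];
      // Walk completed steps newest-first. A failing compensation is
      // recorded and skipped instead of aborting the remaining ones.
      for (const done of [...completed].reverse()) {
        try {
          await done.compensate();
        } catch {
          uncompensated.push(done.name);
        }
      }
      return uncompensated.length === 0
        ? { status: "compensated", failedStep: step.name, cause }
        : { status: "compensation-incomplete", failedStep: step.name, cause, uncompensated };
    }
  }
  return { status: "completed" };
}

// Usage sketch with in-memory steps; releasing inventory fails on purpose.
const demoTrace: string[] = [];
const demoSteps: SagaStep[] = [
  {
    name: "reserve-inventory",
    run: async () => { demoTrace.push("reserve"); },
    compensate: async () => { throw new Error("inventory service unavailable"); },
  },
  {
    name: "charge-payment",
    run: async () => { demoTrace.push("charge"); },
    compensate: async () => { demoTrace.push("refund"); },
  },
  {
    name: "create-shipment",
    run: async () => { throw new Error("carrier rejected shipment"); },
    compensate: async () => {},
  },
];

runSagaCollectingFailures(demoSteps).then((outcome) => {
  console.log(outcome.status, demoTrace);
});
```

Returning an outcome instead of throwing is a design choice for this sketch: it lets the caller distinguish a cleanly compensated saga from one that still has uncompensated steps.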

Common Mistakes

  • calling a distributed process a saga without defining compensation steps
  • assuming compensation is identical to rollback
  • forgetting idempotency for compensating actions
  • using saga complexity where a simpler local transaction would have been enough
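
Of these mistakes, missing idempotency is easy to overlook, because a retried or redelivered compensation can run twice. The following is a minimal sketch that assumes an in-memory record of applied compensations. `CompensationLog`, `runOnce`, and the order identifier are hypothetical names, and the amount is made up for the example.

```typescript
// Hypothetical in-memory guard; a production version would need a durable store.
class CompensationLog {
  private readonly applied = new Set<string>();

  // Runs `action` at most once per (sagaId, step) pair.
  runOnce(sagaId: string, step: string, action: () => void): "applied" | "skipped" {
    const key = `${sagaId}:${step}`;
    if (this.applied.has(key)) return "skipped";
    action();
    this.applied.add(key); // recorded only after the action succeeds
    return "applied";
  }
}

let refundedCents = 0;
const compLog = new CompensationLog();
const refundOrder = () => { refundedCents += 2500; };

// The second call simulates a retry of the same compensation.
const firstAttempt = compLog.runOnce("order-42", "refund-payment", refundOrder);
const retryAttempt = compLog.runOnce("order-42", "refund-payment", refundOrder);

console.log(firstAttempt, retryAttempt, refundedCents); // applied skipped 2500
```

This guard only protects against repeats that it can see. If the process crashes between the action and the record, a retry would apply the action again, which is why the downstream service should ideally also accept an idempotency key.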

Design Review Question

An onboarding workflow creates an account, provisions resources, and sends a welcome email. If provisioning fails, the team wants to “roll everything back.” What should the design review challenge first?

The stronger answer is to challenge the phrase “roll everything back” itself. Some effects are easy to compensate, such as deleting a pending account or releasing provisioned resources. Others, like a sent email or an external audit record, may not be reversible. The right design is to define concrete compensating actions and acceptable residual effects, not to assume rollback semantics that the distributed system does not actually have.

Revised on Thursday, April 23, 2026