Designing for Compensation and Failure

A practical lesson on compensation, retries, duplicates, timeout handling, and manual intervention in distributed workflows that cannot rely on one global rollback.

Compensation is the set of business actions a distributed workflow uses to recover from partial success. That definition matters because compensation is often misunderstood as a technical rollback. In a distributed system, one service may already have committed durable state by the time another service fails. The earlier action cannot simply disappear. The workflow must instead move the system toward an acceptable business state through new actions such as refunding payment, releasing inventory, canceling a reservation, or routing the case to manual review.

This changes how teams should think about failure. The question is not “How do we prevent any step from ever failing?” The question is “When some steps succeed and later steps fail, what recovery path preserves correctness, trust, and operability?”

    flowchart TD
        A["Order created"] --> B["Payment authorized"]
        B --> C["Inventory reservation"]
        C -->|success| D["Confirm order"]
        C -->|failure| E["Start compensation"]
        E --> F["Refund payment"]
        F --> G["Mark order failed"]
        G --> H["Notify support or customer"]

What to notice:

  • failure does not send the workflow back in time
  • compensation is a forward-moving sequence of business actions
  • support and customer communication may be part of the recovery path
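
The flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production saga engine: the `Step` class, the `StepFailed` exception, `run_workflow`, and the in-memory `ledger` are hypothetical names introduced here. The point is that recovery appends new business actions to the record instead of erasing earlier ones.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class StepFailed(Exception):
    """Raised when a step cannot complete its business action."""


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    # Business action that offsets this step once it has committed.
    compensation: Optional[Callable[[], None]] = None


def run_workflow(steps: List[Step], ledger: List[str]) -> str:
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except StepFailed:
            # Nothing is erased: each compensation appends a new action,
            # newest committed step first.
            for done in reversed(completed):
                if done.compensation is not None:
                    done.compensation()
            ledger.append("order marked failed")
            return "failed"
        completed.append(step)
    ledger.append("order confirmed")
    return "confirmed"


def reserve_inventory() -> None:
    raise StepFailed("out of stock")  # simulated downstream failure


ledger: List[str] = []
steps = [
    Step("create_order", lambda: ledger.append("order created")),
    Step(
        "authorize_payment",
        lambda: ledger.append("payment authorized"),
        compensation=lambda: ledger.append("payment refunded"),
    ),
    Step("reserve_inventory", reserve_inventory),
]
outcome = run_workflow(steps, ledger)
print(outcome, ledger)
```

After the run, the ledger still contains "payment authorized" followed by "payment refunded". The authorization is not removed; it is offset by a later action.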

Compensation Is Not Database Rollback

Database rollback is simple because the work has not been committed yet. Compensation happens after one or more local commits are already real. That is why compensation must be modeled in domain terms:

  • release reserved stock
  • cancel shipment preparation
  • refund or void a payment
  • revoke a provisional entitlement
  • mark the workflow for human review

When teams describe compensation only in technical language such as “undo step three,” they often miss business side effects, audit requirements, and customer-facing consequences.

Failure Modes You Have to Assume

Distributed workflows need to be designed around failure cases that local transactions mostly hide:

  • duplicate messages
  • retried commands
  • timeouts where the caller does not know whether the remote side succeeded
  • partial success followed by downstream failure
  • external systems that return late callbacks
  • compensations that can also fail

Ignoring these cases makes the system look cleaner during design reviews than it will feel in production.

Idempotency Is Part of Recovery

Compensation and retries both depend on idempotency. If the system retries a refund command or redelivers an inventory-release event, the handler must recognize the repeated work and avoid applying it a second time.

    {
      "commandId": "cmd_5501",
      "workflowId": "order_1042",
      "action": "refund_payment",
      "idempotencyKey": "order_1042_refund_v1",
      "reason": "inventory_reservation_failed"
    }

What this demonstrates:

  • the recovery action is explicit
  • the idempotency key gives the handler a stable deduplication reference
  • compensation is recorded as business work, not just as an infrastructure retry

Without this kind of design, retries can create second-order failures such as double refunds or duplicate release operations.
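
One way a handler might use the idempotency key is sketched below. The `RefundHandler` class and its in-memory dictionary are assumptions made for illustration. A real handler would need a durable store and would have to record the key atomically with the refund itself, which this sketch does not attempt.

```python
from typing import Dict


class RefundHandler:
    """Applies each refund command at most once per idempotency key."""

    def __init__(self) -> None:
        # Assumption: process memory stands in for a durable store here.
        self._results: Dict[str, str] = {}
        self.refunds_issued = 0

    def handle(self, command: Dict[str, str]) -> str:
        key = command["idempotencyKey"]
        if key in self._results:
            # Repeated delivery: return the earlier result, do no new work.
            return self._results[key]
        self.refunds_issued += 1  # stands in for the real refund call
        result = f"refunded:{command['workflowId']}"
        self._results[key] = result
        return result


command = {
    "commandId": "cmd_5501",
    "workflowId": "order_1042",
    "action": "refund_payment",
    "idempotencyKey": "order_1042_refund_v1",
    "reason": "inventory_reservation_failed",
}

handler = RefundHandler()
first = handler.handle(command)
second = handler.handle(command)  # simulated redelivery of the same command
print(first, second, handler.refunds_issued)
```

The second call returns the stored result and leaves `refunds_issued` at 1, so a redelivered command cannot produce a double refund.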

Timeouts Need a Business Meaning

A timeout is not just a technical nuisance. Behind it lies one of three situations:

  • the remote step failed, and the caller can confirm that
  • the remote step succeeded, but the result never reached the caller
  • the caller cannot tell which of the first two happened

The third case is the most dangerous because it creates ambiguous workflow state. Good designs prepare for it with:

  • durable workflow state
  • reconciliation jobs
  • idempotent re-queries or callbacks
  • manual review for unresolved ambiguity

If the system treats every timeout as a hard failure without reconciliation, it may compensate operations that actually succeeded.
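
A reconciliation step along these lines might look like the following sketch. The `RemoteOutcome` enum, the `resolve_timeout` function, and the `query_status` callback are hypothetical. The sketch assumes the remote system exposes a status lookup that is safe to call repeatedly.

```python
from enum import Enum
from typing import Callable


class RemoteOutcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    UNKNOWN = "unknown"


def resolve_timeout(
    query_status: Callable[[], RemoteOutcome], max_queries: int = 3
) -> str:
    """Decide what the workflow does after a timeout.

    Compensation starts only when the remote side confirms failure.
    """
    for _ in range(max_queries):
        status = query_status()
        if status is RemoteOutcome.SUCCEEDED:
            return "continue_workflow"
        if status is RemoteOutcome.FAILED:
            return "start_compensation"
    # Still ambiguous after every re-query: hand the case to a human.
    return "manual_review"


# Simulated remote system: ambiguous at first, then reports success.
answers = iter([RemoteOutcome.UNKNOWN, RemoteOutcome.SUCCEEDED])
decision = resolve_timeout(lambda: next(answers))
print(decision)
```

Here the workflow continues instead of refunding a payment that actually went through. If every re-query had stayed ambiguous, the function would have returned `manual_review` instead of guessing.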

Manual Intervention Is a Real Design Requirement

Some teams treat manual intervention as architectural defeat. In reality, certain workflows need it. High-value or high-risk processes often cannot be resolved safely by automation alone when the state becomes ambiguous. A strong design identifies:

  • when the workflow should stop retrying automatically
  • what evidence a human reviewer needs
  • how the system prevents duplicate manual resolution
  • what the final authoritative correction path is

This is part of workflow maturity, not an admission that the architecture failed.
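
The four points above can be made concrete with a small sketch. The `ReviewCase` class and `retry_or_escalate` function are hypothetical names, and the in-memory guard against double resolution assumes a single process; a real system would enforce it in shared durable storage.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ReviewCase:
    workflow_id: str
    evidence: List[str]  # what the human reviewer gets to see
    resolved_by: Optional[str] = None
    resolution: Optional[str] = None

    def resolve(self, reviewer: str, resolution: str) -> bool:
        # Reject a second resolution so two reviewers cannot both act.
        if self.resolved_by is not None:
            return False
        self.resolved_by = reviewer
        self.resolution = resolution
        return True


def retry_or_escalate(
    action: Callable[[], None], max_attempts: int, workflow_id: str
) -> Optional[ReviewCase]:
    """Retry a bounded number of times, then stop and open a review case."""
    evidence: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            action()
            return None  # succeeded, no human needed
        except Exception as exc:
            evidence.append(f"attempt {attempt}: {exc}")
    return ReviewCase(workflow_id, evidence)


def refund_payment() -> None:
    raise RuntimeError("gateway unavailable")  # simulated persistent failure


case = retry_or_escalate(refund_payment, max_attempts=3, workflow_id="order_1042")
accepted = case.resolve("reviewer_a", "refund_issued_manually")
rejected = case.resolve("reviewer_b", "refund_issued_manually")
print(case.evidence, accepted, rejected)
```

The compensation itself is the failing action here, which reflects the earlier point that compensations can also fail.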

Recovery Plans Should Be Explicit

One practical way to keep compensation design concrete is to document it as data:

    workflow: place-order
    step: reserve_inventory
    on_failure:
      compensate:
        - action: refund_payment
          owner: payment-service
          idempotent: true
        - action: mark_order_failed
          owner: order-service
          idempotent: true
      escalation:
        after_retries: 5
        route_to: support-queue

What this demonstrates:

  • compensation is planned before the outage happens
  • ownership for each recovery action is explicit
  • escalation is part of the workflow model rather than a vague operational hope
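
A benefit of keeping the plan as data is that it can be checked mechanically, for example during a design review or in a build step. The sketch below mirrors the plan above as a Python dictionary to stay dependency-free; `validate_plan` and the specific rules it enforces are assumptions for illustration, not an established schema.

```python
from typing import Any, Dict, List

RECOVERY_PLAN: Dict[str, Any] = {
    "workflow": "place-order",
    "step": "reserve_inventory",
    "on_failure": {
        "compensate": [
            {"action": "refund_payment", "owner": "payment-service", "idempotent": True},
            {"action": "mark_order_failed", "owner": "order-service", "idempotent": True},
        ],
        "escalation": {"after_retries": 5, "route_to": "support-queue"},
    },
}


def validate_plan(plan: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the plan passes."""
    problems: List[str] = []
    on_failure = plan.get("on_failure", {})
    compensations = on_failure.get("compensate", [])
    if not compensations:
        problems.append("no compensation actions defined")
    for entry in compensations:
        name = entry.get("action", "<unnamed>")
        if not entry.get("owner"):
            problems.append(f"{name}: no owner")
        if not entry.get("idempotent"):
            problems.append(f"{name}: not marked idempotent")
    escalation = on_failure.get("escalation")
    if not escalation or not escalation.get("route_to"):
        problems.append("no escalation route")
    return problems


print(validate_plan(RECOVERY_PLAN))
```

A plan with an unowned action or no escalation route would come back with a non-empty list, which gives reviewers something specific to discuss before an outage forces the question.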

Auditability Matters

Recovery paths should leave clear records. Operators and auditors often need to answer questions like:

  • Was payment captured, authorized only, or already refunded?
  • Was inventory ever reserved?
  • Did the compensation run automatically or manually?
  • Which system decided the final workflow state?

That is why compensation should be observable, durable, and traceable. If the only evidence is in scattered logs, incident response becomes guesswork.
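
As a rough sketch of what a traceable record could contain, the example below keeps an append-only trail per workflow. The `AuditRecord` fields, the `AuditTrail` class, and the event names are hypothetical, and an in-memory list stands in for durable storage.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class AuditRecord:
    workflow_id: str
    event: str        # e.g. "payment_authorized", "payment_refunded"
    decided_by: str   # the system or person that made the decision
    mode: str         # "automatic" or "manual"
    recorded_at: datetime


class AuditTrail:
    def __init__(self) -> None:
        self._records: List[AuditRecord] = []  # append-only by convention

    def record(self, workflow_id: str, event: str, decided_by: str, mode: str) -> None:
        now = datetime.now(timezone.utc)
        self._records.append(AuditRecord(workflow_id, event, decided_by, mode, now))

    def history(self, workflow_id: str) -> List[AuditRecord]:
        return [r for r in self._records if r.workflow_id == workflow_id]

    def latest(self, workflow_id: str, prefix: str) -> Optional[AuditRecord]:
        matches = [r for r in self.history(workflow_id) if r.event.startswith(prefix)]
        return matches[-1] if matches else None


trail = AuditTrail()
trail.record("order_1042", "payment_authorized", "payment-service", "automatic")
trail.record("order_1042", "inventory_reservation_failed", "inventory-service", "automatic")
trail.record("order_1042", "payment_refunded", "payment-service", "automatic")
trail.record("order_1042", "order_marked_failed", "order-service", "automatic")

payment_state = trail.latest("order_1042", "payment_")
ever_reserved = any(r.event == "inventory_reserved" for r in trail.history("order_1042"))
print(payment_state.event, payment_state.mode, ever_reserved)
```

With records like these, each question in the list above becomes a lookup instead of a search through scattered logs.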

When Compensation Is a Warning Sign

Compensation is necessary in distributed systems, but a large amount of elaborate compensation logic can also reveal a decomposition problem. If nearly every normal workflow requires a long chain of risky reversals, the architecture may have split apart responsibilities that belong inside a single, stronger boundary.

The right review question is:

“Are we designing compensation because distribution is appropriate here, or because we decomposed a tightly coupled process too early?”

That question keeps failure planning from becoming a way to rationalize weak boundaries.

Design Review Question

An order workflow authorizes payment, reserves inventory, starts shipment preparation, and notifies the customer. If shipment preparation fails after payment and inventory have already succeeded, the team plans to “retry until it works” and has no explicit compensation or escalation model. What is the main weakness in the design?

The main weakness is that the workflow has no defined recovery model for partial success. Retry alone does not answer what happens when the failure is persistent, ambiguous, or externally visible to the customer. A stronger design would define which steps can be retried safely, which compensations exist, when the process moves to manual review, and which service records the final authoritative state.

Revised on Thursday, April 23, 2026