How to observe long-running workflows, compensating actions, and partial completion when no single request represents the whole business process.
Workflow and saga observability matters because many business outcomes are not represented by one request or one transaction. They unfold over time through multiple steps, state transitions, and sometimes compensating actions. That means the key observability question is not only “did this request fail?” but “what stage is this workflow in, what has already completed, and what still needs to happen or unwind?”
Traditional request-centric telemetry is necessary but incomplete here. Teams also need workflow identity, state-transition visibility, timeout detection, compensation visibility, and business-level completion metrics. Without those, a system can look healthy at the service layer while many business processes are stuck midway.
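As a minimal sketch of what workflow identity and state-transition visibility can look like in practice, the snippet below builds one structured record per transition. The function name `transition_event`, the event name `workflow.transition`, and the field layout are illustrative assumptions, not a standard schema.

```python
import json
import time


def transition_event(workflow_id, correlation_id, from_state, to_state, now=None):
    """Build one structured record describing a workflow state transition."""
    return {
        "event": "workflow.transition",      # hypothetical event name
        "workflow_id": workflow_id,          # identifies the business process
        "correlation_id": correlation_id,    # ties the record to request telemetry
        "from_state": from_state,
        "to_state": to_state,
        "timestamp": time.time() if now is None else now,
    }


# Emit the record as one JSON log line; any log pipeline could consume it.
print(json.dumps(transition_event("wf-1", "req-123", "Created", "PaymentAuthorized")))
```

Because every record carries the same `workflow_id`, the full history of one business process can be reassembled even when its steps run in different services and at different times.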
stateDiagram-v2
[*] --> Created
Created --> PaymentAuthorized
PaymentAuthorized --> InventoryReserved
InventoryReserved --> FulfillmentStarted
FulfillmentStarted --> Completed
InventoryReserved --> Compensation
PaymentAuthorized --> Compensation
Compensation --> Failed
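The diagram can also be expressed as a transition table, which lets a service reject or flag transitions the model does not allow. This is a sketch: the names `TRANSITIONS`, `IllegalTransition`, and `advance` are hypothetical, and treating `Completed` and `Failed` as terminal states is an assumption drawn from the fact that the diagram shows no outgoing edges for them.

```python
# Allowed transitions, copied from the state diagram above.
TRANSITIONS = {
    "Created": {"PaymentAuthorized"},
    "PaymentAuthorized": {"InventoryReserved", "Compensation"},
    "InventoryReserved": {"FulfillmentStarted", "Compensation"},
    "FulfillmentStarted": {"Completed"},
    "Compensation": {"Failed"},
    "Completed": set(),  # assumed terminal
    "Failed": set(),     # assumed terminal
}

# States with no outgoing transitions are treated as terminal.
TERMINAL_STATES = {state for state, targets in TRANSITIONS.items() if not targets}


class IllegalTransition(Exception):
    """Raised when a workflow attempts a transition the diagram does not allow."""


def advance(current, target):
    """Return the new state, or raise if the transition is not in the table."""
    if target not in TRANSITIONS.get(current, set()):
        raise IllegalTransition(f"{current} -> {target}")
    return target
```

An illegal transition is itself a useful signal: counting `IllegalTransition` occurrences can surface ordering bugs or duplicate message delivery that service-level metrics would not show.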
A strong workflow or saga observability model usually includes:
workflow_visibility:
  identity:
    - workflow_id
    - correlation_id
  progress:
    - current_state
    - state_entered_at
    - step_duration
  failure:
    - timeout_count
    - compensation_count
    - abandoned_workflows
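The `current_state` and `state_entered_at` fields are enough to detect workflows that have stayed in one state too long. The sketch below assumes a hypothetical `WorkflowRecord` type and made-up per-state time budgets; real thresholds depend on the business process.

```python
from dataclasses import dataclass


@dataclass
class WorkflowRecord:
    workflow_id: str
    current_state: str
    state_entered_at: float  # Unix time in seconds


# Hypothetical per-state time budgets in seconds; tune to the real process.
MAX_SECONDS_IN_STATE = {
    "PaymentAuthorized": 60,
    "InventoryReserved": 300,
    "FulfillmentStarted": 3600,
}


def find_stuck(records, now, budgets=MAX_SECONDS_IN_STATE):
    """Return (workflow_id, state, seconds_in_state) for workflows over budget."""
    stuck = []
    for record in records:
        budget = budgets.get(record.current_state)
        if budget is None:
            continue  # terminal or unbudgeted state: nothing to flag
        age = now - record.state_entered_at
        if age > budget:
            stuck.append((record.workflow_id, record.current_state, age))
    return stuck
```

Running a check like this on a schedule could feed counters such as `timeout_count` and `abandoned_workflows`, so a stalled workflow is reported even though no individual request has failed.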
What to notice:
Sagas make this especially important. Several steps may succeed before one later step fails and triggers compensation. If observability tracks only service-local success, the system may look healthy while users experience inconsistent or delayed outcomes. Workflow-level observability restores the business view.
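To make compensation visible, each step outcome and each compensating action can be recorded against the workflow rather than only inside the service that ran it. The following is a simplified in-process sketch, not a production saga engine: `run_saga`, the event names, and the list-based `log` are assumptions for illustration, and it does not handle a compensating action that itself fails.

```python
def run_saga(workflow_id, steps, log):
    """Run steps in order; on failure, compensate completed steps in reverse.

    `steps` is a list of (name, action, compensate) tuples.
    `log` is a list that receives one dict per observable event.
    Returns True if every step completed, False if compensation ran.
    """
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            log.append({"workflow_id": workflow_id, "event": "step_failed",
                        "step": name, "error": str(exc)})
            # Undo already-completed steps, most recent first.
            for done_name, done_compensate in reversed(completed):
                done_compensate()
                log.append({"workflow_id": workflow_id, "event": "compensated",
                            "step": done_name})
            log.append({"workflow_id": workflow_id, "event": "workflow_failed"})
            return False
        completed.append((name, compensate))
        log.append({"workflow_id": workflow_id, "event": "step_completed",
                    "step": name})
    log.append({"workflow_id": workflow_id, "event": "workflow_completed"})
    return True
```

With a log like this, two local successes followed by two compensations read as one failed order, not as two healthy services.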
If several services report successful local operations but customer orders still disappear into an unresolved intermediate state, what observability layer is missing?
The stronger answer is workflow-level state visibility. Local step success exists, but the end-to-end business process cannot be tracked reliably through completion, timeout, or compensation.
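A business-level completion metric can be derived from the latest known state of each workflow. This sketch assumes a hypothetical mapping of `workflow_id` to current state and uses the state names from the diagram; the `summarize` function and its output fields are illustrative.

```python
from collections import Counter


def summarize(current_states):
    """Summarize workflow outcomes from a {workflow_id: current_state} mapping."""
    counts = Counter(current_states.values())
    total = len(current_states)
    completed = counts.get("Completed", 0)
    failed = counts.get("Failed", 0)
    return {
        "total": total,
        "completed": completed,
        "failed": failed,
        # Anything not in a terminal state is still unresolved.
        "in_flight": total - completed - failed,
        "completion_rate": completed / total if total else 0.0,
    }
```

A rising `in_flight` count alongside healthy service-level success rates is the pattern the question above describes.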