Step Functions and Workflow Engines

This section describes how managed workflow and orchestration services coordinate retries, branching, waiting, and human review, and explains why explicit workflow tools matter.

Workflow engines give serverless systems a durable control plane for multi-step processes. Instead of encoding retries, waits, branches, and human approval logic inside one function or across a chain of loosely coordinated event handlers, a workflow engine stores the current state of the process explicitly and decides what should run next.

The label “step functions” is often used because several platforms expose workflow engines as a sequence of named states or steps. The underlying idea is broader and vendor-neutral: make workflow progress visible, durable, and inspectable. That is the difference between a process that survives retries cleanly and one that becomes impossible to reason about after the first operational incident.

    flowchart LR
        A["Start workflow"] --> B["Validate request"]
        B --> C{"Manual review needed?"}
        C -->|No| D["Charge payment"]
        C -->|Yes| E["Wait for approval"]
        E --> D
        D --> F{"Payment succeeded?"}
        F -->|Yes| G["Create shipment"]
        F -->|No| H["Record failure"]
        G --> I["Complete"]

What to notice:

  • the workflow state, not the function runtime, decides what happens next
  • waits and branching are explicit instead of being buried in code paths
  • operators can inspect the current process position and failure point
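
The flowchart above can be approximated as a plain transition table. This is a minimal in-memory sketch, not a real engine: the state names mirror the diagram, while `TRANSITIONS`, `run_workflow`, and the context keys `needs_review` and `payment_ok` are hypothetical names chosen for illustration. A managed service would also persist the current state between steps, which this sketch does not attempt.

```python
from typing import Callable, Dict, List, Optional

Context = Dict[str, bool]

# Each state maps to a function that picks the next state from the context.
# None marks a terminal state. All names here are illustrative assumptions.
TRANSITIONS: Dict[str, Callable[[Context], Optional[str]]] = {
    "start": lambda ctx: "validate_request",
    "validate_request": lambda ctx: "review_check",
    "review_check": lambda ctx: (
        "wait_for_approval" if ctx["needs_review"] else "charge_payment"
    ),
    "wait_for_approval": lambda ctx: "charge_payment",
    "charge_payment": lambda ctx: (
        "create_shipment" if ctx["payment_ok"] else "record_failure"
    ),
    "create_shipment": lambda ctx: "complete",
    "record_failure": lambda ctx: None,
    "complete": lambda ctx: None,
}


def run_workflow(ctx: Context, start: str = "start") -> List[str]:
    """Walk the table and return the ordered list of states visited."""
    visited: List[str] = []
    state: Optional[str] = start
    while state is not None:
        visited.append(state)
        state = TRANSITIONS[state](ctx)
    return visited
```

Because the next step is chosen by the table rather than by whichever function happened to run last, the list of visited states doubles as a record of where the process has been.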

Why Explicit Workflow Tools Matter

Workflow engines are strong when a process needs any combination of:

  • retries with bounded policy
  • conditional branching
  • long waits or timers
  • human approval
  • parallel steps
  • durable auditability of progress
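
As an illustration of the first item in the list, a bounded retry policy can be written down as data rather than scattered through handler code. The sketch below is an assumption-laden example: `RetryPolicy`, its field names, and `run_with_retry` are hypothetical and do not correspond to any particular platform's API. The `sleep` function is passed in so the policy can be exercised without real delays.

```python
from dataclasses import dataclass
from typing import Callable, List, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical bounded policy: fixed attempt budget, capped backoff."""
    max_attempts: int = 3
    base_delay_s: float = 1.0
    multiplier: float = 2.0
    max_delay_s: float = 30.0

    def delays(self) -> List[float]:
        """Delays between attempts; there is one fewer delay than attempts."""
        return [
            min(self.base_delay_s * self.multiplier ** i, self.max_delay_s)
            for i in range(self.max_attempts - 1)
        ]


def run_with_retry(task: Callable[[], T], policy: RetryPolicy,
                   sleep: Callable[[float], None]) -> T:
    """Call task until it succeeds or the attempt budget is spent."""
    delays = policy.delays()
    for attempt in range(policy.max_attempts):
        try:
            return task()
        except Exception:
            if attempt == policy.max_attempts - 1:
                raise  # bounded: give up and surface the last error
            sleep(delays[attempt])
    raise ValueError("max_attempts must be at least 1")
```

In a workflow engine the equivalent policy lives in the workflow definition, so task code only has to raise on failure.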

Without an explicit workflow layer, teams often end up building a fragile coordinator function that:

  • invokes downstream functions directly
  • stores partial progress in ad hoc records
  • reimplements retry rules manually
  • cannot explain what happened after a failure

That is usually not simplicity. It is hidden orchestration.

Workflow Logic Belongs in Workflow State

The point of a workflow engine is not to move all business logic into a giant declarative file. It is to keep coordination logic separate from task logic. Each step should still do one bounded thing. The workflow layer is responsible for sequencing, branching, retry policy, waits, and visibility.

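    # Illustrative, vendor-neutral definition; field names are not tied to a specific product.
    workflow:
      name: order-approval
      start_at: validate-order
      states:
        validate-order:
          type: task
          next: needs-review
        # Branch on a field of the workflow input; fall through to the default.
        needs-review:
          type: choice
          when:
            requiresManualReview: wait-for-review
          default: charge-payment
        # Execution is parked here until an approval event arrives.
        wait-for-review:
          type: wait_for_event
          next: charge-payment
        # Retry policy is declared on the state, not inside the task code.
        charge-payment:
          type: task
          retry:
            attempts: 3
            backoff: exponential
          next: create-shipment
        create-shipment:
          type: task
          end: true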

What this demonstrates:

  • step behavior is visible as workflow definition, not implied by nested function code
  • retry policy is declared at the coordination layer
  • long waits do not require a function to stay alive
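
To make the last two points concrete, here is a rough sketch of how an engine might interpret the definition above. Everything in it is an assumption for illustration: the `DEFINITION` dict mirrors the example definition, and `start`, `advance`, `resume`, `save`, and `load` are hypothetical helpers, not any real service's API. Backoff delays are omitted to keep the sketch short. The execution record is plain JSON-serializable data, so it can be stored while the workflow waits and no function has to stay alive.

```python
import json
from typing import Any, Callable, Dict

# Dict form of the example definition; field names follow that sketch.
DEFINITION: Dict[str, Dict[str, Any]] = {
    "validate-order": {"type": "task", "next": "needs-review"},
    "needs-review": {"type": "choice",
                     "when": {"requiresManualReview": "wait-for-review"},
                     "default": "charge-payment"},
    "wait-for-review": {"type": "wait_for_event", "next": "charge-payment"},
    "charge-payment": {"type": "task", "retry": {"attempts": 3},
                       "next": "create-shipment"},
    "create-shipment": {"type": "task", "end": True},
}

Task = Callable[[Dict[str, Any]], None]
Execution = Dict[str, Any]


def start(data: Dict[str, Any]) -> Execution:
    return {"state": "validate-order", "status": "running", "data": data}


def _run_with_attempts(task: Task, data: Dict[str, Any], attempts: int) -> bool:
    """Return True on success, False once the attempt budget is spent."""
    for _ in range(attempts):
        try:
            task(data)
            return True
        except Exception:
            continue  # backoff delay omitted in this sketch
    return False


def advance(execution: Execution, tasks: Dict[str, Task]) -> Execution:
    """Run states until the workflow ends, fails, or parks at a wait state."""
    while execution["status"] == "running":
        name = execution["state"]
        spec = DEFINITION[name]
        if spec["type"] == "wait_for_event":
            execution["status"] = "waiting"  # nothing runs until resume()
        elif spec["type"] == "choice":
            target = spec["default"]
            for field, branch in spec["when"].items():
                if execution["data"].get(field):
                    target = branch
                    break
            execution["state"] = target
        else:  # task state: the engine, not the task, applies the retry budget
            attempts = spec.get("retry", {}).get("attempts", 1)
            if not _run_with_attempts(tasks[name], execution["data"], attempts):
                execution["status"] = "failed"
            elif spec.get("end"):
                execution["status"] = "succeeded"
            else:
                execution["state"] = spec["next"]
    return execution


def resume(execution: Execution, tasks: Dict[str, Task],
           event_data: Dict[str, Any]) -> Execution:
    """Deliver the awaited event and continue from the stored state."""
    if execution["status"] != "waiting":
        raise ValueError("execution is not waiting for an event")
    execution["data"].update(event_data)
    execution["state"] = DEFINITION[execution["state"]]["next"]
    execution["status"] = "running"
    return advance(execution, tasks)


def save(execution: Execution) -> str:
    return json.dumps(execution)  # stand-in for a durable store


def load(snapshot: str) -> Execution:
    return json.loads(snapshot)
```

A caller would `advance` a new execution, `save` it when the status becomes `waiting`, and later `load` and `resume` it from a separate invocation once the approval event arrives.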

Where Teams Get This Wrong

The most common failure mode is to confuse a workflow engine with a place to put all business logic. A workflow should coordinate steps, not become a monolithic rules engine full of complicated data transformation. Another mistake is to avoid workflow tooling entirely and write orchestrator code inside one function because it feels faster at the start. That approach often works for the first demo and fails during the first real retry or timeout incident.

Common Mistakes

  • writing orchestration logic inside a single handler and calling it “simple”
  • treating workflow definitions as the right place for heavy business logic
  • using long sleeps or polling loops inside functions instead of durable wait states
  • failing to model failure paths explicitly
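
The last mistake in the list can be illustrated with a small sketch in which every step names both its success transition and its failure transition. The step names echo the order example used earlier; the `STEPS` table, the `run` helper, and the `card_declined` flag are hypothetical and exist only for this illustration.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Order = Dict[str, Any]
# state -> (task, next state on success, next state on error)
Step = Tuple[Callable[[Order], None], Optional[str], Optional[str]]


def charge_payment(order: Order) -> None:
    if order.get("card_declined"):
        raise RuntimeError("card declined")


def create_shipment(order: Order) -> None:
    order["shipped"] = True


def record_failure(order: Order) -> None:
    order["failure_recorded"] = True


STEPS: Dict[str, Step] = {
    "charge-payment": (charge_payment, "create-shipment", "record-failure"),
    "create-shipment": (create_shipment, None, "record-failure"),
    "record-failure": (record_failure, None, None),
}


def run(order: Order, start: str = "charge-payment") -> List[str]:
    """Follow success or failure transitions and return the path taken."""
    path: List[str] = []
    state: Optional[str] = start
    while state is not None:
        task, on_success, on_error = STEPS[state]
        path.append(state)
        try:
            task(order)
            state = on_success
        except Exception as exc:
            if on_error is None:
                raise  # no modeled failure path: surface the error
            order["error"] = str(exc)
            state = on_error
    return path
```

With the failure transition written into the table, a declined card ends in `record-failure` by design instead of leaving the order in an undefined state.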

Design Review Question

A team has one coordinating function that validates a purchase, waits for fraud review, charges a card, and triggers fulfillment by invoking other functions directly. It keeps progress in a few status flags in a database row. The team can no longer explain why some orders are stuck. What should change first?

The stronger answer is to introduce explicit workflow orchestration, not more logs inside the coordinator. The main problem is hidden control flow. A workflow engine would make state transitions, waits, retries, and failure points visible and durable, while the task functions could stay narrow and become easier to test.
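
As a rough illustration of why recorded transitions help here, the sketch below assumes each workflow execution appends a `(order_id, state_entered)` pair to a history in chronological order. The helper names and the history format are hypothetical; real engines expose execution history in their own formats. With such a history, the question "where is each unfinished order?" becomes a simple query rather than a guess from status flags.

```python
from typing import Dict, Iterable, Set, Tuple

Transition = Tuple[str, str]  # (order_id, state_entered), oldest first


def current_positions(history: Iterable[Transition]) -> Dict[str, str]:
    """Return the most recently entered state for each order."""
    positions: Dict[str, str] = {}
    for order_id, state in history:
        positions[order_id] = state
    return positions


def unfinished_orders(history: Iterable[Transition],
                      terminal_states: Set[str]) -> Dict[str, str]:
    """Return orders whose latest state is not terminal, with that state."""
    return {
        order_id: state
        for order_id, state in current_positions(history).items()
        if state not in terminal_states
    }
```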

Revised on Thursday, April 23, 2026