Long-Running Business Processes

This section shows how serverless can support longer workflows through orchestration, external state, durable timers, and event-driven handoffs, and explains where complexity starts to rise.

Long-running business processes are workflows that take longer than a normal request or function invocation, often because they wait on humans, external partners, scheduled windows, or multi-stage business validation. Serverless can support them well, but only if the design treats time as workflow state rather than as compute runtime.

That is the central shift. A long-running serverless process does not keep a function open for hours or days. It moves forward through durable checkpoints, timers, state transitions, and event-driven handoffs. The process is long. The compute steps should still be short.

    flowchart TD
        A["Application submitted"] --> B["Initial validation"]
        B --> C["Wait for manager approval"]
        C --> D["Wait for compliance review"]
        D --> E["Provision resources"]
        E --> F["Send completion notice"]

What to notice:

  • time passes between states without requiring compute to stay allocated
  • each stage can be retried or inspected separately
  • external approvals and delayed steps are part of the workflow model
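
The idea of a long process built from short steps can be sketched in a few lines of Python. This is a minimal illustration, not a real orchestrator: the `Workflow` record, the `advance` function, and the `"approved"` signal name are all hypothetical, and the record is assumed to be saved to a durable store between invocations.

```python
from dataclasses import dataclass, field
from typing import Optional

# Stage names mirror the diagram; this ordered list is the whole workflow definition.
STAGES = [
    "initial_validation",
    "wait_for_manager_approval",
    "wait_for_compliance_review",
    "provision_resources",
    "send_completion_notice",
]
WAITING_STAGES = {"wait_for_manager_approval", "wait_for_compliance_review"}


@dataclass
class Workflow:
    """Hypothetical durable record; assumed to be persisted between steps."""
    workflow_id: str
    stage: str = STAGES[0]
    done: bool = False
    history: list = field(default_factory=list)


def advance(wf: Workflow, signal: Optional[str] = None) -> Workflow:
    """Run one short step. Waiting stages only move when a signal arrives."""
    if wf.done:
        return wf
    if wf.stage in WAITING_STAGES and signal != "approved":
        return wf  # still waiting; the invocation ends and nothing stays running
    wf.history.append(wf.stage)
    index = STAGES.index(wf.stage)
    if index + 1 < len(STAGES):
        wf.stage = STAGES[index + 1]
    else:
        wf.done = True
    return wf
```

Each call to `advance` is a short invocation. The days spent waiting for an approval exist only as the `stage` value in the stored record.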

What Makes a Process Long-Running

A workflow becomes long-running when it depends on things like:

  • human approval
  • partner system callbacks
  • scheduled release windows
  • document review or compliance checks
  • multi-day fulfillment or provisioning steps

Those are not exceptions. They are normal business realities. The architecture becomes healthier when they are modeled directly instead of hidden behind polling loops or oversized timeout settings.

Durable Timers Beat Sleeping Functions

One of the clearest anti-patterns in serverless is using a function to represent waiting. A function should not stay alive just to mean “come back tomorrow” or “wait for an external signal.” That design is fragile, expensive, and hard to recover after failure: platforms cap how long a function may run, most of them bill for the time it spends idle, and if the instance dies the wait is lost with it. Durable timers and workflow waits are stronger because they store intent rather than hold open compute.

    process:
      name: access-request
      stages:
        - validate-request
        - wait-for-manager-approval
        - wait-for-security-review
        - provision-access
        - notify-requester
      timers:
        approval_timeout_hours: 48
        escalation_after_hours: 24

External State and Handoffs

Long-running serverless processes often need:

  • a workflow record with current status
  • timestamps for last progress and expiration
  • correlation identifiers for callbacks or manual actions
  • event-driven handoff between stages

This is why long-running workflows are tightly related to the previous chapter on state externalization. If the current stage, previous stage, and next expected signal are not durable, the workflow is not really long-running safely. It is just long-running accidentally.
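
A workflow record with these fields might look like the following Python sketch. The `WorkflowRecord` shape, the in-memory `STORE` dictionary (a stand-in for a durable table), and the `handle_callback` function are hypothetical names chosen for illustration, not part of any particular platform.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass
class WorkflowRecord:
    workflow_id: str
    status: str                      # current stage
    expected_signal: Optional[str]   # next signal the workflow is waiting for
    correlation_id: str              # ties a callback or manual action to this record
    last_progress_at: datetime
    expires_at: datetime


# Stand-in for a durable table keyed by correlation id (an assumption).
STORE: Dict[str, WorkflowRecord] = {}


def handle_callback(
    correlation_id: str,
    signal: str,
    now: datetime,
    next_status: str,
    next_signal: Optional[str] = None,
) -> str:
    """Advance the matching workflow if the signal is the one it expects."""
    record = STORE.get(correlation_id)
    if record is None:
        return "unknown"
    if now >= record.expires_at:
        return "expired"
    if record.expected_signal != signal:
        return "ignored"  # duplicate or out-of-order delivery changes nothing
    record.status = next_status
    record.expected_signal = next_signal
    record.last_progress_at = now
    return "advanced"
```

Checking `expected_signal` before writing is what makes the handoff safe to retry: a callback delivered twice advances the workflow only once.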

Complexity Starts Rising Here

Serverless supports long-running processes well, but the design burden increases. Teams now need to think about:

  • workflow versioning when the process definition changes mid-flight
  • human approvals that never arrive
  • escalation rules
  • re-entry after partial failure
  • observability over days or weeks rather than seconds

This is the point where a serverless system starts to behave more like a business process platform than a simple API backend.
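
Of these concerns, versioning is the easiest to show in code. One common approach, sketched here under assumptions, is to pin each instance to the definition version it started with. The `DEFINITIONS` table is hypothetical: version 2 reuses the stage names from the access-request example, and version 1 is an invented older definition without the security review.

```python
# Hypothetical registry of process definitions, keyed by version number.
DEFINITIONS = {
    1: [
        "validate-request",
        "wait-for-manager-approval",
        "provision-access",
        "notify-requester",
    ],
    2: [
        "validate-request",
        "wait-for-manager-approval",
        "wait-for-security-review",
        "provision-access",
        "notify-requester",
    ],
}
LATEST_VERSION = max(DEFINITIONS)


def start_instance(instance_id):
    """New instances always start on the latest definition."""
    stages = DEFINITIONS[LATEST_VERSION]
    return {"id": instance_id, "version": LATEST_VERSION, "stage": stages[0]}


def next_stage(instance):
    """Resolve the next stage from the version the instance was pinned to."""
    stages = DEFINITIONS[instance["version"]]
    position = stages.index(instance["stage"])
    if position + 1 < len(stages):
        return stages[position + 1]
    return None  # the instance is on its final stage
```

Pinning keeps in-flight instances predictable, at the cost of running old definitions until they drain. Migrating instances to a new version mid-flight is a separate and harder decision that this sketch does not attempt.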

Common Mistakes

  • using long function timeouts or sleeps to represent waiting
  • failing to model expiration, escalation, or abandoned workflows
  • assuming a human-approval step is just another synchronous API call
  • changing workflow definitions without considering in-flight instances

Design Review Question

A provisioning process runs mostly through functions and queues, but one step waits for a manager to approve access, which may take two days. The team currently polls a table every few minutes from a scheduled function. What should be improved first?

The stronger answer is to make waiting an explicit workflow state with durable timers, approval events, and escalation rules instead of a polling loop pretending to be orchestration. The current design spends compute to represent time and obscures the actual business process.
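
As a rough sketch of that answer in Python, the waiting state can be driven entirely by events instead of a poll. The state names, event names, and `on_event` function below are assumptions for illustration. In practice the approval events would come from the approval UI or a callback, and the escalation and timeout events from durable timers.

```python
# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    ("waiting_for_approval", "approval_granted"): "provisioning",
    ("waiting_for_approval", "approval_denied"): "rejected",
    ("waiting_for_approval", "escalation_due"): "waiting_for_approval",
    ("waiting_for_approval", "approval_timeout"): "expired",
}


def on_event(request, event):
    """Return the updated request; stale or unexpected events change nothing."""
    key = (request["state"], event)
    if key not in TRANSITIONS:
        return request
    updated = dict(request, state=TRANSITIONS[key])
    if event == "escalation_due":
        updated["escalated"] = True  # e.g. notify the manager's delegate
    return updated
```

No code runs between events, and a timeout that fires after the approval has already arrived finds no matching transition and is dropped.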

Revised on Thursday, April 23, 2026