Explain retry behavior in synchronous and asynchronous serverless systems, and why idempotency is one of the most important design requirements for safe reprocessing.
Retries, backoff, and idempotency form the core reliability contract of most serverless systems. Retries happen because networks fail, dependencies time out, workers crash, and platforms redeliver messages. That means the real question is rarely “will this operation run twice?” It is “what happens when it does?”
Serverless makes retry behavior more visible because compute is short-lived and many execution paths are event-driven. A synchronous API may retry at the client, gateway, or SDK layer. An asynchronous workflow may retry at the queue, stream, workflow engine, or consumer layer. If the handler is not idempotent, those retries can create duplicate charges, repeated emails, or inconsistent state transitions.
```mermaid
flowchart LR
  A["Request or event"] --> B["Function"]
  B --> C{"Dependency succeeds?"}
  C -->|Yes| D["Persist success marker"]
  C -->|No| E["Retry with backoff"]
  E --> B
  B --> F["Idempotency store"]
```
What to notice:

- Retries form a loop back into the same function, so the function must tolerate being re-entered with the same input.
- The success marker and the idempotency store are what let a retried invocation recognize work that already happened.
- Backoff sits between a failure and the next attempt; it controls how aggressively the loop spins.
In synchronous serverless systems, retries are usually constrained by user-facing latency: the caller expects a response soon, so timeouts and retry counts should be conservative. In asynchronous systems, retries can be more patient because the caller is no longer blocked, but the system still needs a bounded retry policy and eventual quarantine, such as a dead-letter queue, for messages that will never succeed.
The anti-pattern is to treat both paths the same. A payment API should not blindly retry for minutes on the request path. A background queue worker should not give up after one transient error.
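The contrast can be sketched with one retry helper driven by two different policies. The `RetryPolicy` shape, the `withRetries` name, and the specific numbers are illustrative assumptions, not a platform API:

```typescript
interface RetryPolicy {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

// Conservative: the caller is blocked, so fail fast and return an error.
const syncPolicy: RetryPolicy = { maxAttempts: 2, baseDelayMs: 100, maxDelayMs: 500 };

// Patient: nobody is waiting, so spread attempts out over a longer window.
const asyncPolicy: RetryPolicy = { maxAttempts: 8, baseDelayMs: 1_000, maxDelayMs: 60_000 };

async function withRetries<T>(policy: RetryPolicy, op: () => Promise<T>): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt === policy.maxAttempts) break;
      // Exponential backoff, capped so waits stay bounded.
      const delayMs = Math.min(policy.baseDelayMs * 2 ** (attempt - 1), policy.maxDelayMs);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  // Out of attempts: surface the error (or, for a queue worker,
  // let the platform quarantine the message).
  throw lastError;
}
```

A payment API on the request path would use something like `syncPolicy`; a background queue worker would use `asyncPolicy` and route the final failure to a dead-letter queue.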
Idempotency means that reprocessing the same operation does not create a new business effect. That can be implemented through:

- an idempotency key (often a client-supplied request ID) with a stored result that is returned to duplicates,
- natural business keys that make a duplicate insert fail or no-op,
- conditional writes (put-if-absent or compare-and-set) so only one attempt wins,
- deduplication at the consumer, keyed on a message or event ID.
It is not the same as “this code happens to be safe most of the time.” The business must define the identity of the operation clearly enough that duplicates can be recognized.
```yaml
idempotency:
  key_source: requestId
  store: key-value
  ttl_hours: 24
  on_duplicate: return_previous_result

retry_policy:
  max_attempts: 5
  backoff: exponential
  jitter: full
```
```typescript
export async function createInvoice(command: {
  requestId: string;
  accountId: string;
  amount: number;
}) {
  // Duplicate request: return the previously stored result instead of
  // creating a second invoice.
  const existing = await idempotencyStore.get(command.requestId);
  if (existing) {
    return existing.result;
  }

  const invoice = await invoiceService.create(command.accountId, command.amount);

  // Record the outcome so later retries short-circuit. Note that this
  // get-then-put is not atomic: under concurrent retries, a production
  // store should claim the key with a conditional (put-if-absent) write
  // before the side effect, not after it.
  await idempotencyStore.put(command.requestId, {
    result: invoice,
    status: "complete",
  });

  return invoice;
}
```
What this demonstrates:

- A duplicate `requestId` returns the stored result instead of creating a second invoice.
- The business effect happens at most once per request identity on the happy path.
- There is still a race window between the read and the write; a production store closes it with a conditional write that claims the key before the side effect runs.
If every failing function retries immediately, failure amplifies itself. Backoff spaces retries over time. Jitter adds randomness so many workers do not retry in synchronized waves. This is especially important in serverless because autoscaling can produce many failing consumers at once.
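A minimal sketch of that combination, full jitter applied over a capped exponential ceiling; the function name and default values are assumptions for illustration:

```typescript
function backoffWithFullJitter(attempt: number, baseMs = 100, capMs = 20_000): number {
  // Deterministic exponential ceiling, capped so waits stay bounded.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: pick uniformly in [0, ceiling) so many failing workers
  // do not retry in synchronized waves.
  return Math.random() * ceiling;
}
```

Two workers failing at the same moment will almost never sleep for the same duration, which is exactly what breaks up the synchronized wave.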
A function charges a card and then writes an order record. When the write times out, the platform retries the function and sometimes charges the customer twice. What should be fixed first?
The stronger answer is not simply “reduce retries.” The main flaw is missing idempotency around the business effect. The operation needs a durable request identity and a duplicate-safe write path so a retried invocation cannot create another charge just because the later persistence step was ambiguous.