Explain retry behavior in synchronous and asynchronous serverless systems, and why idempotency is one of the most important design requirements for safe reprocessing.
Retries, backoff, and idempotency form the core reliability contract of most serverless systems. Retries happen because networks fail, dependencies time out, workers crash, and platforms redeliver messages. That means the real question is rarely “will this operation run twice?” It is “what happens when it does?”
Serverless makes retry behavior more visible because compute is short-lived and many execution paths are event-driven. A synchronous API may retry at the client, gateway, or SDK layer. An asynchronous workflow may retry at the queue, stream, workflow engine, or consumer layer. If the handler is not idempotent, those retries can create duplicate charges, repeated emails, or inconsistent state transitions.
```mermaid
flowchart LR
  A["Request or event"] --> B["Function"]
  B --> C{"Dependency succeeds?"}
  C -->|Yes| D["Persist success marker"]
  C -->|No| E["Retry with backoff"]
  E --> B
  B --> F["Idempotency store"]
```
What to notice:

- Retries form a loop back into the same function, so the function must tolerate being re-entered with the same input.
- The success marker and the idempotency store are what let a retried invocation recognize work that already happened.
- Backoff sits between a failure and the next attempt; it controls how aggressively the loop spins.
In synchronous serverless systems, retries are usually constrained by user-facing latency: the caller expects a response soon, so timeouts and retry counts should be conservative. In asynchronous systems, retries can be more patient because the caller is no longer blocked, but the system still needs a bounded retry policy and eventual quarantine, such as a dead-letter queue, for messages that will never succeed.
The anti-pattern is to treat both paths the same. A payment API should not blindly retry for minutes on the request path. A background queue worker should not give up after one transient error.
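The contrast can be sketched with one retry helper driven by two different policies. The `RetryPolicy` shape, the `withRetries` name, and the specific numbers are illustrative assumptions, not a platform API:

```typescript
interface RetryPolicy {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

// Conservative: the caller is blocked, so fail fast and return an error.
const syncPolicy: RetryPolicy = { maxAttempts: 2, baseDelayMs: 100, maxDelayMs: 500 };

// Patient: nobody is waiting, so spread attempts out over a longer window.
const asyncPolicy: RetryPolicy = { maxAttempts: 8, baseDelayMs: 1_000, maxDelayMs: 60_000 };

async function withRetries<T>(policy: RetryPolicy, op: () => Promise<T>): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt === policy.maxAttempts) break;
      // Exponential backoff, capped so waits stay bounded.
      const delayMs = Math.min(policy.baseDelayMs * 2 ** (attempt - 1), policy.maxDelayMs);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  // Out of attempts: surface the error (or, for a queue worker,
  // let the platform quarantine the message).
  throw lastError;
}
```

A payment API on the request path would use something like `syncPolicy`; a background queue worker would use `asyncPolicy` and route the final failure to a dead-letter queue.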
Idempotency means that reprocessing the same operation does not create a new business effect. That can be implemented through:

- an idempotency key (often a client-supplied request ID) with a stored result that is returned to duplicates,
- natural business keys that make a duplicate insert fail or no-op,
- conditional writes (put-if-absent or compare-and-set) so only one attempt wins,
- deduplication at the consumer, keyed on a message or event ID.
It is not the same as “this code happens to be safe most of the time.” The business must define the identity of the operation clearly enough that duplicates can be recognized.
```yaml
idempotency:
  key_source: requestId
  store: key-value
  ttl_hours: 24
  on_duplicate: return_previous_result

retry_policy:
  max_attempts: 5
  backoff: exponential
  jitter: full
```
```typescript
export async function createInvoice(command: {
  requestId: string;
  accountId: string;
  amount: number;
}) {
  // Duplicate request: return the previously stored result instead of
  // creating a second invoice.
  const existing = await idempotencyStore.get(command.requestId);
  if (existing) {
    return existing.result;
  }

  const invoice = await invoiceService.create(command.accountId, command.amount);

  // Record the outcome so later retries short-circuit. Note that this
  // get-then-put is not atomic: under concurrent retries, a production
  // store should claim the key with a conditional (put-if-absent) write
  // before the side effect, not after it.
  await idempotencyStore.put(command.requestId, {
    result: invoice,
    status: "complete",
  });

  return invoice;
}
```
What this demonstrates:

- A duplicate `requestId` returns the stored result instead of creating a second invoice.
- The business effect happens at most once per request identity on the happy path.
- There is still a race window between the read and the write; a production store closes it with a conditional write that claims the key before the side effect runs.
If every failing function retries immediately, failure amplifies itself. Backoff spaces retries over time. Jitter adds randomness so many workers do not retry in synchronized waves. This is especially important in serverless because autoscaling can produce many failing consumers at once.
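A minimal sketch of that combination, full jitter applied over a capped exponential ceiling; the function name and default values are assumptions for illustration:

```typescript
function backoffWithFullJitter(attempt: number, baseMs = 100, capMs = 20_000): number {
  // Deterministic exponential ceiling, capped so waits stay bounded.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: pick uniformly in [0, ceiling) so many failing workers
  // do not retry in synchronized waves.
  return Math.random() * ceiling;
}
```

Two workers failing at the same moment will almost never sleep for the same duration, which is exactly what breaks up the synchronized wave.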
A function charges a card and then writes an order record. When the write times out, the platform retries the function and sometimes charges the customer twice. What should be fixed first?
The stronger answer is not simply “reduce retries.” The main flaw is missing idempotency around the business effect. The operation needs a durable request identity and a duplicate-safe write path so a retried invocation cannot create another charge just because the later persistence step was ambiguous.