Handling Network Errors and Retries

How to handle unreliable networks in Clojure with timeout discipline, idempotent retries, backoff with jitter, and circuit-aware client design.

Handling network errors and retries is less about writing a loop and more about deciding which failures are safe to repeat, how long to wait, and when to stop making a bad situation worse. Good retry policy protects the caller, the downstream dependency, and the rest of the platform.

The most common mistake is to treat every failure as retriable. That creates duplicate writes, retry storms, and self-inflicted incidents. A stronger design starts by classifying failures and by making idempotency an explicit part of the protocol.

Classify Failures Before You Retry

Not every failure deserves another attempt.

Usually retriable:

  • connection resets
  • timeouts
  • DNS hiccups
  • HTTP 429
  • HTTP 502, 503, and 504

Usually not retriable without additional logic:

  • validation failures
  • authentication and authorization errors
  • malformed requests
  • domain rejections such as “insufficient balance”

The key question is not “did the request fail?” It is “is another attempt likely to help without causing incorrect side effects?”
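One way to make that classification concrete for transport-level failures is a small predicate over JVM exception types. The class names below are standard JDK exceptions; which ones you treat as retriable is a policy decision, not a universal rule:

```clojure
(ns acme.http.classify
  (:import [java.net ConnectException SocketTimeoutException UnknownHostException]))

;; Transport-level failures that are usually worth another attempt.
(def retriable-exception-classes
  #{ConnectException SocketTimeoutException UnknownHostException})

(defn retriable-exception? [^Throwable t]
  (boolean (some #(instance? % t) retriable-exception-classes)))

;; HTTP statuses that signal throttling or a transient server-side fault.
(defn retriable-status? [status]
  (contains? #{429 502 503 504} status))
```

Keeping the classification in plain data like this makes it easy to review and to test without a network.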

Idempotency Comes First

Retries are safest when the operation is idempotent. Reads usually are. Writes often are not unless you design them that way.

Typical approaches:

  • idempotency keys for create-style operations
  • request deduplication on the server
  • natural upsert semantics
  • workflow identifiers carried across retries

If you cannot explain why a repeated request is safe, do not casually add retries around it.
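For workflow-driven retries, a key derived deterministically from the workflow identifier guarantees that every attempt, even across process restarts, carries the same key. A sketch using a name-based UUID (the namespace name and workflow-id format here are assumptions):

```clojure
(ns acme.workflow.keys
  (:import [java.util UUID]
           [java.nio.charset StandardCharsets]))

(defn idempotency-key
  "Derives a stable key from a workflow id and step name, so every
  retry of the same logical step sends the same key."
  [workflow-id step]
  (str (UUID/nameUUIDFromBytes
        (.getBytes (str workflow-id "/" step) StandardCharsets/UTF_8))))
```

A random key generated once and stored with the workflow works just as well; the point is that the key is stable per logical operation, not per attempt.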

Timeouts Are Part of the Contract

A retry policy without a timeout policy is incomplete. You need at least:

  • a connect timeout
  • a request timeout
  • a total budget for all attempts combined

Without those limits, a dependency failure turns into thread starvation, queue buildup, and user-facing latency spikes.
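The total budget can be enforced with a deadline computed once, before the first attempt. This sketch (function names are illustrative) refuses to start another attempt once the deadline has passed:

```clojure
(ns acme.http.deadline)

(defn deadline
  "Absolute time (epoch millis) after which no further attempts may start."
  [budget-ms]
  (+ (System/currentTimeMillis) budget-ms))

(defn time-left-ms [d]
  (max 0 (- d (System/currentTimeMillis))))

(defn budget-exhausted? [d]
  (zero? (time-left-ms d)))
```

The per-attempt request timeout can then be capped at `(min request-timeout (time-left-ms d))` so a slow final attempt cannot overrun the overall budget.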

Use Capped Exponential Backoff with Jitter

Backoff reduces synchronized retry storms. Jitter keeps clients from retrying in lockstep.

The policy below stays library-agnostic so the retry behavior is easy to test:

(ns acme.http.resilience)

;; HTTP statuses that signal a transient condition worth retrying.
(def retryable-statuses #{429 502 503 504})

(defn retryable-result? [{:keys [status transport-error?]}]
  (or transport-error?
      (contains? retryable-statuses status)))

(defn delay-ms
  "Capped exponential backoff plus up to 250 ms of random jitter."
  [base-ms cap-ms attempt]
  (let [exp-delay (min cap-ms
                       (long (* base-ms (Math/pow 2 (dec attempt)))))
        jitter (rand-int 250)]
    (+ exp-delay jitter)))

(defn with-retries [request-fn {:keys [max-attempts base-ms cap-ms]
                                :or {max-attempts 3
                                     base-ms 200
                                     cap-ms 2000}}]
  (loop [attempt 1]
    (let [result (request-fn)]
      (if (or (not (retryable-result? result))
              (= attempt max-attempts))
        result
        (do
          (Thread/sleep (delay-ms base-ms cap-ms attempt))
          (recur (inc attempt)))))))

This pattern is intentionally small. In production, the policy usually also needs:

  • request deadlines
  • structured logging
  • metrics per dependency
  • retry-budget tracking
  • protection against retrying very slow, large, or non-idempotent operations
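Because `with-retries` takes a plain function and returns plain data, the policy can be exercised without a network. The sketch below repeats minimal copies of the policy functions (with backoff collapsed to a fixed sleep) so the snippet runs on its own, and drives them with a stub that fails twice before succeeding:

```clojure
(ns acme.http.resilience-demo)

;; Minimal copies of the policy from acme.http.resilience, repeated here
;; so this snippet is self-contained.
(def retryable-statuses #{429 502 503 504})

(defn retryable-result? [{:keys [status transport-error?]}]
  (or transport-error? (contains? retryable-statuses status)))

(defn with-retries [request-fn {:keys [max-attempts base-ms]
                                :or {max-attempts 3 base-ms 1}}]
  (loop [attempt 1]
    (let [result (request-fn)]
      (if (or (not (retryable-result? result)) (= attempt max-attempts))
        result
        (do (Thread/sleep base-ms)
            (recur (inc attempt)))))))

;; A stub dependency that returns 503 twice, then succeeds.
(def calls (atom 0))

(defn flaky-request []
  (if (< (swap! calls inc) 3)
    {:status 503}
    {:status 200 :body "ok"}))

(with-retries flaky-request {:max-attempts 5 :base-ms 1})
;; => {:status 200, :body "ok"} on the third attempt
```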

Add Idempotency Keys to Retryable Writes

When a write is safe to retry, make that safety explicit:

(ns acme.orders.client
  (:import [java.util UUID]))

(defn create-order-request [order]
  ;; Build this request once per logical order and reuse the map across
  ;; retries, so every attempt carries the same idempotency key.
  {:method :post
   :uri "https://orders.internal.example/orders"
   :headers {"content-type" "application/json"
             "idempotency-key" (str (UUID/randomUUID))}
   :body order})

That does not magically make the workflow safe. The downstream service still has to honor the key and deduplicate repeated attempts. But the caller is now participating in a real retry protocol instead of merely hoping duplicate POSTs will be harmless.
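A server-side sketch of that deduplication, using an in-memory atom as a stand-in for durable storage (a real service would persist the key-to-response mapping and share it across instances):

```clojure
(ns acme.orders.dedup)

;; Maps idempotency key -> the response produced by the first attempt.
(defonce seen (atom {}))

(defn handle-once
  "Runs handler for a new idempotency key; replays the stored
  response for a repeated key."
  [idempotency-key handler request]
  (if-let [cached (get @seen idempotency-key)]
    cached
    (let [response (handler request)]
      (swap! seen assoc idempotency-key response)
      response)))
```

Note that this check-then-act sequence is not atomic: two concurrent duplicates can both run the handler. A production service closes that gap with a database uniqueness constraint or a compare-and-set on the stored key.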

Circuit Breaking and Load Shedding

Retries are helpful only while the dependency still has a chance to recover. When failure is sustained, callers should stop amplifying it.

That is where circuit breaking and load shedding matter:

  • Circuit breaking stops repeatedly calling a dependency that is currently failing.
  • Load shedding rejects or de-prioritizes lower-value work before the caller collapses.
  • Bulkheads isolate one dependency’s failure from unrelated work.

The important idea is not a specific library. It is recognizing when “retry harder” becomes the wrong move.
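To make the shape of the idea concrete, here is a minimal count-based breaker sketched with an atom. Thresholds and names are illustrative; production breakers also track time windows and half-open probe requests:

```clojure
(ns acme.http.breaker)

(defn make-breaker [failure-threshold]
  (atom {:failures 0 :threshold failure-threshold :open? false}))

(defn record! [breaker success?]
  (swap! breaker
         (fn [{:keys [threshold] :as state}]
           (if success?
             (assoc state :failures 0 :open? false)
             (let [failures (inc (:failures state))]
               (assoc state
                      :failures failures
                      :open? (>= failures threshold)))))))

(defn open? [breaker]
  (:open? @breaker))

(defn call [breaker request-fn]
  ;; Fail fast while the circuit is open instead of hammering the dependency.
  (if (open? breaker)
    {:status nil :circuit-open? true}
    (let [result (request-fn)
          ok? (and (:status result) (< (:status result) 500))]
      (record! breaker ok?)
      result)))
```

The caller-facing contract matters more than the internals: once the circuit opens, attempts fail immediately with a distinguishable result instead of consuming a retry budget.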

Observability Must Follow the Policy

If you cannot answer these questions, your retry policy is under-instrumented:

  • Which dependency is failing?
  • How many attempts were made?
  • How long did the caller wait overall?
  • Which status codes or transport errors triggered retries?
  • Did the circuit open?
  • Which user or workflow paths were affected?

A retry policy without metrics and structured logs is hard to tune and hard to defend in an incident review.
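One lightweight way to make those questions answerable is to wrap the request function so every attempt records its outcome per dependency. The atom here is a stand-in for a real metrics or logging client:

```clojure
(ns acme.http.telemetry)

(defn instrumented
  "Wraps request-fn so each attempt appends its outcome to metrics,
  an atom mapping dependency -> vector of attempt records."
  [request-fn metrics dependency]
  (fn []
    (let [start (System/nanoTime)
          result (request-fn)
          elapsed-ms (quot (- (System/nanoTime) start) 1000000)]
      (swap! metrics update dependency
             (fnil conj [])
             {:status (:status result)
              :transport-error? (boolean (:transport-error? result))
              :elapsed-ms elapsed-ms})
      result)))
```

Because the wrapper has the same zero-argument shape as the raw request function, it slots directly into a retry loop like `with-retries` without changing the policy code.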

A Good Operational Flow

    flowchart TD
        A["Call Dependency"] --> B{"Success?"}
        B -- Yes --> C["Return Response"]
        B -- No --> D{"Retryable and Safe?"}
        D -- No --> E["Fail Fast with Context"]
        D -- Yes --> F{"Retry Budget Left?"}
        F -- No --> E
        F -- Yes --> G["Backoff with Jitter"]
        G --> H{"Circuit Open?"}
        H -- Yes --> E
        H -- No --> A

The key thing to notice is that retry is only one branch. Safety, budget, and circuit state all come first.

Key Takeaways

  • Retry only the failures that are both likely to succeed on replay and safe to replay.
  • Put timeout policy beside retry policy, not after it.
  • Use capped exponential backoff with jitter to avoid retry storms.
  • Make retryable writes explicitly idempotent, usually with workflow IDs or idempotency keys.
  • Instrument retries, circuits, and dependency failure modes so the policy can be tuned under real load.

Revised on Thursday, April 23, 2026