A practical lesson on isolating event-handling faults, limiting blast radius, and designing for cases where some work succeeds while other work must retry or escalate.
Failure isolation is the difference between one bad dependency causing a manageable incident and one bad dependency turning into a platform-wide backlog. Event-driven systems often look decoupled at the topology level, yet still fail in a coupled way because one handler performs too many actions, shares too much execution state, or has no clear boundary between essential and optional work.
The key question is not whether failure will happen. It will. The key question is whether one failed part of the handling path can be isolated without losing all useful progress. This is where partial success becomes a design topic. In real systems, some effects may complete while others should be retried, quarantined, or handed off to another path.
```mermaid
flowchart TD
  A["Event received"] --> B["Core business state update"]
  B --> C{"Optional downstream actions"}
  C -->|All succeed| D["Acknowledge complete"]
  C -->|Some fail| E["Persist success state and isolate failed follow-up"]
  E --> F["Retry or compensating path"]
```
What to notice:

- the core business state is committed before any optional downstream actions run
- when some follow-ups fail, the progress already made is persisted rather than discarded
- the failed follow-up is routed to its own retry or compensating path instead of failing the whole event
A single consumer that writes core business state, sends two external notifications, updates analytics, and calls a partner API may look efficient on a whiteboard. In production it becomes difficult to answer basic questions: Which effects actually completed? Is it safe to retry the whole handler, or will retrying duplicate work that already succeeded? Should a partner outage block the core business write?
The more unrelated work one handler owns, the harder partial-success reasoning becomes.
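To make that concrete, here is the shape of a handler that owns too much. The services are hypothetical in-memory stand-ins that just count calls; the point is what a naive retry-from-start does to the already-committed core write.

```typescript
// Anti-pattern sketch: one handler owning the core write plus several
// unrelated secondary effects. All service names are hypothetical.
const calls = { core: 0, email: 0, analytics: 0, partner: 0 };

const coreDb = { async save(_data: unknown) { calls.core += 1; } };
const emailService = { async sendConfirmation(_orderId: string) { calls.email += 1; } };
const analytics = { async track(_name: string, _data: unknown) { calls.analytics += 1; } };
const partnerApi = {
  // Simulates a flaky partner dependency that always times out.
  async call(_data: unknown): Promise<void> { calls.partner += 1; throw new Error("partner timeout"); },
};

async function handleOrderPlacedMonolith(event: { data: { orderId: string } }) {
  await coreDb.save(event.data);                            // core business state
  await emailService.sendConfirmation(event.data.orderId);  // secondary
  await analytics.track("order_placed", event.data);        // secondary
  await partnerApi.call(event.data);                        // secondary, flaky
}
```

If the partner call fails, retrying the whole handler repeats every earlier effect, including the core write, which is exactly the duplicate-work risk discussed above.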
One of the strongest reliability moves is to separate the effect that defines successful business handling from secondary follow-up work. For example, an order-confirmation flow may define “order stored durably” as the core completion point, while analytics publication and customer-email dispatch become downstream effects driven by later events or separate handlers.
That separation reduces blast radius. A reporting outage should not necessarily block order capture. A partner webhook failure should not always force the system to pretend the original business event never happened.
```typescript
async function handleOrderPlaced(event: OrderPlacedEvent) {
  // Commit the core business state first; this is the completion point.
  await orderRepository.save(event.data);
  // Emit a follow-up event instead of performing secondary work inline.
  await outbox.publish({
    eventName: "order.accepted",
    data: { orderId: event.data.orderId }
  });
}
```
This example is deliberately small. The point is architectural: the handler commits core state and emits a follow-up event rather than synchronously owning every consequence itself.
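Under that model, each secondary effect can live in its own subscriber with its own failure boundary. A minimal in-memory sketch, assuming a hypothetical `subscribe`/`dispatch` API (not part of the original example):

```typescript
// Each subscriber owns one secondary effect of "order.accepted", so a
// failure in analytics cannot block the email path, and neither can
// undo the already-committed order.
type Handler = (data: { orderId: string }) => Promise<void>;

const subscribers = new Map<string, Handler[]>();

function subscribe(eventName: string, handler: Handler) {
  const list = subscribers.get(eventName) ?? [];
  list.push(handler);
  subscribers.set(eventName, list);
}

// Dispatch isolates each handler: one rejection is recorded as a
// per-handler outcome instead of aborting the rest.
async function dispatch(eventName: string, data: { orderId: string }) {
  const results: Array<{ handler: number; ok: boolean }> = [];
  const handlers = subscribers.get(eventName) ?? [];
  for (let i = 0; i < handlers.length; i++) {
    try {
      await handlers[i](data);
      results.push({ handler: i, ok: true });
    } catch {
      results.push({ handler: i, ok: false });
    }
  }
  return results;
}
```

The per-handler results are exactly the kind of durable evidence the next section argues for.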
Partial success is only safe when it is explicit. If one part succeeds and another fails, the system needs durable evidence of what happened. Otherwise operators and replay logic cannot tell whether the event should be retried fully, partially, or not at all.
This often means recording per-step status, outbox entries, compensating tasks, or a workflow state record.
```yaml
processingResult:
  eventId: evt_7a2
  coreStateUpdate: succeeded
  partnerWebhook: failed
  analyticsPublication: skipped
  nextAction: retry_partner_webhook
```
A record like this turns an ambiguous failure into an operable state.
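A sketch of how replay logic might consume such a record, assuming the illustrative step names above; it retries only the failed steps and never the ones that already succeeded:

```typescript
// Turn a per-step status record into an operational decision.
// Step names and statuses mirror the illustrative record above.
type StepStatus = "succeeded" | "failed" | "skipped";

interface ProcessingResult {
  eventId: string;
  steps: Record<string, StepStatus>;
}

// Returns "done" when nothing is pending, otherwise names the failed
// steps to retry, leaving succeeded work untouched.
function nextAction(result: ProcessingResult): string {
  const failed = Object.entries(result.steps)
    .filter(([, status]) => status === "failed")
    .map(([step]) => step);
  if (failed.length === 0) return "done";
  return `retry:${failed.join(",")}`;
}
```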
Useful isolation techniques include:

- splitting one broad handler into narrower handlers, each owning a single effect
- a transactional outbox, so follow-up events are published only when core state commits
- per-step status records that let retries target only the failed work
- compensating tasks for effects that cannot simply be retried
- quarantining repeatedly failing follow-ups (for example, via a dead-letter queue) so they stop consuming retry capacity
The right technique depends on topology, but the goal is consistent: one failing path should not drag every other path into the same failure mode.
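One of those techniques sketched concretely: run an optional effect behind a boundary that converts its failure into a durable retry task instead of failing the whole handler. The in-memory queue here is a hypothetical stand-in for a real dead-letter or retry store.

```typescript
// A failed optional step is quarantined as a retry task; the core path
// keeps its progress and the handler keeps running.
type RetryTask = { step: string; payload: unknown };

const retryQueue: RetryTask[] = [];

async function runIsolated(
  step: string,
  payload: unknown,
  effect: () => Promise<void>
): Promise<boolean> {
  try {
    await effect();
    return true;
  } catch {
    // Persist enough context to retry this step alone later.
    retryQueue.push({ step, payload });
    return false;
  }
}
```

A handler would wrap each optional follow-up in `runIsolated` and acknowledge the event even when some of them land in the retry queue.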
A consumer writes the core booking record successfully, then fails while notifying an external loyalty partner. The team proposes retrying the entire handler from the start until everything succeeds. Why is that a weak default?
It is weak because it treats already-completed core work and failed secondary work as one inseparable unit. That raises duplicate-write risk, complicates observability, and increases blast radius. The stronger design would record that the booking succeeded, isolate the loyalty notification as a retryable follow-up, and avoid repeating the core business effect unnecessarily.
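That stronger design can be sketched with step-level checkpointing, so a retry of the handler skips steps already marked complete. All names here are hypothetical illustrations of the booking scenario.

```typescript
// Checkpoint each step after it succeeds; a retried handler re-runs
// only the steps that never completed.
const completedSteps = new Set<string>();
let bookingWrites = 0;

async function runOnce(step: string, effect: () => Promise<void>) {
  if (completedSteps.has(step)) return; // already done; never repeat
  await effect();
  completedSteps.add(step); // checkpoint only after success
}

async function handleBookingCreated(notifyLoyalty: () => Promise<void>) {
  await runOnce("booking:write", async () => {
    bookingWrites += 1; // core business effect
  });
  await runOnce("loyalty:notify", notifyLoyalty); // retryable follow-up
}
```

On the first attempt the loyalty call fails, but the booking checkpoint survives; the retry repeats only the notification, not the core write.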