A practical lesson on isolating event-handling faults, limiting blast radius, and designing for cases where some work succeeds while other work must retry or escalate.
Failure isolation is the difference between one bad dependency causing a manageable incident and one bad dependency turning into a platform-wide backlog. Event-driven systems often look decoupled at the topology level, yet still fail in a coupled way because one handler performs too many actions, shares too much execution state, or has no clear boundary between essential and optional work.
The key question is not whether failure will happen. It will. The key question is whether one failed part of the handling path can be isolated without losing all useful progress. This is where partial success becomes a design topic. In real systems, some effects may complete while others should be retried, quarantined, or handed off to another path.
```mermaid
flowchart TD
  A["Event received"] --> B["Core business state update"]
  B --> C{"Optional downstream actions"}
  C -->|All succeed| D["Acknowledge complete"]
  C -->|Some fail| E["Persist success state and isolate failed follow-up"]
  E --> F["Retry or compensating path"]
```
What to notice:

- the core business state is committed before any optional downstream actions run
- when some follow-ups fail, the progress already made is persisted rather than discarded
- the failed follow-up is routed to its own retry or compensating path instead of failing the whole event
A single consumer that writes core business state, sends two external notifications, updates analytics, and calls a partner API may look efficient on a whiteboard. In production it becomes difficult to answer basic questions: Which effects actually completed? Is it safe to retry the whole handler, or will retrying duplicate work that already succeeded? Should a partner outage block the core business write?
The more unrelated work one handler owns, the harder partial-success reasoning becomes.
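To make that concrete, here is the shape of a handler that owns too much. The services are hypothetical in-memory stand-ins that just count calls; the point is what a naive retry-from-start does to the already-committed core write.

```typescript
// Anti-pattern sketch: one handler owning the core write plus several
// unrelated secondary effects. All service names are hypothetical.
const calls = { core: 0, email: 0, analytics: 0, partner: 0 };

const coreDb = { async save(_data: unknown) { calls.core += 1; } };
const emailService = { async sendConfirmation(_orderId: string) { calls.email += 1; } };
const analytics = { async track(_name: string, _data: unknown) { calls.analytics += 1; } };
const partnerApi = {
  // Simulates a flaky partner dependency that always times out.
  async call(_data: unknown): Promise<void> { calls.partner += 1; throw new Error("partner timeout"); },
};

async function handleOrderPlacedMonolith(event: { data: { orderId: string } }) {
  await coreDb.save(event.data);                            // core business state
  await emailService.sendConfirmation(event.data.orderId);  // secondary
  await analytics.track("order_placed", event.data);        // secondary
  await partnerApi.call(event.data);                        // secondary, flaky
}
```

If the partner call fails, retrying the whole handler repeats every earlier effect, including the core write, which is exactly the duplicate-work risk discussed above.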
One of the strongest reliability moves is to separate the effect that defines successful business handling from secondary follow-up work. For example, an order-confirmation flow may define “order stored durably” as the core completion point, while analytics publication and customer-email dispatch become downstream effects driven by later events or separate handlers.
That separation reduces blast radius. A reporting outage should not necessarily block order capture. A partner webhook failure should not always force the system to pretend the original business event never happened.
```typescript
async function handleOrderPlaced(event: OrderPlacedEvent) {
  // Commit the core business state first; this is the completion point.
  await orderRepository.save(event.data);
  // Emit a follow-up event instead of performing secondary work inline.
  await outbox.publish({
    eventName: "order.accepted",
    data: { orderId: event.data.orderId }
  });
}
```
This example is deliberately small. The point is architectural: the handler commits core state and emits a follow-up event rather than synchronously owning every consequence itself.
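Under that model, each secondary effect can live in its own subscriber with its own failure boundary. A minimal in-memory sketch, assuming a hypothetical `subscribe`/`dispatch` API (not part of the original example):

```typescript
// Each subscriber owns one secondary effect of "order.accepted", so a
// failure in analytics cannot block the email path, and neither can
// undo the already-committed order.
type Handler = (data: { orderId: string }) => Promise<void>;

const subscribers = new Map<string, Handler[]>();

function subscribe(eventName: string, handler: Handler) {
  const list = subscribers.get(eventName) ?? [];
  list.push(handler);
  subscribers.set(eventName, list);
}

// Dispatch isolates each handler: one rejection is recorded as a
// per-handler outcome instead of aborting the rest.
async function dispatch(eventName: string, data: { orderId: string }) {
  const results: Array<{ handler: number; ok: boolean }> = [];
  const handlers = subscribers.get(eventName) ?? [];
  for (let i = 0; i < handlers.length; i++) {
    try {
      await handlers[i](data);
      results.push({ handler: i, ok: true });
    } catch {
      results.push({ handler: i, ok: false });
    }
  }
  return results;
}
```

The per-handler results are exactly the kind of durable evidence the next section argues for.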
Partial success is only safe when it is explicit. If one part succeeds and another fails, the system needs durable evidence of what happened. Otherwise operators and replay logic cannot tell whether the event should be retried fully, partially, or not at all.
This often means recording per-step status, outbox entries, compensating tasks, or a workflow state record.
```yaml
processingResult:
  eventId: evt_7a2
  coreStateUpdate: succeeded
  partnerWebhook: failed
  analyticsPublication: skipped
  nextAction: retry_partner_webhook
```
A record like this turns an ambiguous failure into an operable state.
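A sketch of how replay logic might consume such a record, assuming the illustrative step names above; it retries only the failed steps and never the ones that already succeeded:

```typescript
// Turn a per-step status record into an operational decision.
// Step names and statuses mirror the illustrative record above.
type StepStatus = "succeeded" | "failed" | "skipped";

interface ProcessingResult {
  eventId: string;
  steps: Record<string, StepStatus>;
}

// Returns "done" when nothing is pending, otherwise names the failed
// steps to retry, leaving succeeded work untouched.
function nextAction(result: ProcessingResult): string {
  const failed = Object.entries(result.steps)
    .filter(([, status]) => status === "failed")
    .map(([step]) => step);
  if (failed.length === 0) return "done";
  return `retry:${failed.join(",")}`;
}
```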
Useful isolation techniques include:

- splitting one broad handler into narrower handlers, each owning a single effect
- a transactional outbox, so follow-up events are published only when core state commits
- per-step status records that let retries target only the failed work
- compensating tasks for effects that cannot simply be retried
- quarantining repeatedly failing follow-ups (for example, via a dead-letter queue) so they stop consuming retry capacity
The right technique depends on topology, but the goal is consistent: one failing path should not drag every other path into the same failure mode.
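One of those techniques sketched concretely: run an optional effect behind a boundary that converts its failure into a durable retry task instead of failing the whole handler. The in-memory queue here is a hypothetical stand-in for a real dead-letter or retry store.

```typescript
// A failed optional step is quarantined as a retry task; the core path
// keeps its progress and the handler keeps running.
type RetryTask = { step: string; payload: unknown };

const retryQueue: RetryTask[] = [];

async function runIsolated(
  step: string,
  payload: unknown,
  effect: () => Promise<void>
): Promise<boolean> {
  try {
    await effect();
    return true;
  } catch {
    // Persist enough context to retry this step alone later.
    retryQueue.push({ step, payload });
    return false;
  }
}
```

A handler would wrap each optional follow-up in `runIsolated` and acknowledge the event even when some of them land in the retry queue.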
A consumer writes the core booking record successfully, then fails while notifying an external loyalty partner. The team proposes retrying the entire handler from the start until everything succeeds. Why is that a weak default?
It is weak because it treats already-completed core work and failed secondary work as one inseparable unit. That raises duplicate-write risk, complicates observability, and increases blast radius. The stronger design would record that the booking succeeded, isolate the loyalty notification as a retryable follow-up, and avoid repeating the core business effect unnecessarily.
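That stronger design can be sketched with step-level checkpointing, so a retry of the handler skips steps already marked complete. All names here are hypothetical illustrations of the booking scenario.

```typescript
// Checkpoint each step after it succeeds; a retried handler re-runs
// only the steps that never completed.
const completedSteps = new Set<string>();
let bookingWrites = 0;

async function runOnce(step: string, effect: () => Promise<void>) {
  if (completedSteps.has(step)) return; // already done; never repeat
  await effect();
  completedSteps.add(step); // checkpoint only after success
}

async function handleBookingCreated(notifyLoyalty: () => Promise<void>) {
  await runOnce("booking:write", async () => {
    bookingWrites += 1; // core business effect
  });
  await runOnce("loyalty:notify", notifyLoyalty); // retryable follow-up
}
```

On the first attempt the loyalty call fails, but the booking checkpoint survives; the retry repeats only the notification, not the core write.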