Aggregation and Scatter-Gather

A practical lesson on fan-out and result collection patterns, including completion policies, partial results, timeouts, and the trade-offs in composite responses.

Aggregation and scatter-gather are integration patterns for situations where one input triggers parallel work across several downstream systems and the platform later combines the results. This is useful when no single service owns the full answer. Search, quote comparison, recommendation blending, composite dashboards, and certain workflow-status views often need this structure.

The architectural value is parallelism and composition. The cost is coordination. Once one request fans out, the system must decide when enough answers have arrived, what happens when some are late or missing, and whether partial answers are acceptable.

    flowchart LR
        A["Incoming request or event"] --> B["Aggregator"]
        B --> C["Service A"]
        B --> D["Service B"]
        B --> E["Service C"]
        C --> F["Collect results"]
        D --> F
        E --> F
        F --> G["Return full or partial aggregate"]

What to notice:

  • the pattern is not only fan-out; it also requires a completion rule
  • every downstream call can fail or time out independently
  • user expectations depend on whether partial results are acceptable

Scatter-Gather Needs a Completion Policy

The hardest question is often not “which services should receive the request?” It is “when do we stop waiting?” Common policies include:

  • wait for all expected replies
  • wait for a quorum
  • wait until a time budget expires
  • return partial results with completeness metadata

Different use cases justify different answers. A compliance workflow may require all responses. A shopping comparison page may prefer partial results over a slow blank page.
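The time-budget policy can be sketched with Python's asyncio: fan out to every source, stop waiting once the budget expires, and attach completeness metadata for the caller. The source names and latencies below are illustrative assumptions, not part of the lesson.

```python
import asyncio

# Hypothetical downstream call; names and delays are illustrative.
async def call_source(name: str, delay: float):
    await asyncio.sleep(delay)
    return name, f"result from {name}"

async def scatter_gather(time_budget: float) -> dict:
    """Fan out to all sources, then stop waiting when the time budget expires."""
    tasks = [
        asyncio.create_task(call_source(name, delay))
        for name, delay in [("pricing", 0.01), ("inventory", 0.01), ("promotions", 5.0)]
    ]
    done, pending = await asyncio.wait(tasks, timeout=time_budget)
    for task in pending:            # late sources are cancelled, not awaited
        task.cancel()
    return {
        "results": dict(task.result() for task in done),
        "complete": not pending,    # completeness signal for the caller
    }

print(asyncio.run(scatter_gather(time_budget=0.1)))
```

With this policy the slow "promotions" source is simply dropped and the response is marked incomplete, rather than stalling the whole aggregate.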

Aggregation Is a State Problem

The aggregator usually needs to track:

  • which downstream calls were sent
  • which replies have arrived
  • which have failed or timed out
  • whether the response is complete enough to publish

    {
      "correlationId": "search_441",
      "expectedReplies": 3,
      "receivedReplies": 2,
      "timedOutSources": ["inventory"],
      "status": "partial"
    }

This is why aggregation is not just “send three requests in parallel.” The system needs a durable or at least explicit model of collection state.
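That collection state can be modeled explicitly. A minimal sketch, with fields mirroring the JSON shape above (the class and method names are assumptions for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class CollectionState:
    """Tracks one scatter-gather round for a single correlation id."""
    correlation_id: str
    expected: set
    received: dict = field(default_factory=dict)
    timed_out: set = field(default_factory=set)

    def record_reply(self, source, payload):
        self.received[source] = payload

    def record_timeout(self, source):
        self.timed_out.add(source)

    @property
    def status(self) -> str:
        if set(self.received) == self.expected:
            return "complete"
        # every source has resolved one way or another, but some timed out
        if self.received and set(self.received) | self.timed_out == self.expected:
            return "partial"
        return "pending"

state = CollectionState("search_441", expected={"pricing", "inventory", "promotions"})
state.record_reply("pricing", {"total": 42})
state.record_reply("promotions", {"discount": 5})
state.record_timeout("inventory")
print(state.status)  # partial
```

Whether this state lives in memory, a cache, or a durable store depends on how long a round may run and what a restart is allowed to lose.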

When Partial Success Is the Right Answer

A strong scatter-gather design treats partial success as a legitimate outcome rather than a failure. If two sources respond and one times out, the best experience for the user or downstream consumer may be:

  • return partial data plus a completeness signal
  • continue background collection for later refresh
  • degrade gracefully by source importance

What matters is that the completion rule is intentional. Systems become brittle when they implicitly wait for perfection even when the business would prefer a good-enough answer quickly.

    aggregationPolicy:
      correlationKey: quoteRequestId
      expectedSources:
        - pricing
        - inventory
        - promotions
      completion:
        mode: partial-after-time-budget
        timeoutMs: 1500
      responseShape:
        includeCompleteness: true
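The "degrade gracefully by source importance" idea can be sketched as a required-versus-optional completion check. The particular source split below is an illustrative assumption, not part of the lesson:

```python
# Illustrative split: required sources block publishing, optional ones do not.
REQUIRED = {"pricing", "inventory"}
OPTIONAL = {"promotions"}

def can_publish(received: set) -> bool:
    """The aggregate may be published once every required source has replied."""
    return REQUIRED <= received

def completeness(received: set) -> str:
    """Completeness signal to attach to the response."""
    if (REQUIRED | OPTIONAL) <= received:
        return "complete"
    return "partial" if can_publish(received) else "pending"

print(completeness({"pricing", "inventory"}))  # partial
print(completeness({"pricing"}))               # pending
```

The point is that "good enough to publish" is an explicit predicate the team can review, not an accident of which timeout fired first.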

Where the Pattern Can Go Wrong

Scatter-gather becomes a smell when:

  • too many downstream services are needed for one simple answer
  • the aggregator turns into a central bottleneck
  • timeouts are undefined or inconsistent
  • partial results are possible but never modeled explicitly
  • the pattern hides a domain that should have its own projection or composed read model

Sometimes repeated aggregation at request time is the wrong answer. A prebuilt projection or cached composite view may be simpler and more reliable.
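A cached composite view can be sketched as a small TTL cache in front of the expensive aggregation. The cache class below is an assumption used for illustration:

```python
import time

class CompositeViewCache:
    """Serves a prebuilt aggregate until it expires instead of re-running fan-out."""
    def __init__(self, ttl_seconds: float, rebuild):
        self.ttl = ttl_seconds
        self.rebuild = rebuild              # callable running the expensive aggregation
        self._value = None
        self._built_at = float("-inf")      # force a rebuild on the first read

    def get(self):
        if time.monotonic() - self._built_at > self.ttl:
            self._value = self.rebuild()
            self._built_at = time.monotonic()
        return self._value

calls = []
def expensive_aggregation():
    calls.append(1)                         # count how often the fan-out actually runs
    return {"view": "composite"}

cache = CompositeViewCache(ttl_seconds=60, rebuild=expensive_aggregation)
cache.get()
cache.get()
print(len(calls))  # 1 -- the second read was served from the cache
```

An event-driven projection that is updated when sources change goes one step further and removes request-time fan-out entirely.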

Common Mistakes

  • fan-out without a clear completion policy
  • assuming all downstream replies are equally critical
  • waiting indefinitely instead of defining timeout and partial-result behavior
  • rebuilding the same expensive aggregation on every request rather than materializing a view
  • failing to expose completeness state to callers or operators

Design Review Question

A product search endpoint fans out to six services and waits for every reply before responding. One low-priority recommendation service often lags and causes the whole page to stall. What should you challenge first?

Challenge the completion policy, not just the lagging service. If the recommendation source is optional, the stronger design may return partial results within a time budget and surface completeness explicitly instead of making one low-priority dependency block the whole aggregate.

Revised on Thursday, April 23, 2026