Describe latency metrics, error-rate monitoring, cold-start visibility, trigger lag, downstream dependency health, and distributed tracing across asynchronous workflows.
Metrics and tracing make serverless behavior visible at scale. Logs explain individual events. Metrics explain shape over time. Traces explain how one request or event moved through several components. In a serverless system, all three are needed because failures often emerge from interactions between many small functions, managed triggers, and downstream dependencies.
The goal is not to instrument everything equally. The goal is to instrument the paths that reveal user impact, dependency health, retry behavior, and backlog growth early enough to act.
```mermaid
flowchart LR
A["API or event source"] --> B["Function A"]
B --> C["Queue or workflow"]
C --> D["Function B"]
D --> E["Dependency"]
B -.metrics.-> F["Metrics backend"]
D -.trace spans.-> G["Tracing backend"]
E -.health signals.-> F
```
What to notice:

- Metrics and trace spans flow to different backends, yet they describe the same path; operators need both to tell a slow function from a slow dependency.
- The queue between Function A and Function B is where trace context is most easily lost.
- The dependency reports its own health signals, so dashboards can separate "the function is failing" from "the thing it calls is failing."
Strong serverless metrics usually include:

- Latency percentiles (p50, p95, p99) per function and per user-facing path
- Error rate, split by kind: exceptions, timeouts, and throttles
- Cold-start frequency and the latency it adds
- Trigger lag: oldest queue message age, stream iterator age, and event delivery delay
- Downstream dependency latency and error rate
- Retry counts and dead-letter volume
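Several of these signals can come from one shared wrapper around each handler. The sketch below is illustrative, not a real SDK: the `record` buffer stands in for whatever metrics pipeline the platform provides, and the names are assumptions.

```typescript
type MetricPoint = { name: string; value: number; unit: string };

// In-memory buffer standing in for a real metrics pipeline
// (for example, an embedded-metrics log line or an exporter).
const buffer: MetricPoint[] = [];

function record(name: string, value: number, unit: string): void {
  buffer.push({ name, value, unit });
}

// Module scope survives across warm invocations, so this flag
// distinguishes the first (cold) invocation from later ones.
let coldStart = true;

async function instrumented<T>(name: string, handler: () => Promise<T>): Promise<T> {
  record(`${name}.cold_start`, coldStart ? 1 : 0, "count");
  coldStart = false;
  const started = Date.now();
  try {
    const result = await handler();
    record(`${name}.errors`, 0, "count");
    return result;
  } catch (err) {
    // Count the failure, then rethrow so the platform still sees it.
    record(`${name}.errors`, 1, "count");
    throw err;
  } finally {
    record(`${name}.latency`, Date.now() - started, "ms");
  }
}
```

Because the `finally` block runs on both success and failure, latency is recorded even for invocations that throw.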
A common mistake is to monitor only function errors. A system can have near-zero function exceptions and still be unhealthy because of lag, throttling, timeout growth, or downstream slowness.
```yaml
alerts:
  - name: payment-api-p95-latency
    metric: p95_latency_ms
    threshold: 1200
  - name: invoice-queue-lag
    metric: queue_oldest_message_age_seconds
    threshold: 300
  - name: partner-api-error-rate
    metric: dependency_error_rate
    threshold: 0.05
```
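The queue-lag alert keys on message age rather than queue depth, because a short queue of old messages is worse than a long queue of fresh ones. A hedged sketch of how that metric might be derived, assuming each message carries its enqueue time:

```typescript
// Assumed message shape: enqueue time in epoch milliseconds.
interface QueuedMessage {
  enqueuedAt: number;
}

// Age of the oldest unprocessed message, in seconds. A growing value
// means the backlog is aging even while every invocation succeeds.
function oldestMessageAgeSeconds(messages: QueuedMessage[], nowMs: number): number {
  if (messages.length === 0) return 0;
  const oldestMs = Math.min(...messages.map((m) => m.enqueuedAt));
  return (nowMs - oldestMs) / 1000;
}
```

Most managed queue services expose an equivalent metric directly; the sketch only shows what it measures.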
Tracing gets harder in serverless because the path often crosses queues, events, and workflow transitions. The trace context must be propagated deliberately. If one function emits a message without trace metadata, the next function begins a disconnected story.
```typescript
// Trace context rides inside the message itself, because queues and
// event buses do not carry it automatically.
export async function publishInvoiceEvent(event: { invoiceId: string }, traceId: string) {
  await eventBus.publish({
    type: "invoice.generated",
    traceId,
    payload: event,
  });
}
```
What this demonstrates:

- The traceId travels inside the published message, because the event bus will not propagate context on its own.
- The consuming function can attach its spans to the same trace instead of beginning a disconnected story.
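On the consuming side, the next function picks the traceId out of the message and continues the same trace. A minimal sketch, assuming the same hypothetical message shape as `publishInvoiceEvent`:

```typescript
import { randomUUID } from "node:crypto";

// Same hypothetical message shape the producer publishes.
interface BusMessage {
  type: string;
  traceId?: string;
  payload: unknown;
}

// Reuse the propagated traceId, or start a fresh (disconnected) trace
// when a producer failed to attach one.
function continueTrace(message: BusMessage): { traceId: string; disconnected: boolean } {
  if (message.traceId) {
    return { traceId: message.traceId, disconnected: false };
  }
  // Missing metadata: this span begins a disconnected story.
  return { traceId: randomUUID(), disconnected: true };
}
```

Flagging the disconnected case, rather than silently minting a new ID, makes propagation gaps themselves observable.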
A function may be healthy from the platform's point of view while still failing its purpose because:

- a downstream dependency is slow or rejecting requests,
- trigger lag is growing even though each individual invocation succeeds,
- throttling and retries are pushing completion far beyond user expectations,
- timeouts are trending upward without yet crossing an error threshold.
That is why dashboards should combine function metrics with dependency metrics. Operators need to know whether the problem is in the function code, the trigger, or the system it depends on.
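One way to get dependency metrics next to function metrics is a thin wrapper around every outbound call, counting calls and failures per dependency. The names here are illustrative; a real system would flush these counters into the metrics backend rather than keep them in memory.

```typescript
type DepStats = { calls: number; failures: number };

// Per-dependency counters feeding a dependency_error_rate style metric.
const depStats: Record<string, DepStats> = {};

// Wrap each outbound call so dependency failures are counted
// separately from function-level exceptions.
async function callDependency<T>(dep: string, fn: () => Promise<T>): Promise<T> {
  const s = (depStats[dep] ??= { calls: 0, failures: 0 });
  s.calls += 1;
  try {
    return await fn();
  } catch (err) {
    s.failures += 1;
    throw err;
  }
}

function dependencyErrorRate(dep: string): number {
  const s = depStats[dep];
  return s && s.calls > 0 ? s.failures / s.calls : 0;
}
```

With this split, a dashboard can show that the function is fine while `partner-api` is failing one call in three.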
A serverless reporting system shows normal invocation counts and low exception rates, but users complain that reports are often hours late. What is the most likely observability gap?
The stronger answer is missing lag and dependency visibility. The system may be processing without throwing exceptions, yet queue age, stream lag, throttling, or workflow retries could be pushing completion far outside user expectations. Function error count alone is not enough.