Describe latency metrics, error-rate monitoring, cold-start visibility, trigger lag, downstream dependency health, and distributed tracing across asynchronous workflows.
Metrics and tracing make serverless behavior visible at scale. Logs explain individual events. Metrics explain shape over time. Traces explain how one request or event moved through several components. In a serverless system, all three are needed because failures often emerge from interactions between many small functions, managed triggers, and downstream dependencies.
The goal is not to instrument everything equally. The goal is to instrument the paths that reveal user impact, dependency health, retry behavior, and backlog growth early enough to act.
```mermaid
flowchart LR
A["API or event source"] --> B["Function A"]
B --> C["Queue or workflow"]
C --> D["Function B"]
D --> E["Dependency"]
B -.metrics.-> F["Metrics backend"]
D -.trace spans.-> G["Tracing backend"]
E -.health signals.-> F
```
What to notice:

- Metrics and trace spans flow to different backends, yet they describe the same path; operators need both to tell a slow function from a slow dependency.
- The queue between Function A and Function B is where trace context is most easily lost.
- The dependency reports its own health signals, so dashboards can separate "the function is failing" from "the thing it calls is failing."
Strong serverless metrics usually include:

- Latency percentiles (p50, p95, p99) per function and per user-facing path
- Error rate, split by kind: exceptions, timeouts, and throttles
- Cold-start frequency and the latency it adds
- Trigger lag: oldest queue message age, stream iterator age, and event delivery delay
- Downstream dependency latency and error rate
- Retry counts and dead-letter volume
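Several of these signals can come from one shared wrapper around each handler. The sketch below is illustrative, not a real SDK: the `record` buffer stands in for whatever metrics pipeline the platform provides, and the names are assumptions.

```typescript
type MetricPoint = { name: string; value: number; unit: string };

// In-memory buffer standing in for a real metrics pipeline
// (for example, an embedded-metrics log line or an exporter).
const buffer: MetricPoint[] = [];

function record(name: string, value: number, unit: string): void {
  buffer.push({ name, value, unit });
}

// Module scope survives across warm invocations, so this flag
// distinguishes the first (cold) invocation from later ones.
let coldStart = true;

async function instrumented<T>(name: string, handler: () => Promise<T>): Promise<T> {
  record(`${name}.cold_start`, coldStart ? 1 : 0, "count");
  coldStart = false;
  const started = Date.now();
  try {
    const result = await handler();
    record(`${name}.errors`, 0, "count");
    return result;
  } catch (err) {
    // Count the failure, then rethrow so the platform still sees it.
    record(`${name}.errors`, 1, "count");
    throw err;
  } finally {
    record(`${name}.latency`, Date.now() - started, "ms");
  }
}
```

Because the `finally` block runs on both success and failure, latency is recorded even for invocations that throw.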
A common mistake is to monitor only function errors. A system can have near-zero function exceptions and still be unhealthy because of lag, throttling, timeout growth, or downstream slowness.
```yaml
alerts:
  - name: payment-api-p95-latency
    metric: p95_latency_ms
    threshold: 1200
  - name: invoice-queue-lag
    metric: queue_oldest_message_age_seconds
    threshold: 300
  - name: partner-api-error-rate
    metric: dependency_error_rate
    threshold: 0.05
```
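The queue-lag alert keys on message age rather than queue depth, because a short queue of old messages is worse than a long queue of fresh ones. A hedged sketch of how that metric might be derived, assuming each message carries its enqueue time:

```typescript
// Assumed message shape: enqueue time in epoch milliseconds.
interface QueuedMessage {
  enqueuedAt: number;
}

// Age of the oldest unprocessed message, in seconds. A growing value
// means the backlog is aging even while every invocation succeeds.
function oldestMessageAgeSeconds(messages: QueuedMessage[], nowMs: number): number {
  if (messages.length === 0) return 0;
  const oldestMs = Math.min(...messages.map((m) => m.enqueuedAt));
  return (nowMs - oldestMs) / 1000;
}
```

Most managed queue services expose an equivalent metric directly; the sketch only shows what it measures.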
Tracing gets harder in serverless because the path often crosses queues, events, and workflow transitions. The trace context must be propagated deliberately. If one function emits a message without trace metadata, the next function begins a disconnected story.
```typescript
// Trace context rides inside the message itself, because queues and
// event buses do not carry it automatically.
export async function publishInvoiceEvent(event: { invoiceId: string }, traceId: string) {
  await eventBus.publish({
    type: "invoice.generated",
    traceId,
    payload: event,
  });
}
```
What this demonstrates:

- The traceId travels inside the published message, because the event bus will not propagate context on its own.
- The consuming function can attach its spans to the same trace instead of beginning a disconnected story.
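On the consuming side, the next function picks the traceId out of the message and continues the same trace. A minimal sketch, assuming the same hypothetical message shape as `publishInvoiceEvent`:

```typescript
import { randomUUID } from "node:crypto";

// Same hypothetical message shape the producer publishes.
interface BusMessage {
  type: string;
  traceId?: string;
  payload: unknown;
}

// Reuse the propagated traceId, or start a fresh (disconnected) trace
// when a producer failed to attach one.
function continueTrace(message: BusMessage): { traceId: string; disconnected: boolean } {
  if (message.traceId) {
    return { traceId: message.traceId, disconnected: false };
  }
  // Missing metadata: this span begins a disconnected story.
  return { traceId: randomUUID(), disconnected: true };
}
```

Flagging the disconnected case, rather than silently minting a new ID, makes propagation gaps themselves observable.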
A function may be healthy from the platform's point of view while still failing its purpose because:

- a downstream dependency is slow or rejecting requests,
- trigger lag is growing even though each individual invocation succeeds,
- throttling and retries are pushing completion far beyond user expectations,
- timeouts are trending upward without yet crossing an error threshold.
That is why dashboards should combine function metrics with dependency metrics. Operators need to know whether the problem is in the function code, the trigger, or the system it depends on.
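One way to get dependency metrics next to function metrics is a thin wrapper around every outbound call, counting calls and failures per dependency. The names here are illustrative; a real system would flush these counters into the metrics backend rather than keep them in memory.

```typescript
type DepStats = { calls: number; failures: number };

// Per-dependency counters feeding a dependency_error_rate style metric.
const depStats: Record<string, DepStats> = {};

// Wrap each outbound call so dependency failures are counted
// separately from function-level exceptions.
async function callDependency<T>(dep: string, fn: () => Promise<T>): Promise<T> {
  const s = (depStats[dep] ??= { calls: 0, failures: 0 });
  s.calls += 1;
  try {
    return await fn();
  } catch (err) {
    s.failures += 1;
    throw err;
  }
}

function dependencyErrorRate(dep: string): number {
  const s = depStats[dep];
  return s && s.calls > 0 ? s.failures / s.calls : 0;
}
```

With this split, a dashboard can show that the function is fine while `partner-api` is failing one call in three.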
A serverless reporting system shows normal invocation counts and low exception rates, but users complain that reports are often hours late. What is the most likely observability gap?
The stronger answer is missing lag and dependency visibility. The system may be processing without throwing exceptions, yet queue age, stream lag, throttling, or workflow retries could be pushing completion far outside user expectations. Function error count alone is not enough.