Measuring Latency

You cannot store every latency measurement individually. Histograms let you keep the shape of the distribution while throwing away the raw numbers.

Why raw storage doesn't work¶

At peak, WhatsApp processes roughly 100K messages/second across the fleet:

100,000 messages/sec
× 86,400 seconds/day
= 8.64 billion data points per day

At 8 bytes each, that's ~69GB of raw latency data per day. Storing it is expensive. More critically, computing a percentile requires sorting all 8.64 billion values — that's not a real-time operation.

You need a better approach.

Histograms — keep the shape, discard the raw values¶

Each app server maintains a set of delivery latency buckets in memory. Every message that completes delivery increments exactly one counter based on how long it took.

Bucket          Counter
0-50ms:         61,000
50-100ms:       22,000
100-200ms:       9,500
200-500ms:       5,800
500ms-1s:        1,400
1s+:               300
--------------------------
Total:         100,000

Instead of 100,000 individual numbers, you store 6 integers. Incrementing a bucket is a single atomic operation — essentially free.

Computing p99 from a histogram¶

p99 means: the latency value below which 99% of messages fall. With 100,000 messages, you need the bottom 99,000.

Walk the buckets, accumulating a running total until you cross 99,000:

0-50ms:    61,000 → running total: 61,000
50-100ms:  22,000 → running total: 83,000
100-200ms:  9,500 → running total: 92,500
200-500ms:  5,800 → running total: 98,300
500ms-1s:   1,400 → running total: 99,700  ← 99,000 falls in here

p99 lands in the 500ms-1s bucket. SLO says < 500ms. You're breaching it. This would fire an alert.

The trade-off: you lose precision within the bucket. You know p99 is somewhere between 500ms and 1s, not the exact millisecond. For SLO tracking this is fine — you care whether you're above or below the threshold, not the exact value.

Merging histograms across the fleet¶

WhatsApp runs thousands of app servers. Each builds its own histogram independently. To get a fleet-wide p99, the metrics collector adds bucket counts:

Server 1:   0-50ms: 610   50-100ms: 220   100-200ms: 95 ...
Server 2:   0-50ms: 598   50-100ms: 215   100-200ms: 91 ...
Server 3:   0-50ms: 602   50-100ms: 218   100-200ms: 93 ...
...
Fleet total: 0-50ms: 61,000  50-100ms: 22,000 ...

Histograms are mergeable by design — just add the counters. This is why they're the standard tool for distributed latency measurement.

Leading indicators for delivery latency¶

Latency alone doesn't tell the full story. Leading indicators warn you before the SLO breaches:

Kafka consumer lag (registry updates)  — growing lag → users appear offline longer → delivery delays
DynamoDB write latency p99             — spikes here cascade into delivery latency
Pending_deliveries table depth         — growing backlog → delivery worker falling behind
Redis inbox read latency               — spike here slows every inbox load
Connection server queue depth          — backed up → messages waiting to be forwarded

These aren't SLIs, but a spike in any of them predicts a delivery latency SLO breach within minutes.

Interview framing

"Each app server maintains a latency histogram — bucket counters for 0-50ms, 50-100ms, 100-200ms, 200-500ms, 500ms-1s, 1s+. Prometheus scrapes all servers every 15 seconds and adds bucket counts to compute fleet-wide p99. Beyond the SLI, also track Kafka consumer lag and pending_deliveries depth as leading indicators — both predict delivery latency degradation before the SLO breaches."