Final Design
Final WhatsApp architecture — all deep dive decisions reflected
This is the complete system after all deep dives. Every component here was justified through an interview session. Nothing is added speculatively.
Architecture diagram
flowchart TD
subgraph Clients
CA[Client A - Mobile]
CB[Client B - Mobile]
end
subgraph Connection Layer
APIGW[API Gateway\nHTTP → WS upgrade]
LB1[Load Balancer]
CS1[Connection Server 1]
CS2[Connection Server 2]
CSN[Connection Server N\n500 total · 1M conns each]
end
subgraph App Layer
LB2[Load Balancer]
AS[App Server Pool\nautoscale · in-memory queue · circuit breakers]
SEQ[Sequence Service]
end
subgraph Async Workers
REGW[Registry Workers\nKafka consumers]
DELW[Delivery Workers\npoll pending_deliveries]
PUSH[Push Notification Service]
end
subgraph Redis Cluster
REGISTRY[(Redis - Connection Registry\nuser→server · server:users reverse set)]
INBOX[(Redis - Inbox Sorted Sets\n10 sharded primaries + read replicas\nuser→conv_id+timestamp ZSET)]
PROFILES[(Redis - Profile Cache\nuser→name+avatar+status · TTL 1h+jitter)]
RATELIMIT[(Redis - Rate Limiting\nrate:user_id INCR · TTL 1s)]
SEQREDIS[(Redis - Sequence Counters\nconv_id→seq counter)]
end
subgraph Storage
DDB[(DynamoDB\nmessages · pending_deliveries\nmessage_status · conversations\nusers)]
S3[(S3\nmedia + cold messages after 30d)]
CDN[CDN\nmedia delivery]
end
subgraph Kafka
KAFKA[Kafka\nregistry-updates topic]
end
CA -- WebSocket --> APIGW
CB -- WebSocket --> APIGW
APIGW --> LB1
LB1 --> CS1
LB1 --> CS2
LB1 --> CSN
CS1 -- on connect: publish event --> KAFKA
CS2 -- on connect: publish event --> KAFKA
KAFKA --> REGW
REGW -- HSET + SADD --> REGISTRY
CS1 -- HTTP POST message --> LB2
CS2 -- HTTP POST message --> LB2
LB2 --> AS
AS -- INCR rate:user_id --> RATELIMIT
AS -- INCR conv seq --> SEQREDIS
SEQ --> SEQREDIS
AS -- write message --> DDB
AS -- write pending_deliveries --> DDB
AS -- write message_status --> DDB
AS -- update conversations --> DDB
AS -- ZADD inbox --> INBOX
AS -- lookup registry --> REGISTRY
AS -- route to online user --> CS2
CS2 -- deliver message --> CB
CB -- delivery ack --> CS2
CS2 -- ack --> AS
AS -- update message_status --> DDB
DELW -- poll pending_deliveries --> DDB
DELW -- lookup registry --> REGISTRY
DELW -- route --> CS1
DELW -- push if offline --> PUSH
PUSH --> APNS[APNs / FCM]
AS -- GET/SET profile --> PROFILES
AS -- ZREVRANGE inbox --> INBOX
DDB -- cold after 30d --> S3
CA -- media upload via presigned URL --> S3
S3 -- media delivery --> CDN
CDN -- serve media --> CB
Decision log — what was decided and why
Connection Layer
| Decision |
Rationale |
| WebSocket over HTTP polling |
Persistent connection for real-time delivery. HTTP polling wastes battery and adds latency. |
| 500 connection servers at 1M conns each |
500M DAU × 20% peak = 100M concurrent. 100M / 1M per server = 100 servers minimum. 500 for headroom. |
| Exponential backoff + jitter on reconnect |
1M clients reconnecting simultaneously after a server crash = thundering herd. Jitter spreads the spike. |
| Registry writes async via Kafka |
500M simultaneous registry writes on New Year's midnight overwhelms Redis. Kafka queues absorb the spike. |
| Reverse mapping: server:users Redis Set |
Enables O(1) bulk cleanup of stale entries when a connection server dies — no full registry scan. |
App Layer
| Decision |
Rationale |
| In-memory queue + thread pool on app server |
Absorbs 2-3 minute gap before auto-scaling kicks in. Queue full → 429 → client retries. |
| Circuit breakers to DynamoDB and Redis |
Prevents cascade failures. Threads don't hang on dead dependencies. Fail fast, stay alive. |
| Rate limiting via centralised Redis INCR |
Per-server counters are bypassable (reconnect to different server). Centralised counter is correct. |
| Rate limit response as WS error message |
Can't return HTTP 429 over an established WebSocket. Error goes back on the same channel. |
Storage
| Decision |
Rationale |
| DynamoDB for messages |
Predictable low-latency at any scale. PK: conv_id, SK: seq_id for range queries. |
| S3 cold tier after 30 days |
95% of reads are recent messages. Cold storage cuts costs without impacting active users. |
| Sequence service (Redis INCR per conv) |
Monotonically increasing IDs within a conversation ensure correct message ordering. |
| Conversations GSI: (participant, last_message_at) |
Enables efficient inbox query — top K conversations per user sorted by recency. |
| pending_deliveries table |
Durable offline message queue. Delivery worker polls and flushes on reconnect. |
Caching
| Decision |
Rationale |
| Profile cache (Redis, TTL 1h + jitter) |
Profiles read on every inbox load, change rarely. Cache eliminates K DynamoDB reads per inbox open. |
| Sync DEL on profile update |
Profile updates are rare. Synchronous DEL + 1ms is simpler than Kafka/outbox for one cache key. |
| Inbox sorted sets (10 sharded primaries) |
Single primary can't handle ~1M writes/second at New Year's midnight. Shard by user_id % 10. |
| Read replicas per inbox shard |
Spreads 500M inbox reads across the replica fleet. Millisecond replication lag is acceptable. |
| TTL extended to 26h for known events |
Prevents cold start compounding on top of thundering herd during predictable high-traffic windows. |
| Request coalescing on profile cache miss |
Profile cache failure → cold start on DynamoDB. Coalescing limits each unique profile to 1 DB read per app server at a time. |
Fault Isolation
| Decision |
Rationale |
| Stale registry → offline delivery fallback |
Dead server routing fails → treat as offline → pending_deliveries → delivered on registry recovery. |
| Monitoring + cleaner for registry cleanup |
Reverse mapping lets cleaner bulk-delete 1M stale entries in one Redis Set read — no full scan. |
| DynamoDB circuit breaker (threshold: 1% error rate over 30s) |
SLO-driven threshold. Opens before SLO breach. Half-open probes recovery. |
| Inbox Redis failure → DynamoDB GSI fallback |
Inbox sorted set is a cache. Conversations table is source of truth. Slightly slower inbox load, not an outage. |
| Rate limit Redis fails open |
Temporary loss of rate limiting is better than blocking all message sends during Redis outage. |
| Sequence Redis falls back to timestamp ordering |
Slightly weaker ordering acceptable. Message delivery must not be blocked by a sequence counter failure. |
Capacity summary at 500M DAU
| Component |
Count / Size |
| Connection servers |
500 |
| App servers |
Auto-scaled, ~1,000 at peak |
| DynamoDB |
Auto-scales |
| Redis - Connection Registry |
1 cluster (~12.5GB) |
| Redis - Inbox sorted sets |
10 primaries + replicas (~12.5GB total) |
| Redis - Profile cache |
1 cluster (~125GB) |
| Redis - Rate limiting |
1 cluster (tiny — 1s TTL keys) |
| Redis - Sequence counters |
1 cluster (tiny) |
| S3 - Media |
Unbounded, tiered |
| Kafka |
registry-updates topic, sized for 500M events/hr at peak |