
Final Design

Final WhatsApp architecture — all deep-dive decisions reflected

This is the complete system after all deep dives. Every component here was justified in a deep-dive interview session; nothing is added speculatively.


Architecture diagram

```mermaid
flowchart TD
    subgraph Clients
        CA[Client A - Mobile]
        CB[Client B - Mobile]
    end

    subgraph Connection Layer
        APIGW[API Gateway\nHTTP → WS upgrade]
        LB1[Load Balancer]
        CS1[Connection Server 1]
        CS2[Connection Server 2]
        CSN[Connection Server N\n500 total · 1M conns each]
    end

    subgraph App Layer
        LB2[Load Balancer]
        AS[App Server Pool\nautoscale · in-memory queue · circuit breakers]
        SEQ[Sequence Service]
    end

    subgraph Async Workers
        REGW[Registry Workers\nKafka consumers]
        DELW[Delivery Workers\npoll pending_deliveries]
        PUSH[Push Notification Service]
    end

    subgraph Redis Cluster
        REGISTRY[(Redis - Connection Registry\nuser→server · server:users reverse set)]
        INBOX[(Redis - Inbox Sorted Sets\n10 sharded primaries + read replicas\nuser→conv_id+timestamp ZSET)]
        PROFILES[(Redis - Profile Cache\nuser→name+avatar+status · TTL 1h+jitter)]
        RATELIMIT[(Redis - Rate Limiting\nrate:user_id INCR · TTL 1s)]
        SEQREDIS[(Redis - Sequence Counters\nconv_id→seq counter)]
    end

    subgraph Storage
        DDB[(DynamoDB\nmessages · pending_deliveries\nmessage_status · conversations\nusers)]
        S3[(S3\nmedia + cold messages after 30d)]
        CDN[CDN\nmedia delivery]
    end

    subgraph Kafka
        KAFKA[Kafka\nregistry-updates topic]
    end

    CA -- WebSocket --> APIGW
    CB -- WebSocket --> APIGW
    APIGW --> LB1
    LB1 --> CS1
    LB1 --> CS2
    LB1 --> CSN

    CS1 -- on connect: publish event --> KAFKA
    CS2 -- on connect: publish event --> KAFKA
    KAFKA --> REGW
    REGW -- HSET + SADD --> REGISTRY

    CS1 -- HTTP POST message --> LB2
    CS2 -- HTTP POST message --> LB2
    LB2 --> AS

    AS -- INCR rate:user_id --> RATELIMIT
    AS -- INCR conv seq --> SEQREDIS
    SEQ --> SEQREDIS

    AS -- write message --> DDB
    AS -- write pending_deliveries --> DDB
    AS -- write message_status --> DDB
    AS -- update conversations --> DDB
    AS -- ZADD inbox --> INBOX
    AS -- lookup registry --> REGISTRY

    AS -- route to online user --> CS2
    CS2 -- deliver message --> CB
    CB -- delivery ack --> CS2
    CS2 -- ack --> AS
    AS -- update message_status --> DDB

    DELW -- poll pending_deliveries --> DDB
    DELW -- lookup registry --> REGISTRY
    DELW -- route --> CS1
    DELW -- push if offline --> PUSH
    PUSH --> APNS[APNs / FCM]

    AS -- GET/SET profile --> PROFILES
    AS -- ZREVRANGE inbox --> INBOX

    DDB -- cold after 30d --> S3
    CA -- media upload via presigned URL --> S3
    S3 -- media delivery --> CDN
    CDN -- serve media --> CB
```
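
To make the async delivery path concrete, here is a minimal sketch of the delivery-worker loop from the diagram. It assumes a pending_deliveries key schema of (user_id, seq_id), an HTTP `/deliver` endpoint on connection servers, and a push-service stub — all illustrative names, none specified by the design itself.

```python
# Hypothetical delivery-worker loop: look up the recipient in the registry,
# route via their connection server if online, otherwise fall back to push.
import boto3
import redis
import requests
from boto3.dynamodb.conditions import Key

registry = redis.Redis(host="registry-redis")
pending = boto3.resource("dynamodb").Table("pending_deliveries")

def send_push_notification(user_id: str, payload: dict) -> None:
    """Stand-in for the APNs/FCM push service client."""

def drain_pending(user_id: str) -> None:
    server_id = registry.hget("registry", user_id)  # None -> user is offline
    items = pending.query(KeyConditionExpression=Key("user_id").eq(user_id))["Items"]
    for msg in items:
        payload = {"conv_id": msg["conv_id"], "seq_id": int(msg["seq_id"]), "body": msg["body"]}
        if server_id is not None:
            # Route to the connection server currently holding the recipient's WebSocket.
            requests.post(f"http://{server_id.decode()}/deliver", json=payload, timeout=2)
        else:
            send_push_notification(user_id, payload)
        # Delivered (or handed to push): remove the durable queue entry.
        pending.delete_item(Key={"user_id": user_id, "seq_id": msg["seq_id"]})
```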

Decision log — what was decided and why

Connection Layer

| Decision | Rationale |
| --- | --- |
| WebSocket over HTTP polling | Persistent connection for real-time delivery. HTTP polling wastes battery and adds latency. |
| 500 connection servers at 1M conns each | 500M DAU × 20% peak = 100M concurrent. 100M / 1M per server = 100 servers minimum; 500 gives 5× headroom. |
| Exponential backoff + jitter on reconnect | 1M clients reconnecting simultaneously after a server crash is a thundering herd. Jitter spreads the spike. |
| Registry writes async via Kafka | 500M simultaneous registry writes at New Year's midnight would overwhelm Redis. Kafka queues absorb the spike (worker sketched after this table). |
| Reverse mapping: server:users Redis Set | Enables O(1) bulk cleanup of stale entries when a connection server dies — no full registry scan. |
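
A minimal sketch of the registry worker described above, assuming confluent-kafka and redis-py, a single registry hash, and events shaped as `{user_id, server_id, event}`; key names beyond the registry-updates topic are illustrative.

```python
# Hypothetical registry worker: consumes connect/disconnect events from the
# registry-updates topic and applies them to the Redis connection registry.
import json

import redis
from confluent_kafka import Consumer

r = redis.Redis(host="registry-redis", port=6379)
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "registry-workers",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["registry-updates"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())  # e.g. {"user_id": "u1", "server_id": "cs-42", "event": "connect"}
    pipe = r.pipeline()
    if event["event"] == "connect":
        pipe.hset("registry", event["user_id"], event["server_id"])        # user -> server
        pipe.sadd(f"server:{event['server_id']}:users", event["user_id"])  # reverse set for bulk cleanup
    else:  # disconnect
        pipe.hdel("registry", event["user_id"])
        pipe.srem(f"server:{event['server_id']}:users", event["user_id"])
    pipe.execute()
```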

App Layer

| Decision | Rationale |
| --- | --- |
| In-memory queue + thread pool on app server | Absorbs the 2–3 minute gap before auto-scaling kicks in. Queue full → 429 → client retries. |
| Circuit breakers to DynamoDB and Redis | Prevents cascading failures. Threads don't hang on dead dependencies. Fail fast, stay alive. |
| Rate limiting via centralised Redis INCR | Per-server counters are bypassable (reconnect to a different server). A centralised counter is correct (sketch after this table). |
| Rate limit response as WS error message | Can't return HTTP 429 over an established WebSocket. The error goes back on the same channel. |
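
The centralised limiter reduces to a few lines. This sketch uses the classic INCR-then-EXPIRE fixed-window pattern on the rate:user_id key from the diagram; the 20 msg/s limit is an assumed value, not part of the design.

```python
# Hypothetical fixed-window rate limiter (1-second windows) over a shared
# Redis counter, so the limit holds across all app servers.
import redis

r = redis.Redis(host="ratelimit-redis")

MSGS_PER_SECOND = 20  # illustrative limit, not specified in the design

def allow_message(user_id: str) -> bool:
    key = f"rate:{user_id}"
    count = r.incr(key)      # atomic increment for this window
    if count == 1:
        r.expire(key, 1)     # first hit in the window starts the 1s TTL
    return count <= MSGS_PER_SECOND
```

When this returns False, the app server sends the rate-limit error back over the WebSocket rather than an HTTP 429, per the table above.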

Storage

| Decision | Rationale |
| --- | --- |
| DynamoDB for messages | Predictable low latency at any scale. PK: conv_id, SK: seq_id for range queries. |
| S3 cold tier after 30 days | 95% of reads hit recent messages. Cold storage cuts costs without impacting active users. |
| Sequence service (Redis INCR per conv) | Monotonically increasing IDs within a conversation ensure correct message ordering (write-path sketch after this table). |
| Conversations GSI: (participant, last_message_at) | Enables an efficient inbox query — the top-K conversations per user sorted by recency. |
| pending_deliveries table | Durable offline message queue. A delivery worker polls it and flushes on reconnect. |
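
A sketch of the write path these decisions imply: allocate a sequence number with Redis INCR, persist with conv_id as partition key and seq_id as sort key, and page a conversation with a single Query. Table and attribute names are assumptions.

```python
# Hypothetical send path: per-conversation sequence number, then a DynamoDB
# write keyed so that range queries by seq_id are cheap.
import time

import boto3
import redis
from boto3.dynamodb.conditions import Key

seq = redis.Redis(host="sequence-redis")
messages = boto3.resource("dynamodb").Table("messages")

def persist_message(conv_id: str, sender_id: str, body: str) -> int:
    seq_id = seq.incr(f"seq:{conv_id}")  # monotonic within the conversation
    messages.put_item(Item={
        "conv_id": conv_id,   # partition key
        "seq_id": seq_id,     # sort key -> ordered range queries
        "sender_id": sender_id,
        "body": body,
        "sent_at": int(time.time() * 1000),
    })
    return seq_id

def latest_messages(conv_id: str, limit: int = 50):
    # Newest page of one conversation in a single Query.
    resp = messages.query(
        KeyConditionExpression=Key("conv_id").eq(conv_id),
        ScanIndexForward=False,  # descending seq_id
        Limit=limit,
    )
    return resp["Items"]
```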

Caching

| Decision | Rationale |
| --- | --- |
| Profile cache (Redis, TTL 1h + jitter) | Profiles are read on every inbox load and change rarely. The cache eliminates K DynamoDB reads per inbox open. |
| Sync DEL on profile update | Profile updates are rare. A synchronous DEL (~1 ms on the write path) is simpler than Kafka/outbox for one cache key. |
| Inbox sorted sets (10 sharded primaries) | A single primary can't handle ~1M writes/second at New Year's midnight. Shard by user_id % 10. |
| Read replicas per inbox shard | Spreads 500M inbox reads across the replica fleet. Millisecond replication lag is acceptable. |
| TTL extended to 26h for known events | Prevents a cold start compounding a thundering herd during predictable high-traffic windows. |
| Request coalescing on profile cache miss | A profile cache failure causes a cold start against DynamoDB. Coalescing limits each unique profile to one DB read per app server at a time (sketch after this table). |
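
A sketch of the profile read path combining the jittered TTL and per-app-server request coalescing, assuming redis-py; the per-key lock map and the DynamoDB loader stub are illustrative.

```python
# Hypothetical cache-aside profile read: jittered TTL avoids synchronized
# expiry; a per-key lock coalesces concurrent misses on one app server.
import collections
import json
import random
import threading

import redis

r = redis.Redis(host="profile-redis")
_locks = collections.defaultdict(threading.Lock)  # unbounded here; a real impl would evict

def load_profile_from_dynamodb(user_id: str) -> dict:
    """Stand-in for the real DynamoDB users-table read."""
    return {"user_id": user_id, "name": "", "avatar_url": "", "status": ""}

def get_profile(user_id: str) -> dict:
    key = f"profile:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    with _locks[user_id]:            # one DB read per unique profile at a time
        cached = r.get(key)          # another thread may have filled it meanwhile
        if cached is not None:
            return json.loads(cached)
        profile = load_profile_from_dynamodb(user_id)
        ttl = 3600 + random.randint(0, 300)  # 1h base + jitter
        r.set(key, json.dumps(profile), ex=ttl)
        return profile
```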

Fault Isolation

| Decision | Rationale |
| --- | --- |
| Stale registry → offline delivery fallback | Routing to a dead server fails → treat the user as offline → pending_deliveries → delivered on registry recovery. |
| Monitoring + cleaner for registry cleanup | The reverse mapping lets the cleaner bulk-delete 1M stale entries with one Redis Set read — no full scan. |
| DynamoDB circuit breaker (threshold: 1% error rate over 30s) | SLO-driven threshold. Opens before the SLO is breached. Half-open state probes recovery (sketch after this table). |
| Inbox Redis failure → DynamoDB GSI fallback | The inbox sorted set is a cache; the conversations table is the source of truth. Slightly slower inbox load, not an outage. |
| Rate limit Redis fails open | Temporarily losing rate limiting is better than blocking all message sends during a Redis outage. |
| Sequence Redis falls back to timestamp ordering | Slightly weaker ordering is acceptable. Message delivery must not be blocked by a sequence counter failure. |
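
The 1%-over-30s breaker can be sketched in a few dozen lines. The 100-call minimum sample, 10 s cooldown, and single-probe half-open behaviour are assumptions layered on the stated threshold.

```python
# Hypothetical circuit breaker: trips when the error rate in the current 30s
# window exceeds 1%; after a cooldown, lets probes through (half-open).
import time

class CircuitBreaker:
    def __init__(self, threshold=0.01, window_s=30, cooldown_s=10, min_calls=100):
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.min_calls = min_calls      # avoid tripping on a tiny sample
        self.window_start = time.monotonic()
        self.calls = 0
        self.errors = 0
        self.opened_at = None           # None -> closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a probe once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record(self, ok: bool) -> None:
        now = time.monotonic()
        if now - self.window_start > self.window_s:   # start a fresh window
            self.window_start, self.calls, self.errors = now, 0, 0
        self.calls += 1
        self.errors += 0 if ok else 1
        if self.opened_at is not None:
            # Probe result: close on success, restart cooldown on failure.
            self.opened_at = None if ok else now
        elif self.calls >= self.min_calls and self.errors / self.calls > self.threshold:
            self.opened_at = now                      # trip the breaker

# Usage around a DynamoDB call (fail fast / fallback per the table above):
#   if breaker.allow():
#       try: result = table.get_item(...); breaker.record(True)
#       except Exception: breaker.record(False); raise
#   else: serve from the fallback path instead of hanging a thread
```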

Capacity summary at 500M DAU

| Component | Count / Size |
| --- | --- |
| Connection servers | 500 |
| App servers | Auto-scaled, ~1,000 at peak |
| DynamoDB | Auto-scales |
| Redis - Connection Registry | 1 cluster (~12.5 GB) |
| Redis - Inbox sorted sets | 10 primaries + replicas (~12.5 GB total) |
| Redis - Profile cache | 1 cluster (~125 GB) |
| Redis - Rate limiting | 1 cluster (tiny — 1s TTL keys) |
| Redis - Sequence counters | 1 cluster (tiny) |
| S3 - Media | Unbounded, tiered |
| Kafka | registry-updates topic, sized for 500M events/hr at peak |
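
The headline sizes follow from short arithmetic; the per-entry byte counts below are assumptions chosen only to show how numbers of this order arise.

```python
# Back-of-envelope check of the capacity table above.
DAU = 500_000_000
peak_concurrency = int(DAU * 0.20)                 # 100M concurrent connections
conn_servers_min = peak_concurrency // 1_000_000   # 100 at 1M conns/server
conn_servers = 5 * conn_servers_min                # 500 with 5x headroom

registry_bytes = peak_concurrency * 125            # assume ~125 B per user->server entry
print(registry_bytes / 1e9)                        # ~12.5 GB

profile_cache_bytes = DAU * 250                    # assume ~250 B per cached profile
print(profile_cache_bytes / 1e9)                   # ~125 GB
```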