Skip to content

Client Reconnect

Connection server failure — what the client experiences

When a connection server dies, the clients connected to it don't get a graceful notification. The TCP socket goes dead, and the client detects this through heartbeats and reconnects automatically.


The failure scenario

Connection server 7 goes down — hard crash, OOM kill, power cut, network partition. It gets no chance to notify anyone. 1 million users were connected to it.

From the client's perspective, the WebSocket connection goes silent.


How the client detects the failure

WhatsApp clients send a heartbeat ping to the connection server every N seconds (typically 30-60 seconds on mobile to preserve battery). The connection server responds with a pong.

When server 7 dies:

Client → ping → server 7
Server 7: dead, no response
Client: no pong received within timeout
→ Client marks connection as dead
→ Client initiates reconnect

In many cases the TCP socket itself breaks immediately when the server process dies — the OS sends a TCP RST or FIN. The client's socket library detects this without waiting for a heartbeat timeout.

Server 7 process killed
→ OS sends TCP RST to all 1M connected clients
→ Client socket libraries detect broken connection immediately
→ All 1M clients begin reconnect simultaneously

The heartbeat is the fallback for silent failures — network partitions where the connection appears alive but packets are being dropped.


The reconnect flow

The client reconnects to the same entry point — the load balancer. It has no knowledge of individual connection servers.

Client detects dead connection
→ Client connects to load balancer (same URL as original connection)
→ LB health checks have detected server 7 is down (within seconds)
→ LB routes to any healthy server (e.g. server 3)
→ New WebSocket established on server 3
→ Auth token validated
→ Registry update queued: user:alice → server3
→ Alice is back online

The client doesn't need to know which server died or why. It just reconnects to the LB and gets routed to a healthy server automatically.

Interview framing

"Clients detect the dead connection via TCP RST or heartbeat timeout. They reconnect to the load balancer — which has already stopped routing to the dead server via health checks. The client is back online within seconds, no manual intervention needed."