Other Redis Failures
Other Redis failures — registry, rate limiting, sequence counter
The system has three other Redis instances. Each has a different criticality and a different fallback.
Connection Registry Redis
What it does: maps user IDs to their current connection server for message routing.
What breaks when it goes down: the app server can't look up which server a user is connected to. All users appear offline to the routing layer.
Fallback: the offline delivery system handles it cleanly. A registry miss already means "treat as offline", and a failed lookup is handled the same way: the message goes to pending_deliveries. When the registry Redis recovers, the delivery worker resumes routing.
Registry Redis down
→ All registry lookups miss
→ All messages go to pending_deliveries
→ Registry Redis recovers
→ Clients stay connected throughout (connections were unaffected; only lookup was broken)
→ Registry repopulated via Kafka consumers
→ Delivery workers drain pending_deliveries
→ All messages delivered
Users experience a delay — messages queue up during the outage and arrive in a burst once the registry recovers. No messages are lost.
Why this is survivable: the registry is not the source of truth for messages. It's only needed for routing. The pending_deliveries table holds the messages safely until routing is restored.
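The "outage looks like a miss" behaviour can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual code: `Router`, `StoreUnavailable`, and the in-memory lists standing in for the pending_deliveries table and for live pushes are all hypothetical names. A real Redis client raises its own connection and timeout exception types, which would be caught in place of `StoreUnavailable`.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Protocol, Tuple


class StoreUnavailable(Exception):
    """Hypothetical stand-in for the client library's connection/timeout errors."""


class Registry(Protocol):
    """Anything that can map a user ID to a connection server name."""

    def get(self, user_id: str) -> Optional[str]: ...


@dataclass
class Router:
    registry: Registry
    # In-memory stand-ins (assumptions): the pending_deliveries table and live pushes.
    pending_deliveries: List[Tuple[str, str]] = field(default_factory=list)
    sent: List[Tuple[str, str, str]] = field(default_factory=list)

    def lookup_server(self, user_id: str) -> Optional[str]:
        """Return the user's connection server, or None on a miss or an outage."""
        try:
            return self.registry.get(user_id)
        except StoreUnavailable:
            # An outage is treated exactly like a miss: the user looks offline.
            return None

    def route(self, user_id: str, message: str) -> str:
        server = self.lookup_server(user_id)
        if server is None:
            self.pending_deliveries.append((user_id, message))
            return "queued"
        self.sent.append((server, user_id, message))
        return "sent"

    def drain_pending(self) -> int:
        """Retry queued messages; anything still unroutable is queued again."""
        queued, self.pending_deliveries = self.pending_deliveries, []
        delivered = 0
        for user_id, message in queued:
            if self.route(user_id, message) == "sent":
                delivered += 1
        return delivered
```

Because the outage path and the offline path are the same code, the sketch needs no outage-specific branch in `route`; recovery is simply `drain_pending` succeeding again.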
Rate Limiting Redis
What it does: stores per-user message counters (INCR with 1-second TTL) to enforce the 10 messages/second limit.
What breaks when it goes down: the app server can't check rate limit counters. Every INCR call fails.
Fallback — fail open: allow all messages through without rate limiting.
Rate limit Redis down
→ INCR call fails
→ App server: treat as rate limit not exceeded
→ Message allowed through
This is the correct choice. The alternative — fail closed (reject all messages) — would mean no user can send any message during the outage. That's far worse UX than temporarily losing rate limit enforcement.
The risk of failing open is a brief window where a malicious user can send unlimited messages. At WhatsApp scale, this is an acceptable trade-off for a short outage. Rate limiting Redis is a small, low-load instance — it rarely goes down, and if it does, it recovers quickly.
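A fail-open check might look like the sketch below. It assumes a client object exposing `incr(key)` and `expire(key, seconds)` (redis-py's client has methods with these names); the key format, `allow_message`, and `StoreUnavailable` are hypothetical and chosen only for illustration. In real code the `except` clause would name the client library's own connection and timeout exceptions.

```python
LIMIT_PER_SECOND = 10  # from the text: 10 messages/second per user


class StoreUnavailable(Exception):
    """Hypothetical stand-in for the client library's connection/timeout errors."""


def allow_message(client, user_id: str) -> bool:
    """Return True if the message may be sent.

    Fails open: if the counter store cannot be reached, the message is allowed.
    """
    key = f"rate:{user_id}"  # hypothetical key format
    try:
        count = client.incr(key)
        if count == 1:
            # First message in this window: start the 1-second TTL.
            # INCR and EXPIRE are two separate calls here, so they are not atomic.
            client.expire(key, 1)
    except StoreUnavailable:
        # Fail open: losing enforcement briefly beats rejecting every message.
        return True
    return count <= LIMIT_PER_SECOND
```

Failing closed would be a one-line change (`return False` in the `except` branch), which is what makes the choice easy to overlook in review and worth stating explicitly.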
Sequence Counter Redis
What it does: generates monotonically increasing message IDs per conversation for ordering.
What breaks when it goes down: new messages can't get a sequence number. Message ordering within a conversation may be lost.
Fallback: use timestamp-based ordering as a fallback ID.
Sequence Redis down
→ Can't get seq_id for message
→ Fall back to: message_id = current_timestamp_ms + random_suffix
→ Messages ordered by timestamp
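Timestamp ordering is slightly weaker than sequence numbers: two messages sent within the same millisecond have non-deterministic order, and timestamps taken on different app servers are subject to clock skew. In practice same-millisecond collisions within one conversation are extremely rare and users won't notice.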
The system eventually becomes consistent — when sequence Redis recovers, new messages get proper sequence IDs again. Existing messages with timestamp IDs are already delivered and don't need re-ordering.
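The fallback can be sketched as below. The source gives only `current_timestamp_ms + random_suffix`; reading that as "timestamp in milliseconds followed by three random decimal digits" is an assumption of this sketch, as are the names `next_message_id` and `StoreUnavailable` and the `seq:` key format. The client is assumed to expose `incr(key)`.

```python
import secrets
import time
from typing import Tuple


class StoreUnavailable(Exception):
    """Hypothetical stand-in for the client library's connection/timeout errors."""


def next_message_id(client, conversation_id: str) -> Tuple[int, str]:
    """Return (message_id, kind), where kind is "sequence" or "timestamp"."""
    try:
        # Normal path: per-conversation counter, monotonically increasing.
        return client.incr(f"seq:{conversation_id}"), "sequence"
    except StoreUnavailable:
        # Fallback path: millisecond timestamp with a random suffix appended.
        ts_ms = time.time_ns() // 1_000_000
        suffix = secrets.randbelow(1000)  # three decimal digits (assumption)
        return ts_ms * 1000 + suffix, "timestamp"
```

The suffix keeps IDs distinct when two messages share a millisecond, but it is random, so it does not decide their order. The sketch also returns which kind of ID was produced: a millisecond timestamp is numerically far larger than a small counter value, and the source does not specify how the two kinds are compared with each other.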
Fail open vs fail closed
Rate limiting Redis fails open: temporary loss of rate limiting is better than blocking all messages. Sequence counter Redis falls back to timestamps: slightly weaker ordering is better than message delivery failure. Both choices prioritise availability over strict correctness for short outage windows.
Interview framing
"Registry Redis failure degrades to offline delivery — no messages lost, just delayed. Rate limiting Redis fails open — we lose rate limiting temporarily but keep message delivery working. Sequence Redis falls back to timestamp ordering — slightly weaker but delivery is unaffected."