Edge Cases
The edge cases that break the happy path
The salting mechanism works cleanly when everything is running. These are the scenarios where something goes wrong: a user was offline when the conversation was salted, Redis restarts, a new app server comes up cold, or the hot partition service crashes.
Edge case 1 — User was offline during salting
Scenario:
- conv_abc123 was normal (N=1) when Bob went offline
- While Bob was offline, the conversation got hot — N bumped to 4
- Messages were written to conv_abc123#0 through conv_abc123#3
- Bob reconnects
What happens without the fix:
Bob's client requests chat history. If the app server handling Bob's request doesn't know about the salting, it queries only conv_abc123 (N=1) and misses everything that was written to the salted partitions conv_abc123#0 through conv_abc123#3. Bob sees an empty chat or missing messages.
The fix:
The registry lookup happens at the app server on every read — not at the client. Bob's client just asks for conv_abc123 history. The app server:
1. GET registry[conv_abc123] → max_N = 4
2. Scatter-gather across conv_abc123#0 through #3
3. Return complete history to Bob
Bob being offline during the salting is irrelevant: the registry is the source of truth and the app server consults it on every read, so Bob's client never needs to know that salting happened.
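The three server-side steps above can be sketched as follows. This is a minimal illustration, not the actual implementation: the dicts REGISTRY and MESSAGES are hypothetical in-memory stand-ins for the Redis registry and the DynamoDB message table, and the function names are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins: a dict for the Redis registry and a dict
# of partition key -> [(timestamp, text)] for the DynamoDB message table.
REGISTRY = {"conv_abc123": 4}
MESSAGES = {
    "conv_abc123#0": [(101, "a"), (104, "d")],
    "conv_abc123#1": [(102, "b")],
    "conv_abc123#3": [(103, "c")],
}


def partition_keys(conversation_id, max_n):
    """Unsalted key for N=1, otherwise conversation_id#0 .. #(N-1)."""
    if max_n <= 1:
        return [conversation_id]
    return [f"{conversation_id}#{i}" for i in range(max_n)]


def query_partition(key):
    """Stand-in for one DynamoDB Query against a single partition key."""
    return MESSAGES.get(key, [])


def read_history(conversation_id):
    # Step 1: the server, not the client, asks the registry for max_N.
    max_n = REGISTRY.get(conversation_id, 1)
    keys = partition_keys(conversation_id, max_n)
    # Step 2: scatter the queries across every salted partition in parallel.
    with ThreadPoolExecutor(max_workers=len(keys)) as pool:
        parts = list(pool.map(query_partition, keys))
    # Step 3: gather and order by timestamp before returning to the client.
    return sorted((m for part in parts for m in part), key=lambda m: m[0])
```

The client only ever passes the plain conversation ID; whether the read fans out to one partition or four is decided entirely inside read_history.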
Edge case 2 — Redis restarts
Scenario: Redis holding the hot partition registry crashes and restarts.
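What happens:
- With AOF enabled, Redis replays the log on restart and recovers the registry entries. Under the default appendfsync everysec policy, up to about one second of the most recent writes can be missing.
- Recovery time depends on AOF file size. For a registry with millions of entries, replay takes seconds to minutes.
- During recovery, registry lookups fail: connections are refused while the process is down, and Redis answers commands with a LOADING error while it replays the AOF. If app servers treat a failed lookup like a missing entry, they handle every conversation as N=1 and their queries miss the salted partitions.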
The fix (warm-up before serving traffic): On Redis restart, the registry is restored before it serves traffic again, either by the hot partition service from a DynamoDB backup or by Redis itself from the AOF:
Option A: DynamoDB backup table
→ On every registry update, also write to a DynamoDB table: conversation_id → max_N
→ On Redis restart: scan DynamoDB backup → repopulate Redis
→ Slower recovery but zero data loss
Option B: AOF replay (no backup needed)
→ Redis replays AOF on restart automatically
→ If AOF is intact, full recovery with no manual intervention
→ AOF corruption is rare but possible
For production, use both — AOF as the fast path, DynamoDB backup as the fallback if AOF is corrupted.
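Option A can be sketched as a dual write plus a rebuild step. This is an illustration under stated assumptions: plain dicts stand in for Redis and for the DynamoDB backup table, the class and method names are hypothetical, and writing the backup before the cache is a design choice made for this example rather than something the design above prescribes.

```python
class HotPartitionRegistry:
    """Keeps conversation_id -> max_N in a cache with a durable backup copy."""

    def __init__(self, cache, backup):
        self.cache = cache    # stand-in for Redis
        self.backup = backup  # stand-in for the DynamoDB backup table

    def set_max_n(self, conversation_id, max_n):
        # max_N never decreases, so keep the larger of old and new values.
        new_value = max(self.backup.get(conversation_id, 1), max_n)
        # Assumption: write the durable copy first so it is never behind Redis.
        self.backup[conversation_id] = new_value
        self.cache[conversation_id] = new_value
        return new_value

    def rebuild_cache(self):
        """Scan the backup and repopulate the cache; returns entries restored."""
        restored = 0
        for conversation_id, max_n in self.backup.items():
            self.cache[conversation_id] = max_n
            restored += 1
        return restored
```

With a real DynamoDB table the scan would be paginated, which is where the slower recovery time of Option A comes from.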
During the recovery window: App servers should detect Redis unavailability and fall back to reading from the DynamoDB backup table directly. Slower (~5-10ms instead of ~1ms) but correct.
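A minimal sketch of that fallback follows. The important detail is that "Redis could not answer" and "Redis answered that there is no entry" are different outcomes, and only the second one means N=1. The exception type, the getter functions, and BACKUP_TABLE are hypothetical names; a real client would raise its own connection or LOADING errors, which a thin wrapper could map to one exception like this.

```python
class RegistryUnavailable(Exception):
    """Raised by a (hypothetical) Redis wrapper on connection or LOADING errors."""


BACKUP_TABLE = {"conv_abc123": 4}  # stand-in for the DynamoDB backup table


def redis_get_down(conversation_id):
    """Simulates Redis during the recovery window."""
    raise RegistryUnavailable("redis is restarting")


def redis_get_healthy(conversation_id):
    """Simulates a healthy Redis that has no entry for this conversation."""
    return None


def lookup_max_n(conversation_id, redis_get, backup_get):
    try:
        value = redis_get(conversation_id)
    except RegistryUnavailable:
        # Redis could not answer: do NOT assume N=1, ask the backup instead.
        value = backup_get(conversation_id)
    # Only a successful lookup with no entry means the conversation is unsalted.
    return int(value) if value is not None else 1
```

For example, lookup_max_n("conv_abc123", redis_get_down, BACKUP_TABLE.get) returns 4, so reads still fan out across the salted partitions while Redis recovers.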
Edge case 3 — New app server comes up cold
Scenario: A new app server is added to the fleet (auto-scaling during peak traffic). It has no local state — no counters, no cached registry entries.
What happens:
- Registry lookups: the new server queries Redis fresh on every request → correct, no issue
- Local WPS counters: start at 0 → the server will under-detect hot conversations until counters build up
The fix: The local WPS counter is a detection mechanism, not a routing mechanism. Under-detection on a new server means it takes slightly longer to detect that a conversation is hot — but the registry already has the correct max_N from when the conversation was first detected as hot by other servers. Writes and reads are routed correctly regardless. The new server's detection will catch up within a few seconds as traffic flows through it.
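The separation between detection and routing can be made concrete with a small sketch. Everything here is illustrative: the sliding-window counter is one possible way to count writes per second, and the salt rule (sequence number modulo max_N) is an assumption, since the way a salt is chosen is not described in this section.

```python
from collections import defaultdict, deque


class LocalWpsCounter:
    """Per-server sliding-window write counter. Used for detection only."""

    def __init__(self, window_seconds=1.0):
        self.window = window_seconds
        self.timestamps = defaultdict(deque)

    def record(self, conversation_id, now):
        """Record one write at time `now`; return writes seen in the window."""
        q = self.timestamps[conversation_id]
        q.append(now)
        # Drop writes that have fallen out of the window.
        while q and q[0] <= now - self.window:
            q.popleft()
        return len(q)


def route_write(conversation_id, sequence, registry):
    """Pick the partition key from the registry alone (salt rule is assumed)."""
    max_n = registry.get(conversation_id, 1)
    if max_n <= 1:
        return conversation_id
    return f"{conversation_id}#{sequence % max_n}"
```

route_write never reads the counter, so a server whose LocalWpsCounter has just started from zero still routes to the same partitions as every other server.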
Edge case 4 — Hot partition service crashes
Scenario: The service that consumes from the Redis Stream and updates the registry goes down.
What happens:
- Existing registry entries remain intact: max_N doesn't decrease, and existing salting still works
- New hot conversations are not detected: their N stays at 1 even as they exceed 1,000 WPS
- DynamoDB throttling begins for newly hot conversations
The fix:
- Run multiple instances of the hot partition service as consumers in one Redis Streams consumer group. If one crashes, the others continue consuming from the stream.
- The Redis Stream retains entries until they are trimmed, and the consumer group tracks which ones were delivered but never acknowledged. When the service recovers, it replays the missed events and catches up.
- Add alerting on DynamoDB throttle metrics. A sudden spike in ProvisionedThroughputExceededException errors signals that the hot partition service may be down.
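The replay behaviour can be sketched with the consumer-group read and acknowledge calls. The two functions below assume a client that exposes xreadgroup and xack with the same arguments as redis-py; FakeStreamClient is a small in-memory test double written for this example so the sketch runs without a server, and the stream, group, and consumer names are hypothetical.

```python
class FakeStreamClient:
    """In-memory test double for the few stream calls this sketch uses."""

    def __init__(self):
        self.entries = []    # [(entry_id, fields)] in arrival order
        self.delivered = 0   # index of the next never-delivered entry
        self.pending = {}    # entry_id -> fields, delivered but not acked

    def xadd(self, stream, fields):
        entry_id = f"{len(self.entries) + 1}-0"
        self.entries.append((entry_id, fields))
        return entry_id

    def xreadgroup(self, group, consumer, streams, count=None):
        (stream, start_id), = streams.items()
        if start_id == ">":  # entries never delivered to this group
            batch = self.entries[self.delivered:][:count]
            self.delivered += len(batch)
            self.pending.update(batch)
        else:                # re-read delivered-but-unacked entries
            batch = list(self.pending.items())[:count]
        return [[stream, batch]]

    def xack(self, stream, group, *ids):
        return sum(1 for i in ids if self.pending.pop(i, None) is not None)


def process_batch(client, stream, group, consumer, handle, start_id=">"):
    """Read one batch; acknowledge each entry only after handle() succeeds."""
    response = client.xreadgroup(group, consumer, {stream: start_id}, count=100)
    processed = 0
    for _stream_name, entries in response or []:
        for entry_id, fields in entries:
            handle(fields)  # e.g. bump max_N in the registry
            client.xack(stream, group, entry_id)
            processed += 1
    return processed


def recover_then_consume(client, stream, group, consumer, handle):
    """On startup, drain unacknowledged entries first, then read new ones."""
    total = 0
    while True:
        n = process_batch(client, stream, group, consumer, handle, start_id="0")
        if n == 0:
            break
        total += n
    return total + process_batch(client, stream, group, consumer, handle)
```

Acknowledging only after the handler succeeds is what makes a crash safe: an event that was delivered but not processed stays pending and is picked up again when the same consumer name restarts. Entries left behind by a consumer that never comes back would have to be claimed by another instance, for example with XAUTOCLAIM.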
Summary
| Edge Case | Risk | Fix |
|---|---|---|
| User offline during salting | Missing messages on reconnect | Registry always consulted on read — client unaware of salting |
| Redis restart | Registry unavailable during recovery | AOF replay (fast) + DynamoDB backup (fallback) |
| New app server cold start | Slow hot detection | Doesn't affect routing — registry in Redis is authoritative |
| Hot partition service crash | New hot conversations not detected | Multiple instances + Redis Stream replay on recovery |