The Spike Problem

The problem

Estimation gives you average numbers. But URL shorteners don't have average traffic — they have spikes. A celebrity tweets a link and traffic goes 10x in seconds. The base architecture and caching layer were designed for average load. This deep dive is about surviving the spike.

The numbers

From estimation:

Average read load  → 100k reads/sec
Peak read load     → 1M reads/sec  (10x spike — viral link, celebrity tweet)

After caching (80% hit rate):

Average case:
  Total reads       → 100k/sec
  Cache hits (80%)  → 80k/sec  hitting Redis
  Cache misses      → 20k/sec  hitting DB

Peak case:
  Total reads       → 1M/sec
  Cache hits (80%)  → 800k/sec hitting Redis
  Cache misses      → 200k/sec hitting DB
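
The split above is simple arithmetic, and a small sketch makes it easy to re-run with a different hit rate. The function name `split_reads` is made up for illustration; the 80% hit rate and the two load figures come from the numbers above.

```python
def split_reads(total_per_sec: int, hit_rate: float) -> tuple[int, int]:
    """Split a read load into (cache hits, cache misses) per second."""
    hits = round(total_per_sec * hit_rate)
    misses = total_per_sec - hits
    return hits, misses


HIT_RATE = 0.80  # cache hit rate from the estimate above

average = split_reads(100_000, HIT_RATE)   # average read load
peak = split_reads(1_000_000, HIT_RATE)    # 10x spike

print(f"average: {average[0]:,} hits/sec to Redis, {average[1]:,} misses/sec to DB")
print(f"peak:    {peak[0]:,} hits/sec to Redis, {peak[1]:,} misses/sec to DB")
```

The hit rate is held constant here to match the estimate. A real spike may shift it in either direction, so treat the result as a planning number rather than a prediction.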

What breaks at peak

Redis: A single Redis node handles roughly 100k–1M ops/sec. At 800k ops/sec you are near the top of that range, with little or no headroom. Worse, all 800k of those hits might be for the same viral URL, and consistent hashing maps a given key to exactly one node, so every one of them lands on the same Redis node. This is the hot key problem.
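
To see why one key cannot be spread out, here is a minimal sketch of a consistent hash ring. It is an illustration under stated assumptions, not how any particular Redis client implements routing: the `HashRing` class, the node names, the short codes, and the choice of MD5 with 100 virtual nodes per server are all hypothetical.

```python
import bisect
import hashlib
from collections import Counter


def _hash(value: str) -> int:
    # Any stable hash works; MD5 is used only for an even spread, not security.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    """Toy consistent hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["redis-a", "redis-b", "redis-c"])  # hypothetical node names

# 1,000 requests for one viral short code: every request maps to the same node.
viral_hits = Counter(ring.node_for("viral123") for _ in range(1000))

# 1,000 requests for 1,000 different short codes: load spreads across nodes.
mixed_hits = Counter(ring.node_for(f"code{i}") for i in range(1000))

print("viral key:", dict(viral_hits))
print("mixed keys:", dict(mixed_hits))
```

Adding more Redis nodes helps the mixed case but does nothing for the viral key, since the key still hashes to a single point on the ring.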

Database: 200k reads/sec reach the DB at peak. A single Postgres instance handles 10k–50k reads/sec, so one instance cannot absorb this. Read replicas help, but at those rates you need somewhere between 4 and 20 of them just to keep up, before adding any headroom.

App servers: 1M requests/sec arrive at the system. A single app server handles maybe 10k–50k requests/sec depending on the work per request, which puts the fleet at roughly 20 to 100 servers. You also need something in front of that fleet to distribute traffic.
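
The instance counts follow from dividing peak load by per-instance capacity and rounding up. A minimal sketch, assuming the per-instance ranges quoted above and a hypothetical helper named `instances_needed`:

```python
def instances_needed(load_per_sec: int, capacity_per_sec: int) -> int:
    """Smallest number of instances whose combined capacity covers the load."""
    return -(-load_per_sec // capacity_per_sec)  # integer ceiling division


# Per-instance capacity range used in the text, for both Postgres and app servers.
LOW, HIGH = 10_000, 50_000

db_replicas = (instances_needed(200_000, HIGH), instances_needed(200_000, LOW))
app_servers = (instances_needed(1_000_000, HIGH), instances_needed(1_000_000, LOW))

print(f"DB read replicas: {db_replicas[0]} to {db_replicas[1]}")
print(f"App servers:      {app_servers[0]} to {app_servers[1]}")
```

These are floor values with every instance running at full capacity. Any real deployment would add headroom on top.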

The three problems to solve

1. Hot key problem      → one viral URL overwhelming one Redis node
2. DB read load         → 200k cache misses/sec at peak
3. Traffic distribution → 1M requests/sec needs a fleet + load balancer

Each gets its own solution. Together they handle the spike.

Next: The hot key problem — what it is, how to detect it, and two ways to fix it.