Replication
Sharding splits data. Replication protects it.
Each shard is a single machine. If that machine dies, the chunk of data it owns becomes unavailable — and since URLs are durable (once created, never lost), that's a violation of a core NFR. Every shard needs replicas.
Why each shard needs replicas¶
Sharding gives you storage scale. But each shard is still a single point of failure. If Shard-3 goes down:
Without replication:
All short codes that hash to Shard-3 → unavailable
Redirects return 404
Users cannot follow links → business impact
Replication means each shard has multiple copies of its data on different machines. If the primary fails, a secondary takes over.
The replication setup — 3 replicas per shard¶
The standard production setup is 1 primary + 2 secondaries per shard:
Shard-1:
Primary → handles all writes
Secondary → replicates from primary, handles reads
Secondary → replicates from primary, handles reads (spare for failover)
Why 3 total (not 2)?
With 2 replicas (1 primary + 1 secondary): - Primary dies → secondary becomes primary → now you have only 1 machine → zero fault tolerance - Before you can provision a new replica, you're one machine failure away from data loss
With 3 replicas (1 primary + 2 secondaries): - Primary dies → one secondary becomes primary → you still have 1 secondary → fault tolerant while you provision a replacement
3 replicas is the minimum for production durability. This is why Kafka, Cassandra, MongoDB, and most distributed systems default to a replication factor of 3.
Reads and writes with replication¶
Writes → always go to the primary
Primary replicates to secondaries asynchronously
Reads → can go to primary or secondaries
Secondaries may be slightly stale (async replication lag)
For a URL shortener: - Redirect reads go to secondaries — slight staleness is acceptable. A URL created 1 second ago might not be on the secondary yet, but that's a rare edge case covered by read-your-own-writes (next file). - Creation writes go to the primary.
This also helps with read throughput. With 2 secondaries per shard, read traffic is spread across 3 machines per shard instead of 1. Combined with caching, this comfortably handles the remaining 20k DB reads/sec.
Total machine count¶
8 shards (day 1 provisioning) × 3 replicas = 24 machines
64 shards (long-term target) × 3 replicas = 192 machines
192 machines sounds like a lot. But at 250TB across 64 shards, each shard holds ~4TB. A 4TB SSD machine is commodity hardware. This is well within the infrastructure budget of a system handling 100M users.
Replication lag — the one problem¶
Async replication means secondaries are not always perfectly up to date. After a write to the primary, there's a small window — typically milliseconds — before the secondaries reflect the change.
For most redirects this doesn't matter — the URL was created days or weeks ago and is fully synced across all replicas.
The one case it matters: the creator clicking their own link immediately after creation. That's the read-your-own-writes problem — covered in the next file.
Interview framing
"3 replicas per shard — 1 primary, 2 secondaries. Writes go to primary, reads to secondaries. 3 replicas because losing 1 still leaves 2 machines — you stay fault tolerant while provisioning a replacement. With 8 shards on day 1, that's 24 machines. Long-term at 64 shards, 192 machines. Async replication means secondaries are slightly stale — covered by read-your-own-writes for the creator edge case."