Schema
Schema design follows access patterns
The schema is not designed by listing columns you think you need. It is designed by looking at the queries the system runs — then building the table and indexes around those queries.
The dominant query¶
SELECT long_url FROM urls WHERE short_code = 'x7k2p9';
This runs 100k times per second. Everything about the schema — the table structure, the index type, the column order — is optimised for this single query.
The schema¶
urls
-------------------------------
id BIGINT PRIMARY KEY ← internal row ID, never exposed to users
short_code VARCHAR(6) UNIQUE ← what users see, must be globally unique
long_url TEXT ← what we redirect to
created_at TIMESTAMP ← when the mapping was created
expired_at TIMESTAMP NULL ← optional expiry, null means never expires
id — internal primary key. The DB needs a primary key for its internal B-tree. This is never returned in any API response. The short code is the external identifier.
short_code — 6 characters, base62. Must be unique across the entire table — the DB enforces this via a unique constraint. VARCHAR(6) not TEXT because the length is fixed and known.
long_url — TEXT because URLs can be arbitrarily long. No fixed length limit.
created_at — useful for debugging, analytics, and future features. Cheap to store.
expired_at — nullable. Most URLs have no expiry. When set, the redirect query adds AND (expired_at IS NULL OR expired_at > NOW()) to filter out expired URLs.
The indexes¶
Index 1 — unique index on short_code
CREATE UNIQUE INDEX idx_short_code ON urls(short_code);
Two jobs:
1. Enforces uniqueness — DB rejects duplicate short codes at insert time
2. Makes collision checks fast — SELECT 1 WHERE short_code = 'x7k2p9' is O(log n), not a full table scan
Index 2 — covering index on (short_code, long_url)
The redirect query needs long_url given short_code. With a regular index on short_code, the DB does two steps:
Step 1 → look up short_code in index → find the row's physical location on disk
Step 2 → fetch the actual row from disk → read long_url
Step 2 is called a row lookup — an extra disk read on top of the index read. At 100k reads/sec, this doubles your disk I/O.
A covering index includes long_url inside the index structure itself:
CREATE UNIQUE INDEX idx_short_code_covering ON urls(short_code) INCLUDE (long_url);
Now the DB finds long_url directly inside the index — no row lookup needed. The redirect query is served entirely from the index. One disk read instead of two, at every single redirect.
Without covering index → 2 disk reads per redirect × 100k/sec = 200k disk reads/sec
With covering index → 1 disk read per redirect × 100k/sec = 100k disk reads/sec
Half the disk I/O, same result.
The QPS reality check¶
Even with a covering index, a single Postgres instance handles roughly 10k–50k reads/sec under real conditions. At 100k reads/sec average and 1M/sec peak, a single DB falls over.
This is not a schema problem. The schema is already optimal — covering index, no joins, minimal columns. The problem is raw throughput. The fix is:
- Caching — serve 80% of reads from Redis, DB sees only 20k/sec → covered in Caching deep dive
- Sharding — split 250TB across multiple nodes → covered in the next files
Interview framing
"One table — id, short_code, long_url, created_at, expired_at. Covering index on (short_code, long_url) so the redirect query never touches the actual row — index-only scan, half the disk I/O. Single Postgres tops out at 50k reads/sec — caching brings effective DB load to 20k/sec, sharding handles the storage limit."