Hashing

Building on the previous

Raw IDs failed because they expose internals and break under sharding. Hashing addresses the exposure problem: a hash looks random and reveals nothing about the underlying IDs. It can also be computed on the app server, so no DB coordination is needed. Let's try it.


Approach 1 — Hash the long URL directly

Take the long URL, run it through a hash function (MD5, SHA-256), use the output as the short code.

Input  → https://very-long-url.com/with/path
MD5    → a3f8c2d1e9b047f6a1c3e5d8b2f4a6c9  (32 hex characters)

Problem: 32 characters is not short. A URL shortener with a 32-character code defeats the purpose.

So the instinct is to trim — just take the first 6 characters:

MD5 output  → a3f8c2d1e9b047f6...
Trimmed     → a3f8c2
Short URL   → bit.ly/a3f8c2
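
The trimming step above could be sketched in Python roughly as follows. This is a minimal illustration using the standard hashlib module; the helper name short_code and the default length of 6 are assumptions made for the example, not part of any real shortener's API.

```python
import hashlib


def short_code(long_url: str, length: int = 6) -> str:
    """Return the first `length` hex characters of the URL's MD5 digest.

    Hypothetical helper: the name and the default length are illustrative.
    """
    digest = hashlib.md5(long_url.encode("utf-8")).hexdigest()  # 32 hex chars
    return digest[:length]


url = "https://very-long-url.com/with/path"
full = hashlib.md5(url.encode("utf-8")).hexdigest()

print(len(full))        # 32: the full MD5 digest in hex
print(short_code(url))  # the first 6 characters of that same digest
```

The short code is just a prefix of the full digest, so everything after the sixth character is thrown away.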

Why trimming breaks uniqueness:

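A hash function maps any input to a fixed-length output. Even the full output is not strictly guaranteed to be unique (there are more possible inputs than outputs), but with 128 bits an accidental collision is so unlikely that it can be ignored in practice. When you trim it, you throw away bits, and that safety margin goes with them: 6 hex characters is only 24 bits, or about 16.7 million possible codes.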

URL 1 → MD5 → a3f8c2d1e9b0...  → trimmed → a3f8c2
URL 2 → MD5 → a3f8c2aa71f3...  → trimmed → a3f8c2  ← collision ✗

Two completely different long URLs now map to the same short code. bit.ly/a3f8c2 can only point to one of them. The other is lost.

This violates the core reliability requirement: no two long URLs can share a short code.
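
To see how easily this happens, the sketch below generates distinct URLs until two of them share the same 6-character trimmed MD5. The example.com URLs and the function names short_code and find_collision are made up for illustration.

```python
import hashlib
from itertools import count


def short_code(long_url: str, length: int = 6) -> str:
    """First `length` hex characters of the URL's MD5 digest (illustrative)."""
    return hashlib.md5(long_url.encode("utf-8")).hexdigest()[:length]


def find_collision(length: int = 6) -> tuple[str, str, str]:
    """Generate distinct URLs until two of them share a trimmed code."""
    seen: dict[str, str] = {}  # trimmed code -> first URL that produced it
    for i in count():
        url = f"https://example.com/page/{i}"  # hypothetical input URLs
        code = short_code(url, length)
        if code in seen:
            return seen[code], url, code
        seen[code] = url


first, second, code = find_collision()
print(first, "and", second, "both map to", code)
```

With a 24-bit code space, a search like this typically succeeds after only a few thousand URLs, far fewer than the 16.7 million codes available.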


Approach 2 — Hash the ID instead

The problem with hashing the long URL is that long URLs can be hundreds of characters. What if we hash something shorter — like the DB row ID?

DB assigns ID  → 4821903
Hash the ID    → 9b2f4a...  (shorter input, same hash length)
Trim to 6 chars → 9b2f4a
Short URL      → bit.ly/9b2f4a

The input is shorter, but the output length is still determined by the hash function — not the input. MD5 always outputs 128 bits regardless of whether you feed it 7 characters or 700.

So trimming still breaks uniqueness. The same collision problem applies.
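
A quick check with Python's hashlib shows that the digest size does not depend on the input size. The two inputs below are arbitrary examples chosen to match the 7-character and 700-character cases mentioned above.

```python
import hashlib

short_input = b"4821903"  # a 7-character ID
long_input = b"x" * 700   # a 700-character input (arbitrary filler)

for data in (short_input, long_input):
    digest = hashlib.md5(data).digest()  # raw bytes, not hex
    print(len(data), "bytes in ->", len(digest) * 8, "bits out")
```

Both lines report 128 bits out, so hashing the ID leaves exactly as many bits to discard as hashing the long URL.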


The fundamental problem with hashing

Hashing solves the exposure problem — the output looks random, reveals nothing about your internals. But it does not solve uniqueness when you trim, regardless of what you hash.

Full hash output  → accidental collisions practically never happen (all bits kept)
Trimmed output    → collisions become likely (bits discarded = far fewer possible codes)
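
How quickly collisions appear can be estimated with the standard birthday approximation, P ≈ 1 − exp(−n(n−1) / 2N), where n is the number of URLs and N is the number of possible codes. The sketch below assumes 6 hex characters (N = 16^6 = 16,777,216) and a hash that behaves uniformly at random; the function name collision_probability is illustrative.

```python
import math


def collision_probability(num_urls: int, code_length: int = 6,
                          alphabet_size: int = 16) -> float:
    """Approximate chance that at least two of `num_urls` share a code.

    Uses the birthday approximation and assumes uniformly distributed codes.
    """
    space = alphabet_size ** code_length  # 16**6 = 16,777,216 possible codes
    return 1.0 - math.exp(-num_urls * (num_urls - 1) / (2 * space))


for n in (1_000, 5_000, 10_000, 100_000):
    print(f"{n:>7} URLs -> {collision_probability(n):.1%} chance of a collision")
```

Under these assumptions the chance of at least one collision passes 50% at roughly five thousand URLs, which is tiny for a public URL shortener.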

Why this fails

Trimming any hash, whether of the long URL or the ID, discards bits and shrinks the code space until collisions become likely. You cannot trim a hash and claim it is still unique. The two goals (short code + unique) cannot both be satisfied by hashing alone.