Indexing Strategy
The cleanup job and watchdog both query by status — but status has low cardinality, making a naive index on it nearly useless. Partial indexes solve this precisely.
The low cardinality problem¶
Status columns look like good index candidates — queries filter on them constantly. But cardinality matters enormously for index usefulness.
Cardinality = how many distinct values exist relative to the number of rows.
pastes table: 3.65B rows
deletion_status values: NOT_EXPIRED, DELETION_IN_PROGRESS, DELETED
Distribution at steady state:
NOT_EXPIRED: ~95% of rows (pastes not yet expired)
DELETED: ~4.9% of rows (already cleaned up)
DELETION_IN_PROGRESS: ~0.1% of rows (currently being processed)
A B-tree index on deletion_status would have 3 leaf nodes, each pointing to billions of row pointers. When the cleanup job queries WHERE deletion_status = 'NOT_EXPIRED', Postgres would use the index to find that node — then fetch 95% of the table anyway. At that point, a full table scan is faster. Postgres's query planner would likely ignore the index entirely.
Low cardinality indexes don't just fail to help — they waste storage and slow down writes (every INSERT/UPDATE must maintain the index). They are actively harmful.
Partial indexes — index only the rows you actually query¶
A partial index is an index with a WHERE clause. It only indexes rows that satisfy the condition. Rows that don't match are not in the index at all.
For the cleanup job's main query:
CREATE INDEX idx_cleanup_expired
ON pastes (expires_at)
WHERE deletion_status = 'NOT_EXPIRED';
This index contains only NOT_EXPIRED rows, sorted by expires_at. At any given time, that's pastes which haven't expired yet — a large set initially, but the query filters to expires_at < now(), which is a small, time-bounded range at the front of the index.
For the watchdog query:
CREATE INDEX idx_watchdog_stuck
ON pastes (deletion_initiated_at)
WHERE deletion_status = 'DELETION_IN_PROGRESS';
This index contains only DELETION_IN_PROGRESS rows. At steady state that's a tiny fraction of 3.65B rows — maybe a few thousand at most. The entire index fits in a few pages. The watchdog's query hits it instantly.
Why not include deletion_status in the index body?¶
You might think to write:
CREATE INDEX idx_watchdog_stuck
ON pastes (deletion_status, deletion_initiated_at)
WHERE deletion_status = 'DELETION_IN_PROGRESS';
But the partial condition WHERE deletion_status = 'DELETION_IN_PROGRESS' already guarantees that every row in this index has deletion_status = 'DELETION_IN_PROGRESS'. Storing it again in the index body is redundant — it wastes space in the index and adds no query benefit.
The single-column partial index is smaller, faster to maintain, and equally effective:
-- Redundant (status stored twice — once in partial condition, once in body)
ON pastes (deletion_status, deletion_initiated_at)
WHERE deletion_status = 'DELETION_IN_PROGRESS'
-- Correct (status filters via partial condition, body only stores what's needed)
ON pastes (deletion_initiated_at)
WHERE deletion_status = 'DELETION_IN_PROGRESS'
Summary of indexes for the cleanup system¶
-- Cleanup job: find expired, unprocessed pastes efficiently
CREATE INDEX idx_cleanup_expired
ON pastes (expires_at)
WHERE deletion_status = 'NOT_EXPIRED';
-- Watchdog: find rows stuck in DELETION_IN_PROGRESS
CREATE INDEX idx_watchdog_stuck
ON pastes (deletion_initiated_at)
WHERE deletion_status = 'DELETION_IN_PROGRESS';
Two small, targeted indexes. Each serves exactly one query pattern. Neither wastes space on rows that are irrelevant to that query.
Interview framing
"Status columns have low cardinality — a full index on deletion_status is nearly useless because Postgres would still scan most of the table. Partial indexes solve this: the cleanup job gets an index on expires_at filtered to NOT_EXPIRED rows only, and the watchdog gets an index on deletion_initiated_at filtered to DELETION_IN_PROGRESS rows only. Each index is tiny because it only covers the small subset of rows that the query actually needs. Including deletion_status in the index body would be redundant — the partial condition already guarantees every row in the index matches that status."