Retry Strategy
The retry strategy determines how hard the system tries to recover from a failed S3 upload before declaring the paste permanently broken.
Why a message queue, not a direct retry
The naive approach: when S3 upload fails, the app server retries in a loop.
Problems:
- App server might crash mid-retry: the upload job is lost entirely and the paste stays IN_PROGRESS forever
- Retrying in the same process ties up app server threads
- If S3 is down for 30 minutes, the app server is sitting in a retry loop for 30 minutes
The right approach: enqueue the upload job into a message queue (e.g. SQS, Kafka). A dedicated upload worker picks up the job and handles retries independently of the app server.
App server:
→ INSERT paste row (status = IN_PROGRESS)
→ enqueue { shortCode, contentBytes } to upload queue
→ return 201 to user
Upload worker:
→ dequeues job
→ attempts S3 upload
→ on success: UPDATE paste (s3_url, status = PROCESSED), ack message
→ on failure: retry with backoff
→ on exhaustion: UPDATE paste (status = FAILED), ack message
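The message stays in the queue until the worker explicitly acknowledges it. If the worker crashes mid-upload, the message is never acknowledged, so the queue redelivers the job to another worker and no job is lost. Because the same job can therefore be delivered more than once, the upload has to be safe to repeat, for example by always writing to the same S3 key for a given shortCode.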
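The worker flow above can be sketched in Python as follows. This is a minimal illustration, not a production worker: `InMemoryQueue`, `Job`, `UploadError`, `handle_next_job`, and the `pastes` dict (standing in for the Postgres table) are all hypothetical names, and the toy queue only imitates acknowledge-to-remove delivery. The backoff sleep between attempts is omitted here to keep the ack logic visible.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class UploadError(Exception):
    """Hypothetical error raised when an S3 upload attempt fails."""


@dataclass
class Job:
    short_code: str
    content: bytes


@dataclass
class InMemoryQueue:
    """Toy stand-in for a real queue: a job stays queued until it is acked."""
    pending: List[Job] = field(default_factory=list)

    def enqueue(self, job: Job) -> None:
        self.pending.append(job)

    def receive(self) -> Job:
        # Peek only: an unacked job remains available for redelivery.
        return self.pending[0]

    def ack(self, job: Job) -> None:
        self.pending.remove(job)


def handle_next_job(
    queue: InMemoryQueue,
    pastes: Dict[str, dict],
    upload: Callable[[str, bytes], str],
    max_attempts: int = 6,
) -> None:
    job = queue.receive()
    for _ in range(max_attempts):
        try:
            s3_url = upload(job.short_code, job.content)
        except UploadError:
            continue  # a real worker would sleep with backoff before retrying
        pastes[job.short_code].update(s3_url=s3_url, status="PROCESSED")
        queue.ack(job)
        return
    # Every attempt failed: record the outcome, then ack so the job is not redelivered.
    pastes[job.short_code]["status"] = "FAILED"
    queue.ack(job)
```

Any exception other than `UploadError` (standing in for a worker crash) propagates before `ack` is called, so the job stays in the queue and the paste stays IN_PROGRESS until another worker picks it up.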
Exponential backoff with random jitter
You don't want to hammer a struggling S3 with immediate retries. If S3 is returning 500s, hammering it makes recovery slower for everyone.
Exponential backoff: each retry waits twice as long as the previous one.
Random jitter: add a small random offset to each wait to prevent multiple workers from retrying in lockstep (thundering herd).
Attempt 1: immediate
Attempt 2: wait 1s + jitter (±200ms)
Attempt 3: wait 2s + jitter
Attempt 4: wait 4s + jitter
Attempt 5: wait 8s + jitter
Attempt 6: wait 16s + jitter
→ all 6 failed → mark status = FAILED
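Total retry window: ~31 seconds of backoff (1 + 2 + 4 + 8 + 16), not counting the time each upload attempt itself takes. If S3 doesn't recover within that window, the paste is marked FAILED. This sets the effective SLO for paste creation: if S3 is down for longer than about 30 seconds, some pastes will fail permanently.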
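The schedule above can be computed with a few lines of Python. This is a sketch under assumptions: the function name `backoff_delays` and its parameters are hypothetical, and it applies the same ±200ms jitter to every retry, which the schedule only states explicitly for attempt 2.

```python
import random
from typing import List, Optional


def backoff_delays(
    max_attempts: int = 6,
    base: float = 1.0,
    jitter: float = 0.2,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the wait in seconds before each attempt."""
    rng = rng or random.Random()
    delays = [0.0]  # attempt 1 runs immediately
    for retry in range(max_attempts - 1):
        wait = base * (2 ** retry)            # 1s, 2s, 4s, 8s, 16s
        wait += rng.uniform(-jitter, jitter)  # +/-200ms so workers do not retry in lockstep
        delays.append(max(0.0, wait))
    return delays
```

A worker would call `time.sleep(delay)` before each attempt. With `jitter=0.0` the delays sum to exactly 31 seconds, matching the retry window described above.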
What happens to FAILED rows
FAILED rows stay in Postgres but are permanently unreadable. A background cleanup job sweeps them up periodically — say, once a day — and deletes rows where status = FAILED and created_at is older than 24 hours.
This keeps the table clean without any urgency. A FAILED paste doesn't affect any other part of the system — it's just a dead row.
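The cleanup job could look roughly like the sketch below. It uses Python's built-in `sqlite3` module as a stand-in for Postgres so the example is self-contained. The table name `pastes`, the function name `delete_stale_failed`, and storing `created_at` as Unix epoch seconds are all assumptions made for illustration.

```python
import sqlite3


def delete_stale_failed(
    conn: sqlite3.Connection,
    now_epoch: int,
    max_age_seconds: int = 24 * 3600,
) -> int:
    """Delete FAILED rows older than max_age_seconds; return how many were removed."""
    cutoff = now_epoch - max_age_seconds
    cur = conn.execute(
        "DELETE FROM pastes WHERE status = 'FAILED' AND created_at < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount
```

Run once a day from a scheduler, this touches only rows that are both FAILED and older than the cutoff, so recent failures and rows in any other status are left alone.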