Skip to content

Schema Update

The sequence number changes the DynamoDB schema — the sort key needs to change

The original schema used timestamp#message_id as the sort key. Now that seq_no is the ordering mechanism, the sort key should reflect that.


Original schema (before ordering deep dive)

Table: messages

PK: conversation_id
SK: timestamp#message_id    ← composite, collision-safe, chronologically ordered

Attributes:
  message_id, sender_id, receiver_id, content, s3_ref

The composite sort key was designed to: 1. Keep messages in chronological order (timestamp first) 2. Prevent collisions when two messages arrive at the same millisecond (message_id as tiebreaker)

Both of these problems are now solved better by seq_no.


Updated schema

Table: messages

PK: conversation_id
SK: seq_no                  ← globally ordered, collision-free, assigned by Redis INCR

Attributes:
  message_id      → unique ID (client-generated UUID or Snowflake)
  sender_id       → who sent the message
  receiver_id     → denormalized, avoids lookup to conversations table
  content         → message text
  timestamp       → client's local clock at send time, stored as attribute (display only)
  s3_ref          → nullable, pointer to S3 if message archived to cold storage

Why seq_no is strictly better as the sort key

Collision safety: timestamp#message_id needed the message_id suffix specifically because two messages could arrive at the same millisecond and collide on timestamp alone. seq_no is inherently unique — the Redis INCR guarantees no two messages in the same conversation ever get the same number.

Ordering correctness: timestamp#message_id sorted by time, but time is unreliable across servers (NTP drift) and across clients (clock skew). seq_no sorts by logical sequence — the order in which the server processed the messages. Unaffected by any clock.

Simplicity: A single integer is simpler than a composite string. Range queries on SK BETWEEN 1 AND 100 are cleaner than parsing timestamp#message_id strings.


What happens to timestamp

Timestamp is not dropped — it becomes a regular attribute. It serves one purpose only: display.

Before archival:
  SK:        42
  timestamp: "4:20:00"    ← stored as attribute, shown in UI as "4:20 PM"
  content:   "hey"
  s3_ref:    null

After archival (90+ days old):
  SK:        42
  timestamp: "4:20:00"    ← still there, display still works
  content:   null
  s3_ref:    "s3://whatsapp-archive/conv_abc123/msg_aaa.json"

The timestamp persists through archival. Even for messages from 5 years ago, the "4:20 PM" display still works correctly when the message is fetched from S3.


Updated chat history query

PK = conv_abc123
SK < current_seq           (everything before the current cursor)
LIMIT 20
SCAN_INDEX_FORWARD = false (highest seq first = most recent first)

Cursor for pagination: the seq_no of the oldest message in the current batch. Pass it as before_seq in the next request.

First load:   PK=conv_abc123, SK < ∞,   LIMIT 20 → returns seq 480–500
Next page:    PK=conv_abc123, SK < 480, LIMIT 20 → returns seq 460–479
Next page:    PK=conv_abc123, SK < 460, LIMIT 20 → returns seq 440–459

Clean, stable, no pagination drift.

Interview framing

"I'd change the sort key from timestamp#message_id to seq_no. The composite timestamp was a workaround for two problems — collision safety and ordering — that seq_no solves more cleanly. Timestamp stays as a display attribute so the UI can still show '4:20 PM' under each message."