
Measuring Availability

Availability is not "is the server running?" — it is "did the message get delivered?" A connection server can be alive and dropping every message it handles. That is not available.


The health check trap

The instinct when monitoring availability is to ping the server — send a heartbeat every few seconds and check if it responds. If it responds, it's up.

Consider this scenario:

App server is running ✓
Health check endpoint returns 200 ✓
But: DynamoDB circuit breaker is OPEN
Result: every message write returns 503

From the health check perspective: server is healthy. From the user's perspective: no messages are being sent. The health check missed the actual failure entirely.

Availability must be measured on real user requests, not synthetic pings.
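
The scenario above can be reproduced in a few lines. This is a minimal sketch rather than the system's real code: AppServer, health_check, and write_message are hypothetical names, and the DynamoDB circuit breaker is reduced to a boolean flag.

```python
class AppServer:
    """Toy app server; every name here is an illustrative assumption."""

    def __init__(self):
        self.dynamo_circuit_open = False  # stand-in for a real circuit breaker
        self.messages_attempted = 0
        self.messages_delivered = 0

    def health_check(self) -> int:
        # Only proves the process is alive; it never touches DynamoDB.
        return 200

    def write_message(self, body: str) -> int:
        self.messages_attempted += 1
        if self.dynamo_circuit_open:
            return 503  # write rejected while the circuit is open
        self.messages_delivered += 1
        return 200


server = AppServer()
server.dynamo_circuit_open = True
statuses = [server.write_message(f"msg {i}") for i in range(5)]

print(server.health_check())  # 200
print(statuses)               # [503, 503, 503, 503, 503]
print(server.messages_delivered / server.messages_attempted)  # 0.0
```

The synthetic ping reports a healthy server while the ratio computed from real requests is zero, which is the gap the health check cannot see.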


The availability SLI formula

Availability = successful operations / total operations

Message delivery availability:

Each app server tracks two counters:
- messages_attempted: incremented on every message received
- messages_delivered: incremented when a delivery ack is received from the recipient

Delivery availability = messages_delivered / messages_attempted
Target: 99.99%

Connection success availability:

Each connection server tracks:
- connection_attempts: incremented on every WebSocket upgrade request
- connection_successes: incremented on successful WebSocket establishment

Connection availability = connection_successes / connection_attempts
Target: 99.9%
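
Both SLIs are the same shape: a pair of counters and a ratio. The sketch below shows one way to hold that pair in process. RatioSLI and its methods are hypothetical names, the traffic is made up, and returning None when there has been no traffic is an assumption (the ratio is undefined in that case, and the text does not say how to treat it). A real server would also need thread-safe counters and a time window.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RatioSLI:
    name: str
    target: float       # e.g. 0.9999 for 99.99%
    attempted: int = 0
    succeeded: int = 0

    def record(self, success: bool) -> None:
        # Every operation counts as an attempt; only good ones as a success.
        self.attempted += 1
        if success:
            self.succeeded += 1

    def availability(self) -> Optional[float]:
        if self.attempted == 0:
            return None  # assumption: no traffic means no measurement
        return self.succeeded / self.attempted

    def meets_target(self) -> bool:
        value = self.availability()
        return value is None or value >= self.target


delivery = RatioSLI("delivery", target=0.9999)
connection = RatioSLI("connection", target=0.999)

# Made-up traffic: 1 failed message and 20 failed connections per 10,000.
for i in range(10_000):
    delivery.record(success=(i != 0))
    connection.record(success=(i >= 20))

print(delivery.name, delivery.availability(), delivery.meets_target())
print(connection.name, connection.availability(), connection.meets_target())
```

With these numbers delivery sits exactly on its 99.99% target, while connection availability (99.8%) falls below its 99.9% target.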

What counts as success for message delivery

Message sent → delivery ack received within 30s     ✓ success
Message sent → recipient offline → pending_deliveries → delivered on reconnect  ✓ success
Message sent → DynamoDB circuit OPEN → 503 returned  ✗ failure
Message sent → app server timeout                    ✗ failure
Message sent → rate limited (429)                    ✓ success (system worked correctly)

Rate limited messages count as successes — the system responded correctly to an abusive client. A 503 from a DynamoDB outage counts as a failure — the system broke.

Pending delivery (offline recipient) counts as a success — the message is durably stored and will be delivered. The system fulfilled its contract.
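
The rules above amount to a small classification table. The following sketch encodes them as an enum plus a lookup. The enum members and the function name are illustrative assumptions, and only the five outcomes listed above are covered.

```python
from enum import Enum


class DeliveryOutcome(Enum):
    ACKED_WITHIN_30S = "delivery ack received within 30s"
    QUEUED_FOR_OFFLINE_RECIPIENT = "stored in pending_deliveries"
    RATE_LIMITED_429 = "rate limited (429)"
    STORAGE_CIRCUIT_OPEN_503 = "DynamoDB circuit open (503)"
    APP_SERVER_TIMEOUT = "app server timeout"


# Outcomes where the system fulfilled its contract.
SUCCESSFUL_OUTCOMES = frozenset({
    DeliveryOutcome.ACKED_WITHIN_30S,
    DeliveryOutcome.QUEUED_FOR_OFFLINE_RECIPIENT,
    DeliveryOutcome.RATE_LIMITED_429,
})


def counts_as_success(outcome: DeliveryOutcome) -> bool:
    return outcome in SUCCESSFUL_OUTCOMES


# One message per outcome, purely for illustration.
observed = list(DeliveryOutcome)
delivered = sum(counts_as_success(o) for o in observed)

print(f"{delivered}/{len(observed)} counted as success")  # 3/5 counted as success
```

Keeping the success set explicit makes the policy reviewable in one place: moving an outcome from failure to success is a one-line change rather than a condition buried in request handling.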


What counts as success for connections

WebSocket upgrade → connection established → authenticated  ✓ success
WebSocket upgrade → server overloaded → 503               ✗ failure
WebSocket upgrade → invalid auth token → 401              ✓ success (system worked correctly)
WebSocket upgrade → timeout                                ✗ failure

401 is a success — the system correctly rejected a bad token. 503 under load is a failure — the system couldn't serve the request.
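
The same idea applied to connections, as a sketch keyed on the result of the upgrade attempt. It assumes the server answers a successful, authenticated upgrade with HTTP 101 (the standard WebSocket handshake status) and that a timeout is represented as None. The function name is hypothetical, and statuses the text does not classify raise an error instead of being guessed.

```python
from typing import Optional

SWITCHING_PROTOCOLS = 101  # HTTP status of a completed WebSocket upgrade


def connection_counts_as_success(status: Optional[int]) -> bool:
    """status is the HTTP status of the upgrade attempt, or None on timeout."""
    if status is None:
        return False  # timeout: the system never answered
    if status in (SWITCHING_PROTOCOLS, 401):
        return True   # established, or a bad token correctly rejected
    if status == 503:
        return False  # overloaded: the system could not serve the request
    raise ValueError(f"unclassified upgrade status: {status}")


# Illustrative attempts: two good upgrades, one bad token, one 503, one timeout.
attempts = [101, 101, 401, 503, None]
connection_successes = sum(connection_counts_as_success(s) for s in attempts)

print(f"{connection_successes}/{len(attempts)} counted as success")  # 3/5 counted as success
```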


Separate availability per path

Message delivery and connection success run on different infrastructure and have different failure modes. Measuring them separately means a connection storm breaches the connection SLI and pages immediately, even while delivery metrics still look healthy.

Connection availability SLI:  connection_successes / connection_attempts  → target 99.9%
Delivery availability SLI:    messages_delivered / messages_attempted     → target 99.99%
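
A short worked example shows why the two ratios should not be blended. The counts are hypothetical, chosen to resemble a connection storm during otherwise normal message traffic. The blended figure is checked against the looser 99.9% target, which is itself an assumption about how a combined metric might be configured.

```python
def availability(successes: int, attempts: int) -> float:
    return successes / attempts


# Hypothetical window during a connection storm.
messages_attempted, messages_delivered = 1_000_000, 999_950
connection_attempts, connection_successes = 10_000, 9_500

delivery = availability(messages_delivered, messages_attempted)
connection = availability(connection_successes, connection_attempts)
blended = availability(
    messages_delivered + connection_successes,
    messages_attempted + connection_attempts,
)

delivery_ok = delivery >= 0.9999    # delivery target: 99.99%
connection_ok = connection >= 0.999  # connection target: 99.9%
blended_ok = blended >= 0.999        # single combined metric

print(f"delivery   {delivery:.4%}  ok={delivery_ok}")
print(f"connection {connection:.4%}  ok={connection_ok}")
print(f"blended    {blended:.4%}  ok={blended_ok}")
```

Here 5% of connection attempts fail, yet the blended number stays above 99.9% because message volume dwarfs connection volume. Only the separate connection SLI reports the breach.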

Interview framing

"Availability is measured on real traffic — successful operations divided by total. Rate limiting and 401s count as successes. 503s and timeouts count as failures. Pending delivery counts as success — message is durable. Connection and delivery are tracked separately so a connection server failure alerts independently of delivery health."