Dispatch № 003 January MMXXVI

The call we never made.

A page that didn’t fire, an oncall who didn’t wake, and the forty-second window between the two.

Lede

The best incidents are the ones nobody remembers. This one we only know about because the system filed a receipt.

The spike

At 02:41 UTC the auth-edge p99 jumped from 38ms to 410ms. Three regions saw it; one of them was already shedding load to a sibling. The dashboards we no longer keep would have painted the spike in red. The dashboards we do keep didn’t mention it, because the system was already deciding what to do.

The forty seconds

For forty seconds the spike was held in a window the system maintains for exactly this case — a quiet zone where a decision can be unmade. During those forty seconds the load shed itself, the sibling region absorbed the overflow, and a config diff that had been queued by an automated retry-policy review four hours earlier was rolled back without ceremony.
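One way to picture a window like this is a threshold check that waits before it pages. The sketch below is illustrative only and is not the system described here: the class name HoldWindow, the 200ms threshold, and the state names are assumptions, and only the forty-second hold comes from the account above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HoldWindow:
    """Hypothetical hold-before-page check; all names and defaults are assumed."""

    threshold_ms: float = 200.0   # assumed paging threshold, not from the post
    hold_seconds: float = 40.0    # the forty-second window
    breach_started: Optional[float] = None

    def observe(self, now: float, p99_ms: float) -> str:
        """Return 'ok', 'hold', 'resolved', or 'page' for one p99 sample."""
        if p99_ms < self.threshold_ms:
            # Latency is healthy; if we were holding, the breach resolved itself.
            recovered = self.breach_started is not None
            self.breach_started = None
            return "resolved" if recovered else "ok"
        if self.breach_started is None:
            # First sample over the threshold opens the window.
            self.breach_started = now
        if now - self.breach_started >= self.hold_seconds:
            # The breach outlasted the window, so a human gets paged.
            return "page"
        return "hold"


window = HoldWindow()
states = [
    window.observe(0.0, 38.0),    # baseline
    window.observe(1.0, 410.0),   # spike begins, window opens
    window.observe(30.0, 410.0),  # still inside the window
    window.observe(40.0, 41.0),   # recovered before a page was sent
]
print(states)
```

In this sketch a breach that clears inside the window never reaches the "page" state, which is the behaviour the night in question relied on.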

The cheapest incident is the one whose root cause has already been scheduled for removal.

The page that didn’t fire

There was no page. The Watch on the desk stayed dark. The oncall’s phone — on the bedside table, do-not-disturb until 06:00 — never lit up. By 02:42:21 the p99 was back at 41ms and the system had filed a one-line note in the morning queue.

The morning after

The note read: auth-edge · transient p99 · resolved at source · diff #2014 reverted · no human action.

There was a time when this would have been a five-page incident report, a Slack thread, a 9am standup with the word “learnings” in it. Now it is a line in a queue, read with coffee, deleted without comment. We are not sure that is progress — only that it is what we built.