I spent two weeks last January buried in post-mortems from the previous quarter. Eleven incidents, seven of which had dashboards whose alerts should have fired minutes before the pages came in. The panels were green. The thresholds were set. The data was there, but nobody was watching the right thing.

The trap of dashboard completeness

Most teams build dashboards reactively. Something breaks, you add a panel. An exec asks a question, you spin up a new row. Six months later you have forty-seven panels across nine dashboards and nobody can tell you which three metrics actually predict the next outage. The dashboard becomes a museum of past incidents rather than a tool for preventing future ones.

“We reduced our monitoring surface from 2,400 metrics to 340. Mean time to detection dropped from twelve minutes to ninety seconds.” — Platform team retrospective, Q3 2024

The shift started when we stopped asking “what should we monitor?” and started asking “what broke last time, and which single metric would have told us first?” Working backward from incidents gave us a ranked list. Everything else became noise we could safely archive.
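
That working-backward ranking can be sketched in a few lines. This is a minimal illustration, assuming each post-mortem records how many seconds before the page each candidate metric first looked abnormal. The `Incident` shape, the `lead_seconds` field, and the function names are hypothetical, not taken from the retrospective.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    # Hypothetical record shape; adapt it to however post-mortems are stored.
    incident_id: str
    # Metric name -> seconds between the metric's first abnormal reading
    # and the page. A larger value means an earlier warning.
    lead_seconds: dict[str, float]


def earliest_signal(incident: Incident) -> Optional[str]:
    """Return the metric that deviated first, or None if none did."""
    if not incident.lead_seconds:
        return None
    return max(incident.lead_seconds, key=incident.lead_seconds.get)


def rank_metrics(incidents: list[Incident]) -> list[tuple[str, int]]:
    """Rank metrics by how many incidents they would have flagged first."""
    counts: Counter = Counter()
    for incident in incidents:
        metric = earliest_signal(incident)
        if metric is not None:
            counts[metric] += 1
    # most_common() sorts by count, highest first.
    return counts.most_common()
```

Metrics that never come first in any incident simply do not appear in the result, which is one rough way to spot candidates for archiving. Counting "first to deviate" is only one possible scoring rule; weighting by lead time or incident severity would be just as defensible.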