I spent two weeks last January staring at dashboards that told me everything was fine. Our p99 latency sat under 200ms. Uptime held at four nines. The error budget spreadsheet our VP printed and taped to the office wall showed 47 minutes of remaining downtime for the quarter. By every metric we'd agreed upon, we were winning.
The Problem With Averages
Here is what those numbers did not say: our payment pipeline had been silently dropping 0.3% of webhook events for six weeks. Not failing, just losing them in a race condition that only triggered under specific load patterns. Customers saw no errors. They saw nothing. Orders confirmed, money charged, fulfillment never triggered. We found out because someone's grandmother called support asking why her birthday gift never arrived.
"Reliability is not a number on a dashboard. It is whether your users' grandmothers get their birthday presents on time."