Why We Replaced Static Thresholds With Signal-Based Alerting
After three years of on-call rotations drowning in noise, our team rebuilt alerting from first principles. The results changed how we think about incidents entirely.
Last November, I sat in a war room reviewing the aftermath of a cascading failure that took down our payment processing pipeline for forty-seven minutes. The irony was painful: alerts had fired for hours, but the team had learned to ignore them. Our alert-to-incident ratio was 340:1.
The Static Threshold Trap
Every alert in our system was defined the same way: a hard numeric threshold that had to stay crossed for a sustained period. CPU above 85% for five minutes. Error rate above 2% for three minutes. These rules felt precise when we wrote them, but auto-scaling groups routinely crossed them during healthy deploy windows.
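To make the shape of such a rule concrete, here is a minimal Python sketch of a "threshold crossed for a sustained period" check. It is an illustration only, not the rule engine described in this article: the names `StaticThresholdRule` and `fires`, and the `(timestamp, value)` sample format, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class StaticThresholdRule:
    """Hypothetical static rule: value above `threshold` for `sustain_seconds`."""
    name: str
    threshold: float
    sustain_seconds: float


def fires(rule: StaticThresholdRule,
          samples: Sequence[Tuple[float, float]]) -> bool:
    """Return True if the metric stayed strictly above the threshold for at
    least `sustain_seconds` without interruption.

    `samples` is assumed to be (timestamp_seconds, value) pairs sorted by time.
    """
    breach_start: Optional[float] = None
    for timestamp, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of a new breach
            if timestamp - breach_start >= rule.sustain_seconds:
                return True
        else:
            breach_start = None  # any dip below the threshold resets the clock
    return False


# The two example rules from the text.
cpu_rule = StaticThresholdRule("cpu_high", threshold=85.0, sustain_seconds=300)
error_rule = StaticThresholdRule("error_rate_high", threshold=2.0, sustain_seconds=180)

# Made-up samples, one per minute: CPU pinned at 90% for five minutes.
sustained_cpu = [(t, 90.0) for t in range(0, 301, 60)]

# Made-up samples: same window, but one reading dips to 80% at t=120.
interrupted_cpu = [(0, 90.0), (60, 90.0), (120, 80.0),
                   (180, 90.0), (240, 90.0), (300, 90.0)]
```

Note that the check sees only the number and the clock. With these made-up samples, `fires(cpu_rule, sustained_cpu)` is true whether the five minutes of high CPU come from a real problem or from a healthy deploy, which is exactly the trap: the rule has no notion of what the system was doing at the time.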