
Why We Replaced Static Thresholds With Signal-Based Alerting

After three years of on-call rotations drowning in noise, our team rebuilt alerting from first principles. The results changed how we think about incidents entirely.

Marcus Chen • Jan 15, 2025 • 8 min read

Last November, I sat in a war room reviewing the aftermath of a cascading failure that took down our payment processing pipeline for forty-seven minutes. The irony was painful: alerts had fired for hours, but the team had learned to ignore them. Our alert-to-incident ratio was 340:1.

The Static Threshold Trap

Every alert in our system was defined the same way: a hard numeric threshold that had to stay crossed for a sustained period. CPU above 85% for five minutes. Error rate above 2% for three minutes. These rules felt precise when we wrote them, but auto-scaling groups routinely crossed them during healthy deploy windows.
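
To make the shape of these rules concrete, here is a minimal Python sketch of a sustained static threshold check. It is an illustration, not our production rule engine: the `StaticRule` class, the `fires` function, and the sample values are all hypothetical, and it assumes metrics arrive as time-ordered `(timestamp_seconds, value)` pairs.

```python
from dataclasses import dataclass


@dataclass
class StaticRule:
    """Hypothetical static rule: value above `threshold` for `sustain_seconds`."""
    name: str
    threshold: float
    sustain_seconds: int


def fires(rule, samples):
    """Return True if the metric stayed above the threshold continuously
    for at least rule.sustain_seconds.

    `samples` is assumed to be a list of (timestamp_seconds, value) pairs
    sorted by timestamp.
    """
    breach_start = None
    for ts, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach
            if ts - breach_start >= rule.sustain_seconds:
                return True
        else:
            breach_start = None  # any dip below the threshold resets the clock
    return False


# The two rules quoted above, expressed in this sketch.
cpu_rule = StaticRule("cpu_high", threshold=85.0, sustain_seconds=300)
error_rule = StaticRule("error_rate_high", threshold=0.02, sustain_seconds=180)

# Made-up data: CPU pinned at 92% for seven minutes, sampled once a minute,
# the way a busy but healthy deploy window might look.
deploy_window = [(t, 92.0) for t in range(0, 421, 60)]

# Made-up data: a two-minute spike that recovers on its own.
brief_spike = [(0, 92.0), (60, 93.0), (120, 40.0), (180, 41.0)]

print(fires(cpu_rule, deploy_window))
print(fires(cpu_rule, brief_spike))
```

The rule sees only the number and the clock. Nothing in it can tell a deploy window from a real degradation, so in this sketch the healthy deploy fires the alert while the brief spike does not.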