In January 2023, our on-call rotation broke. Not metaphorically — four engineers in a single week reported they could no longer distinguish between a real incident and the ambient noise of a system running at its ceiling. The login flow, which depended on eleven synchronous service hops, had crept from 400ms to over six seconds at the P99. We had built a distributed monolith and called it microservices.
The cascade problem nobody warned us about
We traced the root cause to a pattern that felt natural at the time: service A calls service B, which calls service C, and so on. Each hop added network overhead, serialization cost, and retry logic. When the identity service slowed by 200ms under load, the downstream path stalled across checkout, recommendations, and notifications.