Why We Spent Six Months on One Safety Test
In a field obsessed with scale, we chose to invest our time in the work nobody benchmarks.
Last November, our team made a decision that felt counterintuitive at the time. While the industry was racing to ship larger models on shorter timelines, we paused an entire product cycle to run a single, deeply unglamorous evaluation. The test — internally called Meridian — measured how our system behaved in long, ambiguous conversations where the stakes were real but the right answer was not obvious.
The Problem with Moving Fast
The pressure to ship is real, and it is not going away. Every quarter brings new benchmarks, new architectures, new papers claiming breakthroughs. But what we kept noticing, in our own reviews and across the field, was that the most consequential failures were not the ones standard evaluations caught. They were the quiet ones: a model that subtly reinforced a user's flawed assumption, or one that gave incorrect medical context in an authoritative tone.