Why We Spent Six Months on One Safety Test

In a field obsessed with scale, we chose to invest our time in the work nobody benchmarks.

Elena Marsh · May 8, 2026 · 12 min read

Last November, our team made a decision that felt counterintuitive at the time. While the industry was racing to ship larger models on shorter timelines, we paused an entire product cycle to run a single, deeply unglamorous evaluation. The test — internally called Meridian — measured how our system behaved in long, ambiguous conversations where the stakes were real but the right answer was not obvious.

The Problem with Moving Fast

The pressure to ship is real, and it is not going away. Every quarter brings new benchmarks, new architectures, new papers claiming breakthroughs. But what we kept noticing, in our own reviews and across the field, was that the most consequential failures were not the ones standard evaluations caught. They were the quiet ones: a model that subtly reinforced a user's flawed assumption, or one that gave wrong medical information in a confident, authoritative tone.