
Your Hallucination Rate Is Five Times What You Think It Is

Most teams think they know their hallucination rate, but they're only measuring the errors that look wrong. The harder, more common failures are the ones that look right. Here's why your real number is probably five times higher, and why that changes everything about how you should evaluate AI reliability.
03 Jun 26

Every enterprise we talk to has a number. They've run evals, tracked incidents, built dashboards. The number is always wrong, and always wrong in the same direction.

There are two kinds of hallucinations in production systems. Think of it as an iceberg: you've only been measuring the part above the waterline.

The first kind is easy to catch: you ask for an account balance and get the wrong number. A date that doesn't exist. A product that doesn't exist. When the output can be checked against a known fact, the error shows up. Teams see these errors and log them, and they become the hallucination rate.

The second kind is much harder to catch, and far more common. A refund policy that sounds completely right, but belongs to a different customer tier. A flight policy that's internally consistent, well-written, and simply not your policy. A summary of account history that has the right shape but the wrong details. When a developer reviews that output in testing, it looks correct. It sounds correct. The system was designed to produce outputs indistinguishable from the truth, and in this case, it succeeded.

"Looks right" and "is right" are not the same thing.

This isn't a failure of diligence on the part of the teams we talk to. It's a predictable consequence of how these models work. A standard LLM is a plausibility engine. It was optimized to produce convincing output, not true output. The mistakes it makes are often indistinguishable from correct answers because indistinguishability is the design goal. When you build a measurement system on top of that, you capture the errors that look wrong. You miss the ones that don't.

The practical consequence is significant. If your deployment decision is grounded in a hallucination rate that's one-fifth the real number, your risk calculus is wrong. The bar you set for "good enough to ship" is wrong. The evaluation process you're using to get there is wrong.

The instinctive response is to constrain the model: narrow the scope, tighten the domain, reduce the surface area until failures become rare enough to tolerate.

The problem is you've now got an 18-wheeler delivering one letter. You've taken a system built for enormous breadth and throttled it down to the point where the AI-focused approach stops making sense. You haven't solved the reliability problem. You've just shrunk the model until it can't cause trouble, or do much of anything else.

The question enterprises should be asking isn't just "how often does it hallucinate?" It's "how often would I be able to tell?"

Those are very different questions. And for any system you're planning to let take action on your policies, your customers, or your data, the second one is the one that matters.
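To make the link between the two questions concrete, here is a minimal back-of-the-envelope sketch. It assumes you have hand-audited a sample of outputs and counted how many of the confirmed errors your normal evals had already flagged. The function names and the numbers are hypothetical, chosen only for illustration, and the estimate is only as good as the audit sample is representative.

```python
def detection_recall(caught: int, total_found_in_audit: int) -> float:
    """Fraction of audit-confirmed errors that the normal eval process had already flagged."""
    if total_found_in_audit <= 0:
        raise ValueError("audit must contain at least one confirmed error")
    if not 0 <= caught <= total_found_in_audit:
        raise ValueError("caught must be between 0 and total_found_in_audit")
    return caught / total_found_in_audit


def estimated_true_rate(measured_rate: float, recall: float) -> float:
    """Scale the measured rate by how often errors are actually noticed.

    measured_rate ~= true_rate * recall, so true_rate ~= measured_rate / recall.
    The result is capped at 1.0 because a rate cannot exceed 100%.
    """
    if not 0.0 <= measured_rate <= 1.0:
        raise ValueError("measured_rate must be in [0, 1]")
    if not 0.0 < recall <= 1.0:
        raise ValueError("recall must be in (0, 1]")
    return min(1.0, measured_rate / recall)


if __name__ == "__main__":
    # Hypothetical audit: 20 confirmed errors, only 4 of which the evals had caught.
    recall = detection_recall(caught=4, total_found_in_audit=20)
    measured = 0.02  # hypothetical dashboard number
    print(f"detection recall: {recall:.0%}")
    print(f"estimated true rate: {estimated_true_rate(measured, recall):.1%}")
```

With these illustrative inputs, a team that can tell only one time in five would be looking at a real rate of roughly 10% behind a measured 2%. "How often would I be able to tell?" is the denominator.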

At Scaled Cognition, we built APT to close that gap, with real guarantees on accuracy. And because reliability is engineered into the architecture, not bolted on after the fact, we didn't have to trade anything away: the result is models that are smaller, faster, less expensive, and more accurate than the frontier models we outperform.

