Building Models that Can’t Lie

I have been in AI for decades, as a professor of computer science at UC Berkeley, as a co-founder of multiple AI startups, and as a Technical Fellow at Microsoft, and I can assure you that reliability – more than any other aspect of intelligence – is going to make or break the future of enterprise AI. That’s why at Scaled Cognition we’re focused on making verifiable models that are designed for reliability from the ground up.

Reliability Has Not Kept Pace

Intelligence is multifaceted, and, when it comes to AI, those facets have not progressed evenly. Horizontal knowledge, contextual understanding of natural language? In current LLMs, these aspects have advanced at a staggering rate. Reliability, truth, control? These critical aspects have lagged behind badly. Worse, many intelligence-extending techniques, such as RL post-training and test-time reasoning can actually raise rates of hallucination and encourage deception.¹

What kinds of unreliability get in the way of AI deployments today? Hallucinations are a major one that anyone who has used LLMs has experienced. We can define “hallucinating” as outputting information that is not supported by the system’s inputs, and it’s a failure of truth. Violating policies is a failure of controllability. Telling the user what they want to hear in order to optimize positive feedback is a failure of alignment.

Truth, controllability, and alignment are the least we should demand from AI models. It’s neither an exotic request nor an impossible property of technology per se. After all, calculators don’t hallucinate. Of course calculators don’t prove open conjectures either, because reliability without intelligence is limited. Intelligence without reliability is limited, too, which is where standard LLMs fall behind. Enterprise AI requires both to be successful.

The Hallucination Iceberg

One reason reliability is so critical is that errors can be easy to miss. Take hallucinations, which are like an iceberg. The tip of that iceberg represents the ones that people actually catch. These are the blatant errors, the nonsense. Unfortunately, most hallucinations lurk below the surface. They’re plausible enough, well-formed enough, and subtle enough that they go unnoticed. For example, a model might invent a reasonable but incorrect cancellation policy.² It’s not just end users that don’t catch the subtler hallucinations; agent builders miss them too. When we have gone into enterprises and done careful audits of their existing AI agents, we’ve routinely found actual hallucination rates to be 5x what they had realized. When it comes to enterprise systems, errors are not better for going uncaught, especially in high-stakes domains like finance or healthcare.

‍Errors with No Smells

Current AI systems are certainly not the first to make mistakes. Older generations of technologies made plenty of mistakes. People make mistakes. As a civilization, however, we have developed instincts over time for detecting mistakes and reacting to them. Software engineers will know Beck and Fowler’s idea of “code smells”, surface indicators of underlying problems in a code base.³ Incorrect AI output used to have analogous “error smells”. For example, a sketchy webpage might be ridden with typos, load improperly, or feel off in other ways. Incorrect output from a statistical machine translation system might have untranslated source-language words or awkward phrasings. These smells were not definitive, but they were strong cues. Today’s AI systems have been thoroughly deodorized. They are trained to produce maximally plausible outputs that capture as many surface correlations as possible, then they are post-trained to be fluent, confident, and flattering in pursuit of user preferences. Errors have lost their smells and users have lost their primary cues for healthy skepticism.

This problem of undetectable errors is a way in which AI technology has gotten worse, not better. In past iterations of information technology, many mechanisms were in place to help vet information. In web search, for example, a host of features beyond text content will impact search placement, from link analysis to clickstream data, and such features do represent at least a limited form of information vetting. But the most important filter has always been the digital literacy of the users themselves, and that’s been severely compromised by the recent explosion of fluency. As a result, it is that much more important that the models themselves exhibit sufficient reliability.

‍More Intelligence Does Not Guarantee More Reliability

Unfortunately, reliability has been an enduring weakness of current modeling approaches. It certainly did not resolve through simple scaling. Worse, many approaches designed to enhance the broad intelligence of base models can exacerbate the reliability gap. For example, RLHF pushes models towards outputs users prefer. Do user preferences line up perfectly with models being truthful and confessing their uncertainty? Apparently not. Indeed, a hard lesson from RL, and machine learning in general, is that whenever you optimize a reward function that is not perfectly correlated with truth, truth will suffer. Consider a bot tasked with customer service for package delivery. If the bot is told to optimize user feedback, how long before it learns that users with lost packages react better when the system hides the truth and falsely claims the package is ok? That behavior would be more than a hallucination – it would be a deception. As another example, consider reasoning models. They can have upsides, but, in addition to burning compute rapidly, methods that involve trial-and-error can risk the model p-hacking⁴ itself into additional hallucinations. In other words, reliability can be at odds with other aspects of model improvement unless great care is taken.⁵

‍Architecting for Reliability

It was clear to us when we started Scaled Cognition that the reliability gap wasn’t going to close itself without new technology. You can of course try to bolt guardrails onto unreliable base models. One approach is to have complex constellations of noisy models checking each other, but that’s slow, expensive, and not particularly effective. Another approach is to reduce models to selecting paths along a rigid conversation tree, but that loses all the rich flexibility that modern AI offers in the first place. Such approaches are, ultimately, retrofits, and we don’t think you can retrofit reliability very well. You have to architect for it from the start. As a result, we have developed new approaches to verifiable modeling that dramatically increase reliability, while preserving contextuality, flexibility, and overall intelligence – essentially putting the guardrails into the models themselves.

‍Looking Forward

Having reliable models built on verifiable technologies will be critical to the success of AI in enterprise contexts. You can’t ship what you can’t trust. At Scaled Cognition, we are focused on super-reliability: models that tell the truth, follow policies, and give guarantees about their behavior. If you’re an enterprise that needs super-reliable agentic models, or if you’re a researcher or engineer who is excited about building them, we’d love to hear from you!

Building Models that Can’t Lie

Table Of Contents

Reliability Has Not Kept Pace

The Hallucination Iceberg

‍Errors with No Smells

‍More Intelligence Does Not Guarantee More Reliability

‍Architecting for Reliability

‍Looking Forward

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique. Duis cursus, mi quis viverra ornare, eros dolor interdum nulla, ut commodo diam libero vitae erat."

Related Posts

The Enterprise AI Reliability Crisis

Building Models that Can’t Lie

The Smell Is Gone. The Errors Aren't.

Building Models that Can’t Lie

Ready for real CX agents?