
Intelligence Is Not the Bottleneck

Reliability — not capability — is what determines whether enterprise AI can actually ship. Dan Klein on the reliability gap nobody's measuring, why the popular workarounds fail, and what it takes to build trust into AI from the ground up.
14 Apr 26


Last week at HumanX, I was on stage with Vinod Khosla talking about something the AI industry doesn't discuss enough: reliability. Not capability. Not reasoning. Not the next benchmark. Reliability — the thing that actually determines whether you can ship.

The conversation stayed with me, so I wanted to expand on it here.

The race for the past several years has been to make models smarter. And we've succeeded. Intelligence is abundant. What remains genuinely limited is the ability to trust what these systems do — especially when the stakes are real.

The gap nobody's measuring

Most enterprises think they know their hallucination rate. They don't — and the gap is bigger than you'd expect.

The reason isn't negligence. A standard LLM is trained to produce plausible output — by design, its mistakes are often indistinguishable from correct answers until you probe them. Most hallucinations pass for the truth. When you're quoting someone's bank account balance, reading back their medical records, or taking any action with real consequences, that hidden error rate isn't just a product issue. It's a liability.

Why the popular workarounds fail

The industry has developed two responses to this reliability gap, and both are dead ends.

The first: constrain the model so tightly it can only make the smallest decisions. The result is an interactive voice response (IVR) experience with a chatbot face on it. You've solved for reliability by removing everything that made the AI approach valuable in the first place.

The second: add a second model to check the first. Now you have two noisy models reviewing each other. A tangle of model calls isn't the path to reliability; it's the path to spiraling latency and token consumption. You get added complexity and compounding errors.

What actually works

Reliability can't be retrofitted. It has to be the architectural starting point.

At Scaled Cognition, actions and information are first-class objects in our system — not outputs to be filtered after the fact, but things the model is explicitly built to reason about. That architecture is what allows us to make real guarantees about what a model will do and, just as importantly, what it won't. The system is also self-auditing: you can see where information went, what actions were taken, and why — not a post-hoc explanation invented by a second model, but a verifiable record built into how it works.
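
To make the general idea concrete, here is a minimal sketch of what treating actions as first-class, auditable objects can look like. It is an illustration only, not Scaled Cognition's implementation; the Action and AuditLog names, their fields, and the policy check are all hypothetical.

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class Action:
        # A structured action the system reasons about explicitly, not free-form text.
        name: str          # e.g. "issue_refund" (hypothetical action name)
        arguments: dict    # validated, structured inputs
        rationale: str     # why this action was selected

    @dataclass
    class AuditLog:
        # A record built up as actions execute, not reconstructed afterward.
        entries: list = field(default_factory=list)

        def record(self, action: Action, permitted: bool, outcome: str) -> None:
            self.entries.append({
                "time": datetime.now(timezone.utc).isoformat(),
                "action": action.name,
                "arguments": action.arguments,
                "rationale": action.rationale,
                "permitted": permitted,
                "outcome": outcome,
            })

    # Usage: each proposed action is checked against policy and logged as it happens.
    log = AuditLog()
    proposed = Action("issue_refund", {"order_id": "A123", "amount": 40.0},
                      "customer reported a duplicate charge")
    permitted = proposed.arguments["amount"] <= 50.0  # stand-in for a real policy check
    log.record(proposed, permitted, "executed" if permitted else "blocked")

The point of the sketch is the shape, not the details: when actions are explicit objects and the audit trail is a byproduct of execution, there is nothing to filter or explain after the fact.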

An agent that can't act isn't an agent

The reliability gap might be tolerable in certain contexts. In an agentic world, it isn't.

An agent takes actions. That's definitional — it's what makes an agent an agent. And actions have consequences. You can't responsibly act unless you know you're right. Vinod made this point with some force at HumanX: "It's almost an oxymoron to say you're an agent and you can't do anything." Most of what passes for agentic AI in production today can't take consequential actions because the underlying models can't be trusted enough to take them.

Reliability built in means you can move fast

One underappreciated consequence of getting reliability right architecturally: you don't need months to get to production.

The reason enterprise AI deployments take so long isn't that the use cases are complicated. It's that reliability is being retrofitted — companies spend months tuning prompts, running evals, trying to squeeze performance out of a system that wasn't designed for this degree of control. Eventually you run out of ways to tweak the prompt. That's not a deployment problem. That's an architecture problem.

When reliability is built in from the start, you can move fast. That's the insight behind Agent Twin, which we announced at HumanX. Feed it transcripts from your existing human support agents — the ones already following your policies, already handling your edge cases — and within a day it produces an agent that replicates those behaviors, reliably and verifiably. Not because we found a shortcut, but because a system architected for reliability doesn't need months of prompting to get there.

The question that should drive everything

The industry keeps asking: how smart is this model? That's the wrong question for anyone trying to build something real.

The question that determines whether you can ship is: can I trust it to act? That's the idea behind what we call super-reliable intelligence — not just models that are capable, but models you can depend on to act correctly when it counts. Trust isn't a layer you add. It's a foundation you build — or it's something you spend the rest of the project trying to compensate for.

About Dan Klein

Dan Klein is co-founder and CTO at Scaled Cognition, applying decades of AI research and company-building to solve one of enterprise AI's hardest problems: hallucinations. As a professor at UC Berkeley, Dan is one of the most decorated NLP researchers in the field, and his students have gone on to lead AI at top universities and model labs. As an entrepreneur, he co-founded adap.tv, which applied machine learning to advertising and was acquired by AOL, and Semantic Machines, which made breakthroughs in action-taking neural models and program synthesis before its acquisition by Microsoft.

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique. Duis cursus, mi quis viverra ornare, eros dolor interdum nulla, ut commodo diam libero vitae erat."

Daniel De Castro
Co-Founder & COO at X Company
Webflow logo