The Enterprise AI Reliability Crisis

Today we're announcing our $100 million Series A led by Vinod Khosla at Khosla Ventures.

My co-founder Dan Klein and I have been building AI for decades. He leads the Berkeley Artificial Intelligence Research Lab and Berkeley NLP Group at UC Berkeley and is one of the most cited NLP researchers in the world. Before founding Scaled Cognition, we built Semantic Machines, one of the first agentic AI companies, which was acquired by Microsoft. We spent years there building and applying AI to massive-scale enterprise workloads, and despite our resources, we continually ran into a reliability gap.

You'd test an interaction, it would seem perfect, and the technology would appear robust. Upon deeper inspection, you'd uncover grievous, systematic errors hiding behind convincing responses. We've all seen this pattern by now: the demo is exciting, the launch video is compelling, but under real-world conditions the system is unreliable and the deployment is quietly rolled back.

The consequences of making errors in the real world can be significant. An AI takes a prescription refill request, sends a hallucinated prescription ID to the pharmacy, and the customer picks up the wrong drug. The model was confident the whole time. It didn't hedge and it didn't flag uncertainty; it just produced a wrong answer that sounded exactly like a right one.

What we came to understand is that this wasn't a prompting problem or a fine-tuning problem; it was structural. General-purpose LLMs generate tokens. They're probabilistic, designed to produce what's most plausible, not what's correct. In high-stakes workflows, plausible and correct are not interchangeable.

As an AI lab, we went back to first principles and rethought the architecture from the ground up. One of the core breakthroughs is doing for conversational AI what verifiable reinforcement learning has done for coding: reliability engineered in from the start, not bolted on after the fact.

Our focus has been on developing super-reliable intelligence: models that are designed to tell the truth, eliminate large classes of problematic hallucinations, and stick to the rails you define. A byproduct of the approach is that our models are smaller, faster, and less expensive, while being more accurate and predictable than the frontier models they outperform.

We launched commercially in February, and today around 100 large enterprises, including many in the Fortune 500, across financial services, healthcare, telecom, and insurance are building on our platform. These companies can't afford mistakes in their customer-facing AI. Over the next year or so, we expect our models to automate over a billion customer service interactions. Customer experience is the first domain we've applied our technology to, but we see opportunity across broad swaths of enterprise processes where automation has to deliver on correctness, speed, and unit economics.

For decades, Global 2000 companies outsourced their contact centers, IT support, and back-office operations to BPOs, not because they wanted to, but because running these at scale required armies of people, and labor arbitrage made that math work. That market is at least $600 billion a year, and AI breaks that math entirely.

When intelligence is software, the logic of outsourcing disappears. But what we're hearing from enterprises is that they don't want to swap one managed service for another in the form of an outsourced AI black box. They want to own their intelligence, built on their infrastructure, trained on their policies, deployed on their terms. A sovereign AI workforce they actually control. And that only works if the AI is reliable and can be run in-house.

Frontier models have an amazing capacity for intelligence. What they haven't solved is reliability. When we go into large enterprises, we consistently find that their actual hallucination rate is five times what they think it is. They don't know because the model always sounds confident, and that's exactly what makes it so dangerous. When a system is wrong 30% of the time, complex issues go unresolved and customers lose trust. With our models, complex issues are resolved correctly, and that compounds into hundreds of millions in operational savings and customers who are genuinely astonished by how useful the technology is.

Beyond the model itself, we've built everything enterprises need to author, test, deploy, and monitor these systems in their VPCs: agentic tooling, simulation and evaluation frameworks, and live agent monitoring. It is the full stack to build, test, and operate reliable AI at scale.

Reliability is THE key challenge facing broad and sustained adoption of customer-facing AI in the enterprise. We're solving that challenge. If you want to work on one of the most important challenges in AI, we're hiring.

