
Scaled Cognition at HumanX 2026

Our CTO Dan Klein joined Vinod Khosla and Anita Ramaswamy of The Information on stage at HumanX to discuss building reliable AI at enterprise scale.
10 Apr 26



This week, our CTO Dan Klein joined Vinod Khosla and moderator Anita Ramaswamy of The Information on stage at HumanX for a conversation on AI reliability versus raw intelligence – and what it takes to build production-grade AI for enterprise customer support.

Vinod opened by framing the core market problem: most LLM-based customer support systems hallucinate far more than reported, and at high volumes, that's a real risk – especially when the data in question is a bank balance or a medical record.

Key Takeaways
  • Scaled Cognition announces Agent Twin – convert human agent transcripts into a fully verifiable, non-hallucinating APT-1-based agent in a single day
  • Vinod Khosla: the gap between what AI can do and what's actually being deployed is one of the biggest missed opportunities in the market right now
  • Vinod Khosla: reliability is the defining requirement for enterprise AI – and most products on the market aren't close
  • Super-Reliable Intelligence isn't a feature you bolt on – Scaled Cognition architected for it from day one, at the model level
  • While the rest of the industry patches hallucinations with guardrails and ensemble checks, Scaled Cognition eliminates them structurally by treating actions and information as first-class objects
  • Vinod Khosla: companies doing differentiated R&D – not those building thin layers on public models – are where the lasting value in AI will accrue. Scaled Cognition is that company.


Full Transcript

0:05 – Moderator

All right, hope you came with the tough questions. We gotta get her out here again. The AI industry races toward ever more powerful models, but in the real world, the bottleneck isn't intelligence, it's of course reliability. To discuss building production-grade AI that you actually trust, please welcome Vinod Khosla from Khosla Ventures and Dan Klein from Scaled Cognition, moderated by Anita Ramaswamy from The Information.

0:50 – Anita Ramaswamy

Great to be here. So let's dive right into it. Last session of the day, I've been told. Dan, you're the founder of Scaled Cognition and the CTO. What's the pitch in a nutshell?

1:03 – Dan Klein

The pitch in a nutshell is we make models that don't hallucinate and that give you the ability to make guarantees about what they'll do and, more importantly, what they won't do. And that's really important for things like customer service, where you want to make sure that the model's not just lying to your face and making up policies. Really, any time you're taking actions that have consequences.

1:21 – Anita Ramaswamy

Got it. So Vinod, as an investor in Scaled Cognition, what gives you confidence in Dan's approach? Is it better models? Is it better guardrails?

1:32 – Vinod Khosla

Look, in the end, customers care about reliability. There are a lot of customer support applications. We know the popular names like Sierra and Decagon and so on. They're all built on LLMs that hallucinate. And in these applications they generally hallucinate way more than people think. In every evaluation we've been in, I think it's five times what we were told the hallucination rate would be. And when you're getting somebody's bank balance or somebody's medical record, you just can't make things up. And so super reliability is very important. And most of the customer support services are just terrible at it. And frankly, because of that, Scaled Cognition is probably winning almost every head-to-head evaluation they get into. In places where people have done real evaluations, they almost always win head-to-head, unless somebody is in the fashion business. There are plenty of Silicon Valley fashion names, but they're not great at a product that meets customers' needs.

2:46 – Anita Ramaswamy

I would love to hear just a little bit more from your perspective as an investor also in OpenAI and other foundation models. What makes Scaled Cognition stand out to you, and what's the need for something like that?

2:55 – Vinod Khosla

Well, very simple. If you can't assure a customer that you won't give the wrong answer, can you have a customer support business? No, you can't. If you're wrong even 10% of the time, it's a real problem, and these things hallucinate. If you have multi-agent systems, the hallucination compounds and your false answer rate becomes huge. I think in many of these trials, the customers will have to switch. Scaled Cognition ensures this super reliability. We will not give you the wrong answer, and that's what matters. And because of that, how many of the current famous customer support applications are on track with existing customers to get to a billion conversations a year? That's like a minimum level at which you're a viable customer support company. The answer is I don't think any of them will be doing a billion conversations a year within the next year. But I do think Scaled Cognition will, with its existing customers – and they have more big logos than most of these companies.

4:09 – Anita Ramaswamy

So I want to kind of pull back the veil a little bit. Dan, how do you build an agent that doesn't hallucinate?

4:15 – Dan Klein

Yeah, I think if you go back three years to when we were starting Scaled Cognition and starting to build APT-1, it was really clear that even though there has been this explosive growth in horizontal intelligence, there is a really big reliability gap between what you can do by just naively prompting an LLM and the bar you need in terms of accuracy to ship these sorts of enterprise applications. I mean, if you're going to quote somebody's bank account balance, you have to be right. And if you're going to be dealing with health care, or really, ultimately, agents – the thing that makes an agent is that it takes an action. And the thing about actions is they have consequences. And you can't really act unless you know you're right. So to us, that really said the central thing is reliability. Reliability is not something you can retrofit. And so we set out to architect a system which is fundamentally about reliability. In a nutshell, the way that works is that rather than just focusing on next-token prediction, in our system things like actions and information are first-class objects. That lets us build a system where you can actually make guarantees about what it will do and what it will not do, and eliminate big classes of hallucinations in a provable, guaranteed way.
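
To make the idea of "actions and information as first-class objects" a bit more concrete, here is a deliberately simplified, hypothetical sketch – it is not APT-1 or Scaled Cognition's actual architecture, and every name in it (Action, ALLOWED_ACTIONS, execute, get_balance, issue_refund) is invented for illustration. The point is that the agent's permissible actions are declared up front as typed, validated objects, so anything outside that set cannot be executed at all, rather than merely being discouraged by a prompt:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass(frozen=True)
class Action:
    """A proposed action is structured data, not free-form text."""
    name: str
    args: Dict[str, Any]

# Hypothetical registry: each permitted action pairs with a validator over its arguments.
ALLOWED_ACTIONS: Dict[str, Callable[[Dict[str, Any]], bool]] = {
    "get_balance": lambda a: isinstance(a.get("account_id"), str),
    "issue_refund": lambda a: isinstance(a.get("amount"), (int, float)) and 0 < a["amount"] <= 100,
}

def execute(action: Action, handlers: Dict[str, Callable[..., Any]]) -> Any:
    """Run an action only if it is registered and its arguments validate; otherwise refuse."""
    validator = ALLOWED_ACTIONS.get(action.name)
    if validator is None or not validator(action.args):
        # Out-of-policy or malformed action: refuse instead of guessing.
        raise PermissionError(f"action {action.name!r} rejected: outside the declared policy")
    return handlers[action.name](**action.args)

if __name__ == "__main__":
    # The model can only ever *propose* an Action; execution is gated by the registry.
    handlers = {"get_balance": lambda account_id: 42.00}
    print(execute(Action("get_balance", {"account_id": "acct-123"}), handlers))
```

In this toy version the default is inverted: instead of hoping a prompt keeps the model inside policy, the only executable surface is the declared action set, which is what makes guarantees about "what it will and won't do" possible.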

5:28 – Vinod Khosla

The simple way I would explain this is: the way to quickly get into the market is to take OpenAI, or Claude, or Gemini, and put a layer on top. What Scaled Cognition did is some phenomenal research to develop a different approach from LLMs, and then combined it with the best of LLMs. So it took more research, more developmental risk, and most people are too lazy to do that.

5:57 – Anita Ramaswamy

So I guess there seems like a bit of a tension here between the idea of building an agent that is super compliant, can follow all the appropriate instructions, but is also flexible enough to flow with the needs of the customer. How do you balance those things, Dan?

6:12 – Dan Klein

Yeah, I think I could maybe start with some anti-patterns. Here's what you don't do. One of the things you see a lot of people doing – because these systems, standard LLMs, do make things up, they make things up all the time, they go rogue – is they kind of squeeze down the LLM until it's only making the tiniest decisions in basically a straitjacket, and then you more or less have an IVR-like experience. It's terrible. That's one way people try to get reliability. Another way people try to get reliability is you have one noisy model, so you bring in a second noisy model to check it. And as the joke goes, now you have two problems. This isn't a path to reliability. It's a path to increased latency and increased token consumption. But it's not how you make a reliable system. The way we make reliable systems is we architect for that in the first place. And we build models where the priority – in the architecture of the model, the entire training of it, the data we generate, and the structures it's deployed into – is reliability, and that's a complement to the kinds of intelligence you need to understand human language and carry on that part of a conversation in a flexible way. So the answer is you need both, and you can get both if you architect for it.

7:19 – Vinod Khosla

I think you need to know when to use one model. You don't just use one model. Sometimes you use the general models, like OpenAI or Anthropic or something, and you get the flexibility and the breadth. But when you need a precise answer – your bank balance – you use a different model. And so depending upon the task at hand, you have to dynamically configure your answer to meet what it needs to meet. A customer support agent trying to make you feel heard – I don't need a rigorous model that just does that. I'm trying to be empathetic because you're upset. That's a different model than if I'm trying to give you your bank balance or how high your cholesterol is.

8:01 – Anita Ramaswamy

One challenge I think a lot of startups in AI are grappling with is the shift from models that are just purely predictive to ones that can reason and show their work. How are you approaching that, Dan?

8:24 – Dan Klein

Yeah, it's a great point. I think one of the things you need for an enterprise-grade solution is reliability, but also adherence to policy and auditability – having a system where you can figure out what it did, why it did it, and be able to guarantee boundaries on what it will and won't do. That's really what we're optimizing for. And I think that's really important, because a lot of the work that's out there about explainability is post hoc. A system does something and another model tries to invent a reason why that might have happened. I think it's much more important to have a system which is self-auditing and operates in a verifiable context – that's how our system works. It tracks the information it traffics in, where that information goes, what actions are taken, and why they were taken. So it's sort of self-auditing.

9:09 – Vinod Khosla

And I want to emphasize something Dan has mentioned, but that doesn't stand out. When you're doing customer support, especially in the agentic world, which we are all getting to, you have to take actions. And you can't take actions unless you're sure. It's almost an oxymoron to say you're an agent and you can't do anything. You can search information, find text, but you can't be sure enough to actually take an action.

9:44 – Anita Ramaswamy

Does that mean that sometimes there's a case where the agent may not act because it's not sure?

9:50 – Vinod Khosla

Well, I think you should think about it like: can it behave like a human? Most of the time, a customer support human agent can actually take an action. And that brings up another topic. How fast can you get these systems up and running? But back to this idea – if a human can do the action, then your customer support agent should be able to do the action. And most such agents in the market today cannot take that action because they are not sure. They're hallucinating.

10:46 – Anita Ramaswamy

So I do want to ask about the pace, then. How fast can you get these systems up and running?

10:51 – Dan Klein

That's a great question. And I think that's actually been a hard thing for the industry. Because one of the things I think we all know in this era of large models is that spinning up a prototype and making a demo is really easy, but getting something to be a product that has that level of reliability is hard and sometimes impossible. You get to the point where you just have no more ways to tweak the prompt. And the prompt-and-pray approach really doesn't scale. So one of the things that's really important is not just the model itself but also the technologies that surround it. For example, one of the technologies we have is the ability to do very sophisticated simulations and evaluations to make sure you know that the system is going to perform and cover all of the use cases you want. But something that I find really exciting and that we're actually announcing today is a new technology of ours called Agent Twin, where we take transcripts of human agents – human customer service agents – and the system will take that as input and then within a day give you back an APT-1-based agent, which will do those behaviors and cover those cases in an automated way. That's super exciting, because as anybody who's tried to work with these things knows, it can be a very long process trying to figure out what your policies even are. And here you have agents already following them. This is really exciting for our customers, because it gets them up and running same-day with agents that don't hallucinate.

12:16 – Anita Ramaswamy

What would you say has been the most challenging technical aspect of building this product out?

12:24 – Dan Klein

I think it really all comes down to what our focus is, which is on verifiability and reliability. We build models that have verifiable and reliable behavior. We have evaluations which are able to do that sort of verification. We have this new Agent Twin technology that is able to build a verifiable model that replicates those behaviors. And we really want to make sure – as Vinod was saying – that you have an architecture which is fundamentally about knowing and being sure and taking actions that are not based on hallucination. Any time you use a standard LLM that is not explicitly modeling actions in the right way, you're going to see hallucinations. And what I think we don't fully appreciate is that that is the tip of the iceberg. A standard LLM is trained to predict plausible output – it produces outputs that are indistinguishable from the truth, and that means most hallucinations pass for the truth. We often go into enterprises and they say, hey, we're talking to you because we can't get the hallucination rate low enough. And we say, what do you think that rate is? We run evaluations. And as Vinod said, it may be 5x what they thought. Why is that? Most hallucinations look like the truth until you really poke at them.

13:45 – Vinod Khosla

From a customer point of view, one thing matters: how fast can they evaluate things? What's time to value? If it's seven to nine months to get something up and running, it's a lot of upfront cost and you're delaying the benefit for a long time. So time to value matters. And if you have things like Agent Twin, you can give the system your customer support manual – like you would to a human agent you hired – but also show it hundreds of thousands of conversations your human agents have already had. It puts all that together and is up and running in a day or two, not a year or two. That matters to customers. Time to value is critical in AI systems, and Agent Twin gives you that. You can literally be up and running in days or weeks, not months or years.

14:51 – Anita Ramaswamy

I'm curious to know – and I think this is a bit of a classic question that gets asked in tech – where does the value accrue? What is the most valuable part of this stack? Is it really the model itself or is it the monitoring?

15:02 – Vinod Khosla

Oh, I'm glad you brought this up. One of the problems is – and I'm glad for people like OpenAI and others who have very high margins – that many of the things built on top get margin-squeezed. Look at somebody like Cursor moving to their own models so they can get some margin. Many of the voice applications, same thing. Scaled Cognition is one of the few companies in the AI world that can get 90%-plus margins today – not at some point in the future – because they have their own models. And then for flexibility, they'll use the general model when needed. But most of the time, it's a very cost-effective solution. So the company itself can have very high margins, and that's a nice thing to have.

16:03 – Anita Ramaswamy

So is that value coming from the model itself?

16:05 – Vinod Khosla

Here's the way to put it. If you rely on all the value the big models create, you're going to get a sliver. If you create real value – and this goes back to: did you create any technology yourself that's differentiated? – then you will get paid for it. So it's really a function of how much R&D a company does on things of lasting value that determines who gets value in the AI world.

16:35 – Anita Ramaswamy

To make that a little more specific to Scaled Cognition – is the valuable aspect the foundation model itself that you built, or is it the monitoring systems you put on top of it?

16:47 – Dan Klein

I mean, I think the answer is all of the above. Really everything – when reliability is a concern, it's very important that you have models that have the right reliability guarantee. But on top of that, technologies like our ability to do evaluation and simulation, our ability to spin up things with Agent Twin – and really the key idea is that with these sorts of reliable and verifiable technologies, we're able to do the kinds of things that you've seen happening for math and coding. We're able to do this for the setting of agentic systems and conversational systems. And that really unlocks a collection of technologies and a technology stack that's all focused on reliability.

17:29 – Anita Ramaswamy

So I'm curious – when you're talking to customers, are they more interested in a model that is limited in its capacity but never hallucinates, or one that might sometimes be wrong but shows its work?

17:40 – Vinod Khosla

You're defining the wrong trade-off. But I'll let Dan answer.

17:45 – Dan Klein

Yeah, I think it's not that trade-off. You need reliability, but reliability doesn't have to come at the cost of intelligence. As humans, sometimes we use our general intelligence and sometimes – when the stakes matter, when things need to be right – we go to a specialist. And models that specialize have performance characteristics that allow them to be very accurate, very reliable, and very intelligent at the same time.

18:15 – Anita Ramaswamy

We have 30 seconds. I just want to ask one last one. Vinod, what do you think is the biggest misconception that investors have about AI right now?

18:23 – Vinod Khosla

I think AI is much more capable, but most people don't know how to leverage it or use it. Ike Gibson at John Deere doesn't know how to use AI. So there's a massive gap between the capability of the models and what is being deployed or what's usable, because of the gap of the people implementing it. I think that's a huge gap.

18:48 – Anita Ramaswamy

Awesome. Well, thank you so much. Great conversation.

18:51 – Dan Klein / Vinod Khosla

Thank you. Thank you.


"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique. Duis cursus, mi quis viverra ornare, eros dolor interdum nulla, ut commodo diam libero vitae erat."

Daniel De Castro
Co-Founder & COO at X Company
Webflow logo