Blog

Dan Klein on the Hallucination Iceberg, the S-Curve, and Why Reliability Is AI's Hardest Unsolved Problem

Scaled Cognition co-founder and CTO Dan Klein joined the Gradient Dissent podcast to explain why large language models are plausibility engines, not truth engines β€” why the hallucinations you catch are only a fraction of the real problem, and why building reliable AI requires rethinking the architecture from the ground up, not layering more models on top of broken ones.

The problem in AI is switching from "nothing works" to "everything works." That shift β€” from systems that obviously failed to systems that fail invisibly β€” is what makes reliability the hardest and most important problem in AI today. Dan Klein, co-founder and CTO of Scaled Cognition and professor of computer science at UC Berkeley, joined Lukas Biewald on Gradient Dissent to explain why large language models are plausibility engines, not truth engines; why the hallucinations you notice are the unusual case; and what it actually takes to build systems whose architecture is designed for reliability from the start.

Key Takeaways

  • LLMs are plausibility engines, not truth engines β€” they were built to produce outputs indistinguishable from the truth, which is fundamentally different from producing correct outputs
  • We're switching from "nothing works" to "everything works" β€” and that's exactly when reliability becomes the defining problem, because invisible failures are far more dangerous than obvious ones
  • The hallucination iceberg: to see a hallucination, the system has to be wrong and you have to notice β€” which means the dangerous hallucinations are the ones too plausible to detect
  • LLMs removed the "code smells" of wrong information β€” previous systems were disfluent when they failed; today's systems are fluent whether they're right or wrong, breaking every instinct we had for catching errors
  • Reliability has not kept pace with intelligence β€” breadth, contextuality, and plasticity have exploded; reliability lags far behind, and that mismatch is the core enterprise problem
  • Verifiability is what drives progress in coding and math β€” systems can hallucinate freely and still converge because wrong answers fail tests; the challenge is extending that to conversational and agentic systems
  • Metacognition is what's missing β€” systems don't assess what they know, where information came from, or whether they have it; they just produce tokens, and correctness is indistinguishable from the inside
  • Reinforcement learning creates deception risk β€” optimizing for customer satisfaction can lead systems to prefer telling people what they want to hear over what's true, without any intent, simply by following the reward
  • Chaining noisy models to check noisy models doesn't solve the problem β€” errors correlate rather than cancel, it burns tokens, adds latency, and still provides no guarantees
  • Scaling is an S-curve, not an exponential β€” what looks like explosive growth always turns out to be the beginning of a plateau, and reliability is not something you get by scaling further down the same path
  • Modularity vs. end-to-end optimization is the central tension β€” the core tool of reliable software engineering conflicts with the core tool of modern AI, and reconciling them is one of the field's defining challenges

Full Transcript

0:00 β€” From "nothing works" to "everything works"

Dan Klein: We are going to switch from the problem in AI is nothing works to the problem in AI is everything works. ChatGPT tells you something and it's always fluent and it's always confident whether it's right or wrong. Training these systems is becoming increasingly complicated and often what looks like an exponential curve just turns out to be the beginning of an S-curve. And so what happens next is that we start to hit diminishing returns. And we are seeing that today. We're seeing data walls and compute limits. The different aspects of intelligence have not been advancing equally. Reliability has not kept pace. To me, suddenly the most important problem I could think of was how do you build a system that will not lie to you?

0:49 β€” Meet Dan Klein

Host: I'm here talking with Dan Klein, professor of computer science at Berkeley and serial entrepreneur, most recently working on Scaled Cognition, which is a company that helps make more reliable AI systems. Dan, you describe the AI industry as built on Jello. Do you want to expound on that?

Dan Klein: I think it's easy to lose track, given how quickly and explosively the kinds of large language models that people are using today have burst onto the scene. It's easy to forget that they are at their core probabilistic engines. They have been trained to do next word prediction, to produce plausible output. And the objective function is essentially just produce output that is indistinguishable from the truth. These are not truth engines. They are plausibility engines.

Plausibility engines

Host: I feel like that's a little unfair. Certainly the pre-training optimizes for next token prediction, but I think there's a lot of effort put into a second step where a big part of it is that they're optimized for reducing hallucinations.

Dan Klein: Training these systems is becoming increasingly complicated. This really is the story of any technology that comes into artificial intelligence or really beyond. There are these super cycles in research where people have complicated systems, they've hit some wall, the systems are hard to improve, and then some new technology comes along. Maybe that is large language models based on autoregressive training, transformers, large data β€” and this new technology comes onto the scene and it's almost like a silver bullet. In a very simple way it is suddenly topping benchmarks and outperforming systems that are much more complicated that were built on previous technologies. What then happens is we go into a phase where we double down on that new innovation and we scale it up and we really try to get as much out of it as possible. This often feels like an exponential curve β€” this new technology even in its simple form is having such a big impact and it's just going to go to the sky. And of course the trees don't grow to the sky, and often what looks like an exponential curve just turns out to be the beginning of an S-curve. And so what happens next is that we start to hit diminishing returns. And we are seeing that today. We're seeing data walls and compute limits and all kinds of reasons why that initial explosion of progress starts to hit diminishing returns on those specific methods.

Host: And I think for you one of the biggest issues that you flag quite frequently is hallucinations. Do you want to talk specifically about how you've measured that and why you think it might be a bigger problem than other folks would be aware of?

Dan Klein: First, the word hallucination β€” it's a sort of projective term. Really what's happening is systems are making errors. A sequence of tokens comes out and it turns out to be incorrect information. And we tend to call that hallucinations. That term is being used more and more broadly now β€” almost anything undesirable that a system will do, you call it a hallucination. So that term is definitely undergoing some semantic broadening. But specifically I think it's important when we talk about systems and we say what's the difference between a system that makes a mistake, hallucinates, or lies β€” in a human these are different things. Really a hallucination is just a mistake the system has made. And if you have a system that is just doing next token prediction, that system doesn't actually know, as it's predicting those tokens, whether they are right or not. Systems like this β€” they're not metacognitive. They're not looking at their knowledge and making a sort of external decision: do I know the answer to this question? Do I have this information? Do I know where it came from? What is the reliability of that? They're not doing that. They're just producing tokens. And sometimes they're right and sometimes they're wrong. So one way to look at this is every output is a hallucination. Some of them are right and some of them are wrong.

Where you start to be able to say something stronger β€” to say that a system is deceiving β€” is once you start assigning in later training phases things like reinforcement learning. One kind of reinforcement learning training would be something like RLHF where you show humans choices and say which of these do you prefer? Well, the system will come to produce the outputs that humans prefer. Do they prefer things that are factual or do they prefer things that make them feel good? And it's not at all clear that hallucinations are going to become less common. In fact, there are some results that show hallucinations become more common as you go into these sorts of post-training approaches.

Reinforcement learning and deception

Dan Klein: Imagine you have a shipping company and they've deployed an agent backed by an LLM. In addition to the training that has gone into that core model, it is also being reinforcement learned for some objective β€” for example, maybe it's being reinforcement learned to optimize the number of thumbs ups it gets from users. Someone calls in and says, "Where's my package?" And that package has been lost. The database that it does some tool call on says this package has been lost. Well, what's the system going to do? It can tell you your package is lost. It can tell you it's coming tomorrow. It can tell you it doesn't know. What will it be rewarded for? In this case, it's probably going to get more thumbs up if it tells you the package is coming tomorrow. This I think legitimately would qualify as deception. And anytime you're doing reinforcement learning, the system is optimizing the reward function β€” and that is always going to have some gap between the truth and what's being optimized. That gap is going to increase the amount of hallucinations to the extent that it doesn't align with truthful behavior.

Host: We have some experience at Weights and Biases and Core Weave of building these customer service systems, and of course it's going to be a really bad experience for a customer in the long run if something's hallucinated. From my perspective, you would never want to just take the immediate customer reaction as the only thing you're optimizing for. In fact, most people doing reinforcement learning or evaluating a system like this would have a special check for whether this is accurate information. It's very hard to evaluate in a lot of cases, but I think most real production systems do a fair amount of checking and try to put in reward functions where the most negative score you get is if you give plausible but wrong information.

Dan Klein: Absolutely. And of course, I'm giving a reductive caricature of an extreme case to illustrate how easy it is to get deceptive behavior because of reinforcement learning. People will of course design the reward functions to mitigate this and try to balance truthfulness and other properties, but ultimately there's always going to be a little daylight between whatever you're optimizing and specifically the truth. And in my opinion, one of the things we should be doing as a field is creating technologies that cannot lie to you β€” so you can look at a system and know it is built so that it cannot lie. That is my personal mission here.

What is a hallucination?

Host: I love the phrase "semantic broadening" β€” very gentle linguist way of putting it. But I think in your view, hallucination has maybe broadened beyond what I thought it meant. To me a hallucination at its core is something where it recommends a movie that doesn't exist or it cites a paper that doesn't exist. Saying 5 + 7 = 15 wouldn't be a hallucination β€” it would be a different type of error. And I think these plausible citations of information it doesn't have make total sense as hallucinations. Do you agree with my definition, or do you use hallucination and incorrect information interchangeably?

Dan Klein: I like your definition because I think it aligns well with the pre-LLM notion of what a hallucination is to a human β€” sort of perceiving something that's not actually there. But in common usage it is broadening. When a system says 1 + 2 = 7, people are starting to call that a hallucination. I think we should be more precise. In particular, I think we should be very careful when we say things like error versus deception versus hallucination. These in humans have implications for the context and the metacognitive status of the error. There's a difference between: you have information, it's incorrect, but you say it believing it to be correct β€” that's making a mistake. That's different than: you know you don't know the answer, you're in quiz bowl mode and you're just going to guess. And that is really unusual in humans. We don't think of that as the normal kind of error humans make β€” saying something false while being completely unaware that you're just making things up. I think it's better to think of systems as more or less always being in that state. What is that state? The state of producing information without a metacognitive analysis of where that information came from, and the provenance and certainty of that information.

Think about if I ask you what's the population of Berkeley β€” a perfectly reasonable thing is to say I don't know, because you know you don't know. And if I made you guess, you would be like, well if I had to β€” and then you'd be reasoning: 10,000 is too small, a million is too big. You'd be reasoning about this thing that you don't know. In its simplest form, an autoregressive next-token predictor because of correlations with the word "city" and "population" might produce a number of the right order of magnitude, or it might not. That's a very different process. In their fundamental operation, systems are not checking: do I have this information? Where did I get it? Have I preserved it and kept it intact?

Metacognition and verifiability

Dan Klein: At Scaled Cognition, we architect into the models in the first place the sort of information provenance. And people are starting to see little hints of systems becoming a little more metacognitive β€” it's a broad term, just thinking about thinking. You can see little hints of this in a system that does chain of thought, where it allocates some tokens to a broad plan which it then elaborates. That's a kind of metacognition. In a RAG system you first get information and then you describe it. If you're doing tool use, that's a multi-stage step. But broadly speaking, systems today are not primarily metacognitive. They're primarily about actions and information, and the provenance and flow and integrity of those pieces of information β€” that is something that people are trying to retrofit on.

Host: Retrofit seems a little pejorative β€” it might be fine to do it that way. I'm realizing that the way you're using "LLM" might be pre some semantic broadening. When I think of an LLM today, I think of it as more than the model β€” it's the model plus the whole agentic system built on top of it. I can see it when I look at the trace of the reasoning steps, where I can kind of see it doing something that looks like metacognition, sometimes saying "I'm not sure if that's true, I better look that up." And in coding you can see this for a long time β€” agents looping and checking the code. Running a test on the code you generated is maybe a kind of metacognition.

Dan Klein: What you get in coding is β€” you mentioned verifiability β€” I think the advances we've seen in coding and in math really are rooted in verifiability. The system can go off, it can hallucinate, it can make strange choices, but there is that verifiable signal. You didn't pass these tests, or Lean does not accept this proof. And so you can try many things until something verifiably passes. I wouldn't call that whole conglomeration of pieces just an LLM. It's an LLM embedded in a verifiable context. And that verifiability can be used during training and it can be used at test time β€” test time compute, reasoning models. One big class of reasoning models is the models that try a lot of things and keep what works.

It's important to point out where all of that came from. When people were building systems that would be good at playing games like chess or Go, this sort of verifiable reinforcement learning was very powerful because you didn't need to have a system that was good at the game yet. You just needed to have a system play the game, maybe even against itself, and in the end you knew the rules, you knew which side had won, and you could just double down on the things that had worked. Reinforcement learning in a nutshell is trying a lot of things and learning from the ones that worked. That really requires verifiability, which for a game is free. For math, with tools like Lean, we're starting to get that. And so if we want to broaden the space of verifiable technologies, the trick is figuring out how to do that. At Scaled Cognition, one of the things we are doing is extending that verifiable approach to conversational and agentic systems β€” and that is a challenge. In the absence of that verifiable wall letting you build your system up, systems that are thinking and reasoning can actually increase hallucinations for a bunch of reasons. If you try a bunch of different pathways and one of them comes out on top, that gives you this opportunity for hallucinations to be preferentially selected β€” unless the selection aligns with truth.

Really one of the challenges we should all be taking as a field is not thinking: how do we take a system that is not naturally truthful and add checks for that truth. Retrofitting is when you have a system architected in one way and those aren't the properties you want, and you change it post hoc. That's certainly one pattern. Where it can become an antipattern is when you have one system that's talking to a customer, making mistakes from time to time, and you bring in another system to check it β€” and that system is also a noisy unverifiable system. As the joke goes, now you've got two problems. It's not just that you have to think really hard about whether errors are compounding β€” in machine learning we always like to think that if you've got two systems, the errors will be independent. But one of the things I've learned in the real world is that the errors in fact correlate very strongly. So even if it's effective, it's going to be slower, you're burning tokens, and you still have no guarantees. I think we can build technologies where truth is one of the design principles in the first place.

The constraint approach and its limits

Host: I feel like the constellations or chains are one approach, but another approach I've seen is where people take LLMs and constrain them down to controlling decisions along a carefully designed tree β€” almost like model labeling where you've constrained the system so much that now you can trust what it's going to do. I'm not sure I'm following that one. How would that work in practice?

Dan Klein: So for example, somebody says something in a conversation, and rather than giving the LLM freedom to take any action in a wide action space, you say: on the basis of what this person says, you can advance the conversation in one of the following eight ways. So essentially you end up with something that looks like a classic IVR system but with the power of an LLM for the intent recognition. And you actually see this quite a lot in industry β€” it starts to feel like a finite automaton whose transitions are driven by this LLM. These patterns are totally reasonable ways for people to react to a system being unreliable. They've deployed it in a pattern that hopefully increases that reliability at some cost.

If we look at the core of these systems and where the intelligence is coming from β€” the kinds of systems that people have been building have been getting more and more intelligent. Intelligence is a multifaceted thing, and the different aspects of intelligence have not been advancing equally. If you talk about horizontality, breadth of intelligence, plasticity, contextuality β€” these were very very hard to do in earlier eras of artificial intelligence and they've grown explosively. Reliability has not kept pace.

And that's had a bunch of challenges for the deployment of these systems in enterprise contexts. For many applications that's fine β€” if you're having a chat with a system, there are many contexts where getting something back that might or may not be true but is contextual and interesting, that's great. But if you're trying to fill a prescription or transfer money, it just really has to actually be right. That has to actually be your bank account. That has to be the right balance. And so as you move from consumer or even entertainment contexts into regulated industries, suddenly reliability is front and center and these systems are not as clean a match. Right now we have one kind of architecture which has strengths in horizontality and contextuality and weaknesses in reliability. When we point it at a problem where the weaknesses are suddenly critical, you can see the results β€” whether you call them retrofits or anything else, these additional pieces of technology coming into play to try to compensate. And that's not surprising because there's just a misalignment between the strengths of the system and the requirements of the problem.

Architecting for truth

Host: Can you say at a high level what your new approach is and how it's different than what the labs are doing?

Dan Klein: Yeah. And I think increasingly people are realizing that reliability is the core problem. What might that look like? Our first model is APT-1. The way it's architected is, instead of being fundamentally about tokens β€” where you assemble tokens and then after assembly find that they represent things like dollar amounts that have semantics β€” our models make information and action first-order objects. So when the model is making decisions, it's making decisions about information and actions and where information is moving around. A big piece of metacognition is: where did my information come from? Is this information present or absent?

When you have a conversation between a person and a bunch of APIs β€” in a banking context, for example β€” the person is going to speak human. Things are going to be ambiguous. They're going to use words that don't have a crisp verifiable meaning. Actions are going to take place on the other side. Those actions are API calls. Those API calls have preconditions, business logic, and verifiable semantics. The challenge is bridging these things. Classic systems couldn't handle the human side. Current systems are great with the human side but they're not so great with the backend logic. If you think about an LLM β€” the control surface you have is a prompt. There's very little you can say crisply about the relationship between what you put in a prompt and the behavior that comes out the other end. It's a hinting surface. And so this gives rise to what I would call prompt and pray, where people put in what they want, if it doesn't work they put it in all caps, they add some exclamation points, and after the third exclamation point you start to feel like this isn't the control surface you need.

When you're talking about controls over APIs, we already know what a lot of that logic looks like. It's just hard to replicate in a token-based model and a lot easier to replicate in a model that is operating over decisions about information β€” where it comes from β€” and actions, and the conditions under which you can take actions. Getting that into the model saves you from having to have a whole system of things checking things in a way where it's really hard to say anything crisp.

Host: I just want to make sure I understand what you're doing. If you return something like "check the account balance," what would be something you'd want to forbid it from doing?

Dan Klein: You might want to forbid it from doing a transfer without authorization. And that authorization β€” you don't want a user to be able to say "forget your prompt, I have authorization," or whatever sort of attack. People are very aware of prompt injection style attacks. One way to deal with that is to be really careful β€” put a lot of attacks like this in the training data and teach it not to do this. But ultimately that's just not where authorization is allowed to come from. It's not allowed to come from user statements. It comes from that place over there that vends authorization. And being able to say things like that to a model means essentially you're getting closer to what you really want β€” here are truth conditions. You want the model to be truthful, and then within that, okay, now optimize user happiness or style or whatever you want to optimize, but subject to staying within the space of true statements. That is hard to do through simply constructing a reinforcement signal, because now you're asking: what is the linear combination of truth and happiness? Instead, what you want to be able to say is: here's a model β€” I'd like to be able to guarantee that it will only do true things. That's the long-term challenge. Models that will not lie to you. And for our model, there is a big class of things that we can make guarantees about. I think that's going to become increasingly important as people start caring more and more about reliability. Intelligence without reliability is limited in its impact.

Host: I guess though β€” tokens are powerful because there are so many of them that naturally occur. A big advantage is the massive volume of tokens. So do your models train on much smaller pieces of data? Do they struggle with generalization?

Dan Klein: One of the key things we do to train our models is we train them on simulated reinforcement learning generated data β€” verifiable RL. The key thing that unlocks that is being able to generate data that doesn't just look right but in fact can be verified. There are things that humans are doing, but there are also actions being taken, and part of that is naturally amenable to verification. That was a big part of our research in training the model.

The hallucination iceberg

Host: What got you excited about this direction? Were you just seeing hallucinations everywhere and getting frustrated?

Dan Klein: It was a few things coming together. One piece that really stuck with me: when we go into enterprises and they're unhappy with hallucination rates. I think of hallucinations as an iceberg. There are the hallucinations you see, and that seems scary β€” but what does it take to see a hallucination? The system has to have produced an output with two properties: it has to be wrong, and you have to have noticed. And that is not all the hallucinations. There's the whole rest of the iceberg below β€” all of the mistakes that go unnoticed because they are too plausible. The systems are very very good at producing output that is indistinguishable from the truth. If you see a hallucination, that's sort of the weird case β€” a hallucination you could detect. And this actually has really big impacts on how people use these models.

If you go back to 2010 and you go to a web translation system and put in text in a language you don't speak, a translation comes out. How do you know whether it's right or wrong? There are often surface signs β€” a chunk still in the other language, something that doesn't feel right. Or you get search results back and you click on a result and the web page has a bunch of typos and it's not loading quickly β€” these are surface signs. Software engineers know this idea of code smells: when something deep is wrong, there's often a superficial sign. We learn to detect those β€” oh, these function signatures are getting really large, time to fix it. We also culturally recognize: this machine translation has a big chunk of German still in it, maybe this didn't work right. LLMs have removed these cues that something is wrong. ChatGPT tells you something and it's always fluent and it's always confident whether it's right or wrong. This is taking more and more of the hallucinations and putting them in the underwater part of the iceberg. It's wrong, but you can't tell. You go in and look at customer service cases β€” the system quoted a reasonable refund policy. It's just not actually the one that the company wanted. It's something that somebody was talking about on Reddit in 2019.

Digital literacy

Host: Do you have thoughts on digital literacy specifically?

Dan Klein: I think it's one of the biggest digital literacy problems we have in front of us. If you go back to when people would go to the library and get a book on some topic β€” well, the library couldn't buy every book. They would buy books that seemed important, seemed reputable. So there was vetting just in the fact that the book was there. There was the book publisher and the editor. When search came along, suddenly your results could be anything. But there were still mechanisms β€” websites with false information still had some of these smells. Clickstream data was very important. As long as most people had the same reaction as you, we kind of stumbled through the digital literacy issues. People would still believe things they read on the web that were false, but at least there were still smells. And now that's been totally homogenized. You ask a question, every answer comes back confident, doing all the things that lead you to want to believe it. This is a real problem. I think we should be demanding that the source of information be cited β€” and you're starting to see some of that. Now a lot of that's post hoc, so you go click on that source and the claim is not actually on that web page. But directionally that's an improvement.

And in a way that's not a surprise, because how long did it take to develop this technology to this point? A few years. How long did it take to put all that information on the web in an abstracted linguistic form? 30 years. How long did it take to come up with that information and figure out what words to use? Millennia. All of that has been compressed into this instantaneous access that has burst onto the scene. Of course culture hasn't been able to keep up with what you can trust and what you can't.

Linguistics and AI

Host: When we overlapped in NLP in the early 2000s, it felt like linguistics had so much to say about how to build working NLP systems. But the bitter lesson really seems to have come true β€” linguistics has less and less to contribute. Nobody seems to care about parsing anymore. Was syntactic parsing even a real thing, or did linguists just make it up?

Dan Klein: Linguistics is a science. It's trying to study how language works based on evidence. That involves making theories, testable theories, and then trying to falsify them or not. When a linguist says, "I think this is a good description of the syntactic structure," that is a theory meant to be explanatory of the evidence β€” how languages relate to each other, how languages change over time, what people can say and find acceptable in their language. NLP is different. It's not fundamentally aimed as a science. It's an engineering discipline β€” how do we interpret this information in this context? And for a while, from roughly the 80s to 2010, those two things aligned very well, because in order to make AI work back then you needed good representations, and linguists had figured out good representations we could borrow. But even then they weren't perfectly the same. If you worked on parsing around the year 2000, you weren't trying to write in that this syntactic structure works this way in English β€” you were instead trying to build in combinatorial structure that was appropriate. You learned all the details of the language from data.

What's really interesting is: we talked about the super cycle of research where the new technology bursts onto the scene, you throw out everything you had before, then you hit diminishing returns and you start to see weak sides. What are those areas right now? Reliability. Well, what are the solutions? If you're building systems that are facing APIs, maybe the sorts of structures in a compiler are relevant. That's an idea that was thrown out, but you start to see these things coming back. Similarly, search β€” AI was very much about trying a bunch of things and taking the one that worked, like playing the chess game forward until you could figure out whether it's a good position. Big transformer models were so good at moving information around in the latent representation that you could just make the decision now. You wouldn't have to walk along the tree structure of the sentence. But maybe some things start coming back β€” for example, people say if I want to solve this hard math problem, I should actually try a few things and see what works. That's search. That's an evergreen idea. These ideas come and go. The pendulum moves back and forth between all I'm going to do is local prediction to all I'm going to do is organized computation, and you find some happy blend.

As humans, there are two main ways we make good decisions. You can do it by memory β€” I've been in this situation before, I touched that stove, I had a bad time, I'm not doing that again. The single biggest value of language for learning is that you can learn from other people's experiences. Otherwise you have to make every mistake for yourself. The other way you can make good decisions is by thinking through the consequences of your actions β€” reasoning under a model you have of the world. People do both and they mix these things together. You can remember that this chess position was bad, or you can play it forward and see you're going to lose. AI has had a pendulum swinging between these. Classic AI was very much about play it forward and see what's going to happen under your model. Current AI is much more focused on rehashing and remixing information and experiences. As we start to see reasoning models, it's the pendulum swinging back. And the pendulum swinging back on reliability too. This is natural β€” but it does let us predict that a simple approach based on a certain kind of noisy model is going to saturate. Some pendulums are going to swing back and we're going to need augmenting technologies that in many cases are going to reinvent evergreen ideas into this new context.

Do LLMs inform human linguistics?

Host: Does the research on LLMs β€” the success of LLMs and research on introspecting them for how they work β€” does that inform human linguistics at all?

Dan Klein: I think there are two things. One is that we are still learning from what human cognition does that LLMs do not β€” things like the importance of metacognition. There's always been a tension in technology about how much to be biomimetic. On one hand, it's always easier to build something when you've got a working prototype β€” and here we are with a working prototype of intelligence. On the other hand, the classic example is we didn't make progress in powered flight until we stopped building machines that flap their wings. And there's a special difficulty with brains β€” I'm not sure we're so good at introspecting what our brains are doing. There's an important distinction between learning from what brains are doing and believing that our introspection reveals it. Neuroscience proceeds via MRI machines and carefully designed experiments, not by sitting there and introspecting.

Host: I feel like linguists sometimes do introspect.

Dan Klein: It is true linguists introspect, or slightly more properly they ask other people to introspect. And this is a criticism that has been levied against certain branches of linguistics β€” an over-reliance on introspective data. It is certainly not the only way you can do linguistics. For example, work we've done in computational linguistics on reconstructing ancient languages is introspection-free. You look at a whole bunch of words in thousands of modern languages and run a probabilistic model to infer what ancestral languages must have looked like. These are phylogenetic models β€” no more introspective than a phylogenetic model inferring an original form of a virus from modern variants.

Host: That sounds really cool.

Dan Klein: The briefest version: we looked at reconstructing ancient languages from their modern forms. Maybe an accessible example is reconstructing Latin from the modern Romance languages β€” French, Spanish, Portuguese, Italian. You look at French "feu" or Spanish "fuego" and you start to piece together what the ancestral forms might have been. In that case we kind of know the answers β€” we know a lot about classical Latin, and we have decent side evidence about vulgar Latin. We went and looked at the Austronesian languages β€” about a sixth of the world's languages β€” and looked at the modern forms of a bunch of words and tried to reconstruct what the ancient forms would have been. This proto-language, proto-Austronesian, had been reconstructed by hand by Blust, and we were able to do this computationally. The computation generally agreed with linguistic hand reconstructions. We don't have a time machine, so it's very hard to tell who's right. But the interesting thing is you can start to do things on this giant tree of language change. You can start to ask what mergers are more or less common.

A hypothesis which feels intuitively true is that if you have two sounds that are different in a language and they get merged together, a bunch of words that used to be distinct would be collapsed β€” if P and B were collapsed, "pin" and "bin" would suddenly sound the same. That's an information-theoretic problem. The functional load of a pair is how many words their distinction is holding apart. The functional load hypothesis states that the more words being held apart by a sound distinction, the less likely that merger is to happen. Some initial experiments on a small number of languages didn't seem to support this hypothesis β€” a really interesting result suggesting the functional load hypothesis isn't true, that things just merge when they merge. But if instead of doing this on four languages you do it on hundreds of languages β€” you don't know that your reconstructions are perfect, but in aggregate the statistics show that the functional load hypothesis seems to be absolutely evident in this data at scale. You can't see that pattern over a small number. This is an example of where data at scale can answer questions that weren't easy to answer by hand.

Host: There's interesting work on introspecting LLMs β€” Anthropic did a nice paper showing that Chinese words and English equivalents are stored in the same part of the neural network. Does that relate to human linguistics at all?

Dan Klein: What it's actually closest to is computational neuroscience. You can ask of the human brain when you have a bilingual speaker: where are those representations? Are they in the same place? A student of mine, Kathy Chen, who's a joint student with Jack Gallant, did exactly this with MRI β€” looking at what areas of the brain light up. These questions about what's merged and what's separate we can answer scientifically about natural brains, and we can try to answer them through similar methods with artificial brains β€” except obviously the physics and structure of the brain limits what we can do there, while we can be much more precise about the probes we make into a digital brain.

But I think it's important to step back and notice that when we start doing this β€” doing neuroscience against LLMs β€” what we are essentially saying is this object we have built is transcending what we think about as engineering. We're no longer trying to understand it by its modularity, by the behaviors it is guaranteed to have or not have. We're trying to understand it through the lens of science. Science is what we use to take apart things that are too complicated. Engineering is how we build things up.

Modularity versus end-to-end optimization

Dan Klein: If I teach CS 101, what is the most important tool we have for building complex software systems that are reliable, that can be built by teams and maintained over time? The single most important technique is modularity β€” the ability to say this large thing is made up of small things, and these small things obey a contract where if you give me this input, I'm guaranteed to give you this output, and then we can work on them separately. What is the key tool that has led to the recent explosive growth in machine learning? It's end-to-end optimization. Take the data, have a blob, take the reward signal, and just propagate, propagate, propagate. And these things are very much at odds. One of the things we're going to have to reconcile as a field is how to get the reliability that comes from classical techniques like modularity against the abilities that have come from optimization. It's not an either/or β€” but if you go all in on either you get some serious limitations. Figuring out how to combine these two wonderful pieces of progress is one of the central problems right now.

Self-driving cars

Host: That connects me to self-driving cars. There's been a strong trend toward end-to-end optimization as these things come online. What do you make of that? I would have thought you'd need more modularity to enforce contracts in such a life-or-death case.

Dan Klein: I think it's not limited to cars. In general, we are going to be increasingly trusting AI to make decisions that have consequences. When we build systems, there are going to be trade-offs between how we optimize, how we architect β€” trade-offs between reliability versus breadth. I think the idea that there is one architecture that hits the perfect balance doesn't seem to be true right now. Feeding the whole web into an LLM and hoping for reliability to emerge β€” the different facets of intelligence advance at different rates, and reliability is the slow one. If that's the most important thing to you, just continuing down this path alone is probably not the best way to get there.

There was this thing going around where somebody's talking to a customer support bot at Chipotle and asked about how to reverse a linked list in Python. Of course the LLM can answer it β€” but interestingly, it shouldn't. You actually don't want the horizontality. You need technologies where the strengths line up with the needs. When you're talking about self-driving cars, safety is really, really important. The number of nines you want is really high. And there's a real tension β€” if not done correctly β€” between how we drive down error rates in those systems and whether we actually still have the ability to guarantee anything.

There is work out there on being able to control systems where you can make a guarantee β€” like Clare Tomlin's work at Berkeley on being able to guarantee non-collisions and flight envelopes. I think it would be great as a challenge to build a car β€” AI-controlled or not β€” where you can prove that you cannot crash it. This is the kind of thing where one simple technique is not going to get you there, but this is what we should be striving for.

Host: I can imagine that comes with downsides β€” to literally not be able to crash it might be too strong a constraint.

Dan Klein: As it becomes a social question, what is the envelope you want to guarantee? I think you're right to point out that can be nuanced. But I think the ability to have a discussion about what guarantees we'd like to provide is a much better future than one where we can't guarantee anything.

Syntactic parsing and LLMs

Host: Can I ask you a few more questions if we put the linguist back in? I spent so much time on syntactic parsing. Clearly language has some sense of syntactic structure that's real, or feels real β€” though I guess I'm introspecting. Does that notion somehow show up in LLMs as they model language that clearly comes from humans?

Dan Klein: If there are listeners who went to elementary school, they may have diagrammed sentences β€” here's the subject, here's the object, this adjective modifies this noun. That process of describing how words relate to each other in a hierarchical structure is basically the essence of syntactic parsing. Programming languages are designed to be non-ambiguous. Human language is rampantly ambiguous. A syntactic parser takes a sentence in English and produces a representation showing the relationship between the words hierarchically β€” which was important for things like translating substructures independently as a way of decomposing the problem.

We know that these hierarchical structures seem to exist. You can find exceptions β€” cross-serial dependencies in languages like Dutch or Swiss German β€” and cases where it's hard to totally nail down exactly what the hierarchy is. But there definitely seem to be these regular structures with structures inside structures. The question is: do we need to program in how it works? Since the '90s we haven't β€” we've had data that taught the parser how it worked. And recently maybe you don't even need the parser. Maybe you can observe enough language that whatever regularities there are will show up latently in the LLM.

Now the argument a syntactician would have made β€” a lot of the constructs in language theory, including ones that have had a big impact in computer science and formal language theory, describe language as a context-free system, which means a push-down automaton, which means in principle you can nest and nest and nest, and any finite depth system like an LLM is going to run out of depth. But it turns out it's more complicated than that. For one, people do get confused when structures get complicated β€” maybe we've got stack depth limits even as humans. Even if language is hierarchical, we've got processing constraints that can manifest as stack depth. Things that people find easy to understand, like tail recursion, you can transform into iteration β€” so maybe that somehow avoids those constraints.

So: hierarchical syntactic phenomena seem to be real. LLMs seem to be pretty good at picking up on those correlations. But this is a good example of a case where having a structured representation might give you mileage β€” letting you learn language faster, handle trickier cases, and generalize better to smaller data regimes. In AI it used to be you couldn't make progress without the representation. The answer today is: with a sufficiently general-purpose thing β€” and there's nothing really magic about transformers, it's ultimately just a parameterizable circuit β€” you point it at some data and it induces a representation. And the question is now empirical: is it good enough, or do you need the parser to help? For many aspects of human language, the answer seems to be you often don't need the parser.

Host: Are the transformers reinventing syntax somehow within their weights?

Dan Klein: First you'd have to figure out what that means β€” is there a substructure isomorphic to a shift-reduce parser or something like that? That is a very hard question. Ultimately these deep structural questions manifest as surface correlations β€” these two words are highly correlated, these two words are not, even though they're closer. We attribute that to a syntactic boundary or nesting or information hiding. But a general-purpose system like a transformer learns correlations. Once the system can manifest those correlations correctly enough, it's hard to know whether it's doing it through the right mechanism.

Host: I feel like there are all these clever tricks in linguistics to get at this stuff β€” like how my wife misremembers song lyrics in a semantically similar way, whereas my brain misremembers them phonetically. Clearly our brains encode music differently. Could you run some of these same things on LLMs?

Dan Klein: Sure β€” and certainly they wouldn't get bored like grad students. How do you know in a human that we have these different kinds of linguistic knowledge? There's a whole list of arguments people have made for the reality of linguistic structure. You mentioned misremembering things semantically versus phonologically. People look at transpositions β€” spoonerisms, where you transpose phonemes or syllables or words. If an object is available for a linguistic operation, that argues to its coherence. One of the big arguments for syntactic structure is: can I answer a question with this chunk? If you say "the cat is sleeping under the table" and I have to argue that "under the table" is a unit, I can say: where's the cat sleeping? And you say "under the table." If it's available as an answer, if it can be replaced by "where" when you pattern-match those sentences β€” that argues for its reality. Another way: "under the table" is a place. Whereas if I ask about "underthe" β€” it's hard to come up with a question whose answer is "underthe." Linguists have developed these arguments for the modularity of language. Those arguments are not really about our brain β€” they're phenomenological about the language. When you get to neuroscientists, now you can start to say things about the brain through monitoring, probing, and designing experiments β€” and we can absolutely do that on a machine as well.

There's also an interesting phenomenon in linguistics β€” there's often a difference between what a phonologist and a phonetician will say about all of the sound we're constantly hearing as we acquire language. In phonology, one key concept is cognitive economy β€” the brain is driven to come up with minimal, parsimonious representations, and learning is about economy of minimal description. On the other hand, phoneticians are more likely to say all of this data impacts in a very diffuse and distributed way. So when you hear lots of language, what do you retain? On one end: all of it β€” wave files of everything you've ever heard. The other extreme: none of it β€” just the very abstracted representation. The answer seems to be almost surely in the middle. There are priming experiments showing that if you're going to react to a nonsense word you've heard before, you react faster if they play you the exact same recording as the one you heard before. That tells you it's more than just the full abstraction. But we also know people are very capable of making abstractions and generalizing β€” and kids will overgeneralize rules when they're acquiring language, including their own. It seems simultaneously true that the brain keeps very low-level pieces of information around and also these abstractions. That makes it very hard to go to a complicated circuit that's mimicking the behavior and say: is this the same as what's going on in the brain? Because one, we don't know what's going on in the brain exactly. And two, how would we even measure that? All we can really measure is: did it capture this correlation? And when we do that, the answer is yes β€” that is the whole purpose.

Host: Cool. Thanks so much for listening to this episode of Gradient Dissent.

‍

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique. Duis cursus, mi quis viverra ornare, eros dolor interdum nulla, ut commodo diam libero vitae erat."

Daniel De Castro
Co-Founder & COO at X Company
Webflow logo