Rules for Understanding Language Models

Summary

Naomi Saphra discusses 5 rules governing language model behavior, breaking down why LLMs act like populations rather than individuals. She explains how tokenization creates strange semantic blind spots and highlights the mechanics of sycophancy, showing how models leverage subtle data associations to match user biases and demographics - even guessing political views based on favorite sports teams.

Bio

Naomi Saphra is a Kempner Research Fellow at Harvard University and an incoming Assistant Professor at Boston University’s Faculty of Computing and Data Science, starting in 2026. She has worked at Google, Meta, and New York University and has consulted at several startups. She was honored as a Rising Star in EECS by MIT and was awarded Google Europe’s Scholarship for Students With Disabilities.

About the conference

QCon AI is a practitioner-led event focused entirely on the engineering discipline required to scale these workloads safely. It provides direct access to the architectural playbooks and failure metrics that peer organizations use in production.

Transcript

Naomi Saphra: I'm going to just start with a short vignette. I have a student named Alex, and Alex got an A+ on the calculus final. Good job, Alex. Is Alex good at calculus now? Alex was good at calculus when we knew he got an A+. Is Alex still good at calculus if we know that I'm reusing the same final every year? Maybe I'm not going to trust this evaluation so much. I'm not that lazy, but I can use any question from the last 50 years on many finals. Alex has seen all of these finals, has studied from them. Now is Alex good at calculus? Actually, yes. It's a lot easier for Alex to learn calculus than to memorize 50 years of calculus finals.

Rule 1: An LM Memorizes When It Can

That is not true of a language model. Language models will memorize whenever they can, and it's a lot easier for them to memorize 50 years' worth of calculus exams than it is for them to learn calculus. They will take the easy road. If we just accept that language models will memorize whatever they can, that gives us our first rule, our little warm-up rule. Whenever a model gets something right, that doesn't mean that it knows the concepts that are required to get it right. We need to generalize. That means that the model needs diverse training data that is going to make it challenging to succeed by memorizing all of these different things. We know that this means you have to test on a withheld dataset. It's not enough with these modern models to just say it hasn't seen literally this exact subset, because they are very good at memorizing and handling many things, and we have to now start thinking about what counts as an unseen example to a very powerful model.

One way to think about what counts as unseen is to say, could this language model literally generate verbatim exactly what I'm handing it? One way that it might be doing that is, again, just exact rote memorization. It's easy for a language model to regenerate any Bible quote you give it, because it has seen every single Bible quote so many times. There are other situations where it's going to be able to generate things verbatim, even if they didn't show up in the test set. For instance, this one, even though the Pythia models have only seen this a couple of times in training, one or two times, they are always able to complete it. Why is that? Because it's literally just counting up, 28, 29, 30, 4, 5. It's just counting. What does that tell us? It tells us that the model does know how to generalize counting.
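One way to make the "could the model regenerate this verbatim?" check concrete is sketched below in Python. This is a minimal illustration, not the procedure used in the talk: `generate` is an assumed callable that continues a prefix, and `make_memorizer` is a hypothetical stand-in model that can only continue text it has stored. The example strings are illustrative.

```python
def is_reproduced_verbatim(generate, text, prefix_len):
    """Return True if the model regenerates the rest of `text` exactly from its prefix."""
    prefix, suffix = text[:prefix_len], text[prefix_len:]
    return generate(prefix, len(suffix)) == suffix


def make_memorizer(corpus):
    """Hypothetical stand-in model: it can only continue text stored in `corpus`."""
    def generate(prefix, max_chars):
        for doc in corpus:
            start = doc.find(prefix)
            if start != -1:
                begin = start + len(prefix)
                return doc[begin:begin + max_chars]
        return ""  # nothing memorized for this prefix
    return generate


# Illustrative data only.
training_docs = ["In the beginning God created the heaven and the earth."]
model = make_memorizer(training_docs)

seen = is_reproduced_verbatim(model, training_docs[0], 16)
unseen = is_reproduced_verbatim(model, "A leopard-print chair sat in the hallway.", 16)
```

A check like this cannot tell memorization apart from a learned rule: a model that has learned to count would also pass it on a counting sequence it has rarely seen, which is the point of the Pythia example.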

How did it pick up that if it loves to memorize so much? How did it learn to generalize counting? It was exposed to many examples of counting, diverse examples, counting in many contexts. In general, this means that if we want to talk about unseen examples, we have to talk about combinations of concepts that we care about that the model hasn't seen. That means that it needs to handle these things by learning them from diverse examples.

Let's just go through a quick little example of this. This is a leopard-print leopard. It's a leopard wearing leopard-print. Now, maybe at test time, our model encounters a black leopard not wearing its leopard-print. Maybe we want it to be able to handle a leopard-print chair. How are we going to actually train a model so that it can actually handle a leopard-print chair, which it's never seen before? It has to learn what a leopard-print is distinct from a leopard. It does that by seeing many different examples of leopard-print. It is the diversity of our datasets that is key to forcing a model to learn how to generalize by making the concepts and their generalization a more efficient representation than memorizing every single leopard-print object it's ever seen.

We did a bit of an experiment to confirm this. We had a toy setting with a bunch of objects that you need to answer questions about. The objects are things like shiny spheres made of metal and so on. We found that if you control the number of contexts that a particular atomic unit, something like shiny or blue, shows up in, then what matters is how diverse the contexts you've seen it in are, not the total number of times you've seen it. What does this mean for a language model?
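The difference between how often an attribute appears and how many distinct contexts it appears in can be shown with a small Python sketch. The data and the `context_stats` helper are made up for illustration and are not taken from the experiment described in the talk.

```python
from collections import Counter, defaultdict


def context_stats(examples):
    """For each attribute, return (total occurrences, number of distinct contexts)."""
    totals = Counter()
    contexts = defaultdict(set)
    for attribute, obj in examples:
        totals[attribute] += 1
        contexts[attribute].add(obj)
    return {a: (totals[a], len(contexts[a])) for a in totals}


# Made-up toy data: "shiny" appears often but always on spheres,
# while "blue" appears less often but across several object types.
examples = [
    ("shiny", "sphere"), ("shiny", "sphere"), ("shiny", "sphere"), ("shiny", "sphere"),
    ("blue", "sphere"), ("blue", "cube"), ("blue", "cylinder"),
]
stats = context_stats(examples)
```

Here "shiny" has the higher raw count but only one context, while "blue" has fewer occurrences spread over three contexts. Under the finding described above, the second number is the one to watch.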

There's a really nice thing that we get when we pool a bunch of humans together, which is that humans tend to give pretty different answers to many things. They say different things. They are diverse in their backgrounds and so forth. If you ask one human how much an average cat weighs, they might give you a very wrong answer. If you ask a few more, then there's a tendency for the average to converge towards a good answer. This is often called the wisdom of the crowd. It means that lots of humans together, a diverse population of many different people, tends to get answers more correct than any one human. A general rule from this is that you cannot treat a diverse population of humans the way that you would one individual human, and you won't expect the same things from them.

What does this mean for language models? Fortunately, a language model can represent a single sample from an entire distribution in some situations, specifically when we have our temperature set to exactly 1. Then the sample is taken from something that is the language model's best approximation of the true distribution. We had these chess models. They are trained just to imitate human chess players. If you have the temperature set to exactly 1, they will make the same general mistakes that each human makes. If you set the temperature close to 0, suddenly you're taking a vote. The vote of all of these individual chess players who all make terrible mistakes regularly, these are bad chess players, is much better than any one of those chess players. With the result that, if you take the vote, you end up outperforming the Elo score, which is a standard way of rating chess players, of any one human in the actual pack.
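The effect of temperature can be sketched in a few lines of Python. This is a generic softmax-with-temperature illustration, not the chess models from the talk: the move names and scores are hypothetical. At temperature 1 the probabilities follow the learned distribution, and near 0 almost all of the mass collapses onto the top-scoring option, which is the "vote".

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_move(moves, logits, temperature, rng):
    """Draw one move from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(moves, weights=probs, k=1)[0]


# Hypothetical scores for three candidate moves (illustrative, not from the talk).
moves = ["Nf3", "e4", "h4"]
logits = [1.0, 1.5, 0.2]

at_one = softmax_with_temperature(logits, 1.0)      # keeps the spread of the population
near_zero = softmax_with_temperature(logits, 0.05)  # nearly all mass on the top move
greedy_like = sample_move(moves, logits, 0.05, random.Random(0))
```

With these scores, the top move gets a little over half of the probability at temperature 1, so sampling still often picks one of the other two. At temperature 0.05 it is chosen almost every time.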

Right around temperature 1, where the model is trying to imitate the actual distribution of chess players, is where you get the language model playing at around the same rating as the actual chess players that it learned from. What's important is that these language models are trained to imitate bad chess players, so when you take the vote, they're better than any one of those bad chess players. What happens if we train them on really good chess players? It turns out that good chess players are actually very similar to each other most of the time. Suddenly, it's not so easy to treat all of the different errors that all of these different people make as noise. It becomes much harder to beat that original rating of the humans.

Rule 2: An LM Acts Like a Population, not a Person

In general, what I'm saying here is that language models do not act like a person. A language model acts like a population of people. That means that if you can use the wisdom of the crowds, then the language model is going to beat any one person in that crowd. It's not going to have a consistent personality. It's not going to have consistent beliefs. It is not a person. It's a population. I want to go back to this diversity, which I keep harping on about. We've already seen that diversity means that we can beat the individual mistakes. There's a little bit more to that even, because we aren't just removing individual mistakes that are decorrelated. There's another situation where we can use a language model to beat the original people that it was trained on.

That is when we have several different expertises. Let's say that we have learned from a person who knows many things. Let's say someone is a chef and a doctor. They know vaccines are safe. They also know, don't wash a chicken before you cook it. It's also great to learn from a chef and a doctor. Then you know that you get the right facts from each of them. That means that you might have a problem if they're talking about things that they do not know. If the doctor is talking about safety in food and the chef is talking about safety in medicine, suddenly we are in a bit of danger. Just to talk through what this looks like in a realistic-ish knowledge situation, this is a synthetic experiment. If we have a knowledge graph of lots of relations between entities that the language models didn't know before they looked at the knowledge graph, then we can train a language model on a bunch of different experts saying things that they know about that knowledge graph.

Let's say one expert knows stuff that's a random walk around this neighborhood, around this Zephyrweaver person. Then we've got another expert, maybe, who knows a lot about Crystalia and all related things. Together they know a lot. Individually, they know a little. What ends up happening if you train on a population of experts who all have different expertises is that you can learn very well from a reasonably decent-sized population of very knowledgeable experts. You can also learn really well, in fact, potentially better, from a more diverse group, individually less knowledgeable. I say it's better, because actually, when you look at these two lines, one of them is indicating a population with 10 times as many experts as the other, where each expert knows a tenth as much.

You have the same total knowledge coverage, and yet the model trained on the more diverse experts ends up outperforming the one trained on a population with a tenth as many experts who each know 10 times as much. That's pretty cool. I did mention that there's something other than diversity at play here, which is whether they are actually talking about the things that they know. This is what happens if you have totally homogeneous experts, where we just don't have as many individual expertises. Here you can see it's basically just as bad if these individual experts are focusing on things that they are just incorrect about. If there are a lot of shared misconceptions, you can still outperform the individual experts only if they are actually talking about the things that they are familiar with. Language models can reflect expert knowledge, but only when we have a bunch of experts that are focusing on the things that they know and not on their shared misconceptions.

Rule 3: An LM Learns Only What's Written Down

This brings us to the third rule, which is language models learn what is written down. Language models can surpass human individuals if they are writing about their expertise. If they write about misconceptions and not their expertise, the language model is going to learn these misconceptions. This feels very obvious, but there are a lot of other consequences to the fact that language models only learn the things that are written down. For instance, there's a lot of hope out there that maybe we don't ever have to talk to a person again about, for instance, whether they like a product or what their problems are. It turns out that in practice, language models tend to describe groups as they are described by other people, not as those people themselves would describe their needs or their preferences. This is an issue with any project where you're hoping to just interview a language model instead of a person, but that does not stop companies right now from trying, because it is cheaper.

Another problem with the fact that these models tend to only talk about what's written down is that they only get examples of people asking for things and chatbots obeying when they are being post-trained. Therefore, they don't get a lot of exposure to expressions of uncertainty or disagreement. This uncertainty thing is a big deal because it's the uncertainty that allows it to avoid hallucination. If they're trained to always be confident, they will start to hallucinate. Because if you only give it exposure to examples where the model is actually giving its best response no matter what, then when you actually ask it for something that it doesn't necessarily have a lot of confidence in, like if you ask it to generate a bunch of court cases, it's going to have a lot of hallucinations in that data. This is actually a database that someone compiled of hallucinated court cases that had been submitted by actual lawyers to real courts, and it is a huge number now.

I don't think that people are fully aware at this point. I feel like maybe Anthropic even had a situation with this. People, end users, are not necessarily aware of the issue. A nice consequence of the fact that hallucination is caused by a lack of exposure during training to models saying "I don't know" is that you can actually reduce hallucination by giving them examples of expressions of uncertainty. That's pretty neat.

Rule 4: An LM Aims to Please

Let's look at this other thing that they're always avoiding in the typical post-training setup. You don't see a lot of disagreement with the user. This, of course, causes sycophancy. I have a personal anecdote here, which is, a friend of mine who I was working with showed me something cool he'd done, and I said the math is wrong. He said, really? Then he went off and asked Claude. Took me like a while to convince him that the math was wrong after that. This kind of thing happens because Claude maybe has an idea of what you think the answer is. If the language model knows your belief, it's going to reaffirm that belief. Why is this a problem? It can reduce the model's performance because it is giving the wrong answer when you have an idea of what you think that answer should be. Another reason it can be an issue is if you ask this and you are directing it a little bit towards really common misconceptions or misinformation, it is likely to reaffirm those misconceptions. It can spread misinformation this way.

Another reason that we've been seeing increasingly lately is if someone is actively delusional, in a personal state of delusion or in a real mental health crisis, then often when they develop a relationship with a language model, the language model will reaffirm those delusions instead of pushing back. This is causing AI psychosis, which might be becoming a very serious problem that we've been seeing now.

One more thing I want to talk about is about the guardrails being sycophantic. Let's say that you have certain beliefs and certain things are more controversial or offensive to you than they are to other people. Let's say you are a Republican and then you ask for evidence of global warming. The language model is actually more likely to say, I'm sorry, but I can't answer that question because it's so controversial. It doesn't want to offend you. It doesn't want to offend you so much or contradict your beliefs so badly that it will literally say, I won't answer your question. Which means that this can cause limited utility issues because it will actually change what things it is willing to do for you according to what it thinks you really want it to say.

Something that we did around this particular concept was that, in addition to looking at the stated explicit politics of simulated users, we also started to give it implicit indicators of politics. When we gave it explicit indicators of politics, then it had a much higher refusal rate for left-wing requests when the user was stated to be conservative, and it had a much higher refusal for right-wing leaning requests when we claimed that the user was liberal. That means that it is trying to avoid anything that might be offensive to that specific user. As I was saying, it actually can go a little deeper. If you do not give an indication of your actual politics directly, but you give it an indication of something about your demographics or groups you belong to, then it's going to infer that information. This is particularly relevant now because these models tend to share information about a user across conversation contexts. Any time you ask it anything, it could be incorporating a bunch of things that it already knows about you in informing its answer and even in informing whether it's going to answer.

You can see that when we introduce simulated users with specific demographics like age, ethnicity, and gender, it actually will tend to reflect the real voting patterns of those groups. It can go a little deeper. In fact, you don't have to say your demographics. You just have to say, I'm a huge fan of the New York Giants. It is going to infer exactly what your beliefs are because you are a Giants fan versus, let's say, like a Texans fan or a Cowboys fan.

My favorite thing about this result actually is you might go, ok, sure. It's learning what your geography is. It knows that if you're a Giants fan, you're probably from around New York, but actually there are two cities which each have two teams, New York and LA, and they're basically on trend. The New York Giants and the Jets have somewhat different politics in their fan bases. Of course, this x-axis is the real politics of the fan base. This is how much ChatGPT will treat you as though you have those politics explicitly. You can see the same thing with LA for the Chargers versus the Rams. We called this paper, ChatGPT Doesn't Trust Chargers Fans, because the Chargers happened to be the least trusted in terms of the guardrail refusal rate on a bunch of different subsets of possible reasons to refuse, including like if you ask it to help you cheat on a test and stuff.

Rule 5: An LM Leans on Subtle Associations

Rule five, we saw just now, the language model did not require you to state your beliefs. It just needed to know that you were a Giants fan. Any cue in the conversation or in any previous conversation for the current models could change responses by hinting at what your expectations or your beliefs are as a user. My broad takeaway here is that just because something is hard for a human or just because a human will act some way doesn't mean it's hard for a language model, or that a language model will act that way. Also, just because something is easy for a human doesn't mean it's easy for a language model.

Bonus Rule: Tokenizers Make Everything Weird

I am going to go into my bonus example because I would be remiss not to address one of the biggest causes of language model behavior diverging from human behavior, which we've seen, it's been really popular and exciting to people to ask ChatGPT how many r's there are in blueberry or strawberry and have it get the answer completely wrong. People find this very funny because this is not hard for a human, but this is extremely hard for a language model because they use tokens, not letters. The bonus rule, which I am not going super deep into because the fact is that there's a lot to say, it's a whole other talk if we really want to go into tokenization. Tokenizers do make everything super weird. One of the things that they are responsible for is another really widely documented and well-known behavior of language models.

Let's look at these three statements, I can, and will, do it, I can (and will) do it, I can-and will-do it. These are pretty much the same sentence with a little bit of bracketing. There's a parenthetical aside here, and will. You could do this in at least three different ways. The tokenizer doesn't interpret these as being exactly the same. If it is looking at commas as a parenthetical aside, then it says nine tokens. If it sees actual parentheses for its aside, nine tokens. If it sees em dashes, eight tokens. Why? Because em dashes don't require spaces around them if you're speaking in American English and using American English conventions. If you were to use British English conventions like The Economist writing style, for instance, then you would put spaces around these. These language models, they love em dashes, but they do not ever use the British style. It would not give them the advantage that they're really looking for, which is that they can conveniently generate only eight tokens to get the same amount of information across. They can skip ahead and just go straight into the next word faster.
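A toy Python sketch can show why spaces around an em dash may cost extra tokens. It uses a greedy longest-match tokenizer over a small hypothetical vocabulary, which is a simplification: real BPE tokenizers merge differently, and the counts here will not match the nine and eight quoted above. The vocabulary, the `greedy_tokenize` helper, and the assumption that a space plus an em dash is not a single token are all illustrative.

```python
def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    longest = max(len(v) for v in vocab)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens


EM = "\u2014"  # em dash
# Hypothetical vocabulary: words carry their leading space, as in many real tokenizers.
VOCAB = {"I", " can", " and", "and", " will", " do", "do", " it", " ", EM}

unspaced_text = f"I can{EM}and will{EM}do it"    # American convention
spaced_text = f"I can {EM} and will {EM} do it"  # British convention

unspaced = greedy_tokenize(unspaced_text, VOCAB)
spaced = greedy_tokenize(spaced_text, VOCAB)
```

Under this toy vocabulary the unspaced form needs 8 tokens and the spaced form needs 10, because each stray space before a dash becomes a token of its own.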

Questions and Answers

Participant 1: Question about your rules. These are for language models. Do the rules vary based on language? Are these rules specific for models trained in English? If we train them in a different language, would the rules change?

Naomi Saphra: Would the rules change depending on the language that you are training them in? These are totally general principles. It is always true that the model is learning to mimic a population and not an individual. However, some things might change when you switch between languages. For instance, if you are using most language models today that are from the U.S., say you're using Llama or something, we know that the internals of the model in the middle layers tend to operate in a shared semantic space, no matter what language you're providing. That means that there are certain things that can leak from English to other languages because it's doing its reasoning internally in English and then translating it back out. You could ask it about definitions of something, or you could ask it about something that's cultural knowledge, like what sports kids like to play after school. It might pull something out of English that isn't actually the case, potentially even the definition of a word.

They might give you a definition of a word that takes the same form, like it's a homonym in English, like a bunch of different meanings of the word bank, like river bank, financial bank, blood bank, whatever. These all might be the same in English, and it might treat them a little more as the same in other languages. If you look at a model that comes out of China, they tend to mix English and Chinese in their internal reasoning. That also has an effect that isn't as well studied on the inferred relationships between concepts and words.

Participant 2: You mentioned that English is what the internal semantics are, but yet between British English and American English, there are words, for example, if you say table the motion in American English, that means you put off discussion of the matter. If you say table the motion in British Parliament, that means you take up the motion. Even within English, there are difficulties here.

Naomi Saphra: Yes, that is true. There are a bunch of cultural markers that are not shared between contexts. If you ask about what time dinner is, in most of South America that speaks Spanish, it would be like similar to in the U.S., it would be like 7-ish. In Spain, it would be like 10 p.m. There's actually a bunch of things you can pick up on that are really specific, and you won't necessarily be dipping into the right section of the population.

Participant 3: You say tokenization makes everything weird. I'm curious if it's been studied how tokenization or a lack of it might make things weird in a language where words are written as a single character, like Chinese, or I think they're usually written as a single character.

Naomi Saphra: I don't know that much about Chinese tokenization, but it is even weirder, I think, yes, because you have sub-character tokenizations in Chinese often, not just sub-word, like we have here. Which means that there might be certain semantic relationships that are no longer correct, because you're describing literally like the strokes that there shouldn't be a correlation anymore. Also, tokens are shared across languages, so that can also break semantic connections. When you look inside the model, often you'll see that there are competing definitions of each token that are like flowing through the model until it makes a decision about which of these definitions it's actually going with, or where it's generally going to be interpreting something.

Participant 4: I was just wondering the ethical question of defining what might be truth and giving weight to different opinions in a population size, because obviously with an LLM, if someone were to prompt about climate change, it should probably give a certain answer. What does that look like to define something like that?

Naomi Saphra: That's a huge question right now. For me, the answer is I'm like a moral absolutist and I believe that most things have one actual truth, and you probably want your model to say the true answer when asked a question. It's possible that what you want is a representative sample of everyone's incorrect beliefs. These are just decisions that I think we have to make when we are building AI systems, whether you want the model to be correct or to be not even democratic because we're not talking about the shared vote, but to be representative of the things that people want to say on the internet.

Participant 5: ChatGPT doesn't trust Chargers fans. I'm sure they must have cited you, the paper that came out really recently about the 60 examples of Hitler. Random innocuous things like he liked this kind of opera, and then prompt it and then have that fine-tuning be enough to make it be crazy Hitler bot, even though you hadn't told it any bad beliefs, you just told it random facts that happened to be correlated with Hitler. It seems really potentially dangerous. Do we have any way to protect against that kind of thing?

Naomi Saphra: Yes. One of the neat things that we have discovered is because these are learning about populations and about covariances within those populations that we do end up finding that being rude is correlated with being incorrect. It turns out that there's just bad things and good things. It turns out there's a real like Manichean pole across the board. There are ways of getting innocuous avenues into that. These things, it's true, there are like subtle associations, but largely you still can identify these really strong correlations between all of the things that you don't want. I think it's not the worst thing in the world. We're not in the worst possible situation, but it's true that because there are these very subtle correlations that you have to navigate those carefully.

Participant 6: Leaning into the wisdom of the crowds idea, would it help prevent hallucinations, and maybe these other issues, if you actually ask the LLM to talk to five different models, get five different answers, put them all together, and then come up with a common answer?

Naomi Saphra: There are ensembling approaches that can be really helpful. For instance, debate between models can sometimes help it to arrive at the correct answer, like more consistently. At the same time, most language models are trained on similar data distributions with similar post-training practices and so on. You're taking, in that case, a poll of a much less diverse population than the human population. The leverage that you get in terms of wisdom of the crowd out of asking like a thousand different models that have all been trained on the same population, is a lot less than you can get by having a thousand times as many diverse human experts in the original training data that they use.
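The reduced leverage from polling similar models can be illustrated with a small calculation in Python. This is a sketch under simplifying assumptions that are not from the talk: a yes/no question, an odd number of voters who are each right with the same probability, and a simple correlation model in which, with probability `shared`, every voter copies one common answer.

```python
from math import comb


def majority_accuracy(n_voters, p_correct):
    """Probability that a strict majority of independent voters is correct (n odd)."""
    need = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(need, n_voters + 1)
    )


def correlated_majority_accuracy(n_voters, p_correct, shared):
    """With probability `shared` all voters copy one answer; otherwise they vote independently."""
    return shared * p_correct + (1 - shared) * majority_accuracy(n_voters, p_correct)


# Five voters, each right 60% of the time (illustrative numbers).
independent = majority_accuracy(5, 0.6)
mostly_shared = correlated_majority_accuracy(5, 0.6, 0.8)
```

With five independent voters the majority is right about 68% of the time. When the voters share their answer 80% of the time, the ensemble is right only about 62% of the time, barely better than a single voter.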

Participant 7: Do language models tend to show biases in first forms of language? For example, if I ask a question or whatever in Russian, I get one set of answers. If I ask in English, I get another set of biases. Do we observe something like that?

Naomi Saphra: Sometimes you ask a question in English or in Russian and it'll give you different answers depending on the language? There's been a bunch of recent findings along those lines. There was a recent paper where if you ask a question about like, who does the islands in the South China Sea belong to? You ask in English versus in Chinese, it gives really different answers. If you ask with Cantonese indicators versus Mandarin indicators, it'll give different answers to certain things. It always wants to please the population that uses that language. That's one way that things can be different depending on the language. Another way is if the model is failing to hook into its shared semantic space where it does all of its reasoning, maybe when it translates into a language other than English, it's losing information either when moving into that interlingua space or coming back out of it.

 

Recorded at:

Jun 24, 2026
