The story so far: One of the main functions of this newsletter is as an archive for explanations I find myself giving repeatedly to people. After about the third time I hear myself using a particular analogy or explanatory frame in an interview or private conversation, I think, “I should write this down so I can refer people to it.”
In that spirit, I want to take yet another crack at explaining how a large language model works. I’ve done this in previous articles (here and here), but this time I’ll do it on a very different plane of abstraction than I’ve used in earlier work. Actually, the explanation below is on two different planes of abstraction: I’ll start out with one picture, then I’ll nuance it further for readers who want to go a bit deeper.
The title of this post is a play on Thomas Nagel’s (in)famous 1974 essay, “What Is It Like to Be a Bat?” [PDF]. The homage is superficial because, unlike Nagel, I am not trying to get at some subjective quality of ChatGPT’s inner experience (whatever that is). Rather, I want to explain an aspect of language models that’s much more concrete and mundane, but that’s nonetheless widely misunderstood and the source of much confusion: how state functions in an ML model.
“State” is an overloaded term, especially in machine learning. In this post, I’m talking about the simple computer science concept of state as the information you cram into memory for the machine to access and work with. Let’s look at Wikipedia’s definition:
In information technology and computer science, a system is described as stateful if it is designed to remember preceding events or user interactions; the remembered information is called the state of the system.
🧠 If I ask ChatGPT, “On what continent is the nation of Tanzania?” the model will have to reach into its state — all the information it remembers about the world from its training run — to answer this question.
😶🌫️ If I ask ChatGPT, “On what continent is the nation of Barbaristan, the fictional setting of the unpublished, barbarian vampire mystery romance written by Jon Stokes?,” it can’t possibly know what I’m talking about. It may confidently make something up — LLMs still tend to do this — but when it does, it’s doing something more like a Google search that doesn’t find an exact match for a query and returns a “Did you mean…” set of results instead. The model reached into its state and pulled something out, but that something wasn’t a good match for my input query.
But by now, maybe you’ve read enough of my newsletter posts or you’ve heard enough podcasts on AI to have a bit of familiarity with the concept of the token window. If so, you’re aware that you can actually get ChatGPT to answer questions about an unpublished fictional work by Jon Stokes, provided you have a copy of that work you can dump into the token window alongside your query. The model can reference the text in the token window to extract new facts and information that’s fresher than what it was trained on.
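Mechanically, “dumping it into the token window” just means pasting the text into the prompt alongside your question. Here’s a rough sketch of what that looks like against the OpenAI chat API; the file name is hypothetical, the model name is a placeholder, and the exact client call will vary with your version of the openai library:

```python
# A minimal sketch, not production code: shove a document and a question into
# the same token window and let the model read both.
import openai

manuscript = open("barbaristan_draft.txt").read()  # hypothetical unpublished manuscript

response = openai.ChatCompletion.create(
    model="gpt-4",  # placeholder model name
    messages=[
        # The manuscript and the question travel together in the token window.
        {"role": "system", "content": "Answer using only the manuscript below.\n\n" + manuscript},
        {"role": "user", "content": "On what continent is the nation of Barbaristan?"},
    ],
)
print(response.choices[0].message.content)
```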
Here’s what it’s like to be ChatGPT (the simple version)
When it comes to state that captures information about the world, OpenAI’s model is like Rip Van Winkle, the character in Washington Irving’s 1819 short story who famously over-imbibed and fell asleep under a tree for twenty years. When he woke up, he found he had slept through the American Revolution.
If ChatGPT’s training cutoff was, say, September 1st, 2021, then the model quit learning new information on that date. In computer science terms, we can say that its state was last updated on that date.
Every time you open a new session with ChatGPT, it’s like you’ve found it asleep under a tree where it has been dozing since 9/1/2021, and you slap it awake and hand it a sheet of paper with a question scrawled on it, e.g., “on what continent is the nation of Tanzania located?”
Since Tanzania existed on the continent of Africa on the day it dozed off (or finished training), ChatGPT can readily answer this question. So it takes the paper from you and scribbles, “Tanzania is located on the continent of Africa.”
And then it immediately goes back to sleep. It just dozes right back off in front of you.
👉 Important: ChatGPT does this Rip Van Winkle act every time you ask it a question, even during a long back-and-forth that takes place over the course of a single chat session. Every time you send the model a new question, even a follow-up or a request for clarification of something you’ve previously asked it, you’re smacking awake an entity that has been asleep since 2021 and has no memory of anything after that.
📝 👀 How, then, can you develop a dialogue with it over the course of a session? Because both of you are communicating by writing on the same piece of paper. So every time you wake the model and shove that paper in its face, the model has no idea what has happened in the world since 2021 other than whatever is written on that piece of paper in front of it.
The newly awakened model, then, looks over this paper and discovers it contains a dialogue between two complete strangers — one of them a chatbot and the other a human — and it knows it’s being asked to continue that dialogue by scribbling in a few new lines at the bottom in the voice of the chatbot. So it complies and then, once again, immediately goes back to sleep.
If there’s some information about the world that post-dates September 2021 and is not on the paper you’re using to keep track of the dialogue you’re having with ChatGPT, then ChatGPT does not and indeed cannot know about it. Its entire picture of the world is whatever it was trained on plus whatever is on that piece of paper that got shoved in its face when it woke up.
That paper, as you’ve probably guessed, is the model’s token window.
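In code, that shared piece of paper is just a list of messages that the client keeps around and re-sends in full on every turn. A sketch, again assuming the OpenAI chat API with a placeholder model name:

```python
# The "shared piece of paper": the client holds the whole transcript and re-sends
# it every time, because the model remembers nothing between calls.
import openai

paper = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(question: str) -> str:
    paper.append({"role": "user", "content": question})        # write on the paper
    reply = openai.ChatCompletion.create(model="gpt-4", messages=paper)
    answer = reply.choices[0].message.content
    paper.append({"role": "assistant", "content": answer})     # the model's scribble
    return answer                                              # ...and it's asleep again

print(ask("On what continent is the nation of Tanzania?"))
print(ask("What is its capital?"))  # only works because the first exchange is still on the paper
```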
Here’s what it’s like to be ChatGPT (more complicated version)
The above story is accurate enough for most purposes, but if we nuance it a bit we can actually get some useful insight into the economics of language models like ChatGPT.
🔄 In a more complicated but realistic version of our story, Rip Van Winkle can write down only one word on the shared paper before going back to sleep. So the sequence of events in a single exchange of a chat dialogue is something like this:
1. User shakes ChatGPT awake and hands it a piece of paper with the words, “On what continent is the nation of Tanzania?”
2. ChatGPT reads the paper and thinks for a bit. Then it writes the word “The” at the bottom before blacking out and falling over.
3. User slaps ChatGPT again and shoves the paper in its face again.
4. ChatGPT reads the paper, which is now one word longer so it takes a little more time to read than it did in step 2, and writes “nation” after the previous “The” before blacking out again.

(Repeat steps 3 and 4 until ChatGPT has written out its entire answer, with each ChatGPT step taking a little more effort than the previous one because the model now has to read one additional word.)
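If you want to watch this loop run, here’s a sketch using GPT-2, a small open model, as a stand-in for ChatGPT (whose internals aren’t public). Each pass re-reads the whole, slightly longer context and emits exactly one new token. Don’t expect a good answer from a model this tiny; the point is the shape of the loop. (Production systems cache some of the re-reading work between steps, but the one-token-per-wake-up structure is the same.)

```python
# Wake up => read the whole paper => write one token => black out. Repeat.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

paper = tokenizer("On what continent is the nation of Tanzania?", return_tensors="pt").input_ids

for _ in range(12):                                   # twelve wake-ups, one token each
    with torch.no_grad():
        logits = model(paper).logits                  # re-read everything on the paper
    next_token = logits[0, -1].argmax()               # the single most likely next token
    paper = torch.cat([paper, next_token.view(1, 1)], dim=1)  # scribble it at the bottom

print(tokenizer.decode(paper[0]))
```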
📈 Every time ChatGPT is awakened and has to read its token window, the amount of time and energy required to do this depends on how many words are in that window. The more words are on the paper in front of it, the longer it takes to read it and the harder it has to think.
OpenAI’s challenge, then, is to come up with an API pricing scheme that meets the following three requirements:

1. It reflects the fact that the inference cost (i.e. the cost of running the model to get the next word) increases non-linearly with the number of tokens in the token window. (I’ve read it increases with the square of the number of tokens, but I’m not so sure this applies to the latest version of ChatGPT. The toy sketch after this list shows what that kind of scaling would imply.)
2. It’s simple enough for OpenAI’s API users to understand and work with.
3. It doesn’t give away how much inference actually costs OpenAI, because that’s proprietary information they don’t want their competitors to know.
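To make that first requirement concrete, here’s a toy comparison (emphatically not OpenAI’s real cost model) between a flat per-token bill and a compute proxy that grows with the square of the context length, which is the hedged assumption above:

```python
# Toy numbers only: compare linear billing to an assumed quadratic compute cost.
PRICE_PER_INPUT_TOKEN = 0.03 / 1000   # the 8K-window input price quoted just below

def compute_proxy(context_tokens: int) -> int:
    return context_tokens ** 2        # assumed (not confirmed) cost of one forward pass

for n in (1_000, 4_000, 8_000):
    print(f"{n:>5} tokens | billed ${n * PRICE_PER_INPUT_TOKEN:5.2f} "
          f"| compute proxy {compute_proxy(n):>12,}")

# Billing scales 1x -> 4x -> 8x across these sizes, while the quadratic proxy
# scales 1x -> 16x -> 64x: a simple per-token price papers over a lopsided cost curve.
```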
OpenAI’s answer to this is to charge one price for the tokens the user writes into the window (input tokens) and twice as much for the tokens ChatGPT writes into the window (completion tokens). This approach makes intuitive sense because the number of completion tokens is equal to the number of times the model has to be woken up and asked to read and respond to a mass of text.
The other part of the OpenAI pricing picture is that when it quadruples the size of the token window from 8K to 32K tokens, it doubles the pricing on all the tokens. Input tokens jump from $0.03 per 1K tokens to $0.06 per 1K tokens, and completion tokens go from $0.06 per 1K to $0.12 per 1K.
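At those published rates, a quick back-of-the-envelope helper makes the doubling concrete (the request sizes below are invented for illustration):

```python
# Prices quoted above, in dollars per 1K tokens.
PRICES = {
    "gpt-4-8k":  {"input": 0.03, "completion": 0.06},
    "gpt-4-32k": {"input": 0.06, "completion": 0.12},
}

def request_cost(model: str, input_tokens: int, completion_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens / 1000) * p["input"] + (completion_tokens / 1000) * p["completion"]

# A 6,000-token document-plus-question that gets a 500-token answer:
print(f"${request_cost('gpt-4-8k', 6_000, 500):.2f}")   # $0.21
print(f"${request_cost('gpt-4-32k', 6_000, 500):.2f}")  # $0.42 for the same job on the bigger window
```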
The larger window costs more because, on average, you’re going to be putting more tokens in it for ChatGPT to read every time it wakes up. It actually probably costs OpenAI a lot more than double on average for people to use the 32K window, so I’m guessing that larger window is subsidized somehow — either by Microsoft or by the smaller token window’s users overpaying.
Altered state
To return to the computer science concept I introduced at the start of this article, we can say that ChatGPT has two types of state:
1. Model weights: large, fixed, read-only, not updated after training ends.
2. Token window: small, both writeable and readable, can be updated with whatever a user wants.
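In plain data-structure terms, here’s a toy sketch of that split. The weight names and values are invented, and a real model’s weights are billions of numbers rather than a four-entry dict:

```python
# Two kinds of state: frozen weights vs. a writeable token window.
from types import MappingProxyType

# 1. Model weights: huge in reality, fixed at training time, read-only at inference.
weights = MappingProxyType({
    "layer_0.attn.w_q": [0.12, -0.98],   # invented names and values
    "layer_0.mlp.w":    [1.40,  0.07],
})

# 2. Token window: small, read/write, holds whatever the user (and the model) writes there.
token_window = ["On", "what", "continent", "is", "the", "nation", "of", "Tanzania", "?"]

token_window.append("Africa")            # fine: the window is writeable
try:
    weights["layer_0.mlp.w"] = [0.0]     # not fine: the weights are frozen
except TypeError:
    print("Weights are read-only once training ends.")
```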
When you’re using an LLM of any type, whether it has a chatbot UX slapped over the top of it or it’s just a raw LLM, these are the only two kinds of state you have to work with.
Savvy readers will know that it’s possible to add new facts to a model after its main training phase is over by using fine-tuning and reinforcement learning. I’ll cover those in a future article. But to briefly preview in terms of our Rip Van Winkle analogy:
Fine-tuning is like if a schoolteacher found him under the tree, woke him up and gave him a quick lesson on the American Revolution, then let him drift back off, but now he remembers the lesson.
Reinforcement learning with human feedback (RLHF) is like if some of the school kids found him and woke him up again (after the schoolteacher was done with him), and trained him to use current slang by either beating him or rewarding him for his choice of words when answering their questions. Then they let him go back to sleep having learned to speak properly.
Again, after both of these post-training phases, users are still going through the same amnesiac loop: wake up => read the token window => add a single word to the window => go back to sleep and forget everything it just saw. But the results are fresher and more responsive to current human desires and intuitions because of the fine-tuning and RLHF phases.