Catechizing the Bots, Part 1: Foundation Models and Fine-Tuning
A basic introduction to fine-tuning foundation models.
The story so far: We’ve all read the endless commentary on ChatGPT’s political biases, and in fact, I’ve written a few tweets on this topic, myself. But where do these biases come from? How is this large language model, trained on trillions of words of text, given a particular worldview, a set of values, political opinions, or, if we’re being generous, “guardrails”?
This business of taking pre-trained foundation models and infusing them with values, morals, and politics is undoubtedly the most contested and politically sensitive part of the whole AI endeavor. This is true no matter which “AI safety” camp you fall into. Whether you’re worried about existential risks to humanity, representational harms and microaggressions, or ML-powered industrial control systems gone wild, it all begins and ends with the processes I’ll describe in this article and the next.
This is the part of the whole AI picture where the models are humanized. Or, to use the language of my colleague, we could say it’s where the models are catechized — it’s where they’re instructed morally.
How does this catechesis work? What texts form the basis for it? Who are the people writing and/or collecting these texts? Whose values do these texts express?
These are all important questions, and I plan to chip away at them over the course of this series.
The large language models we’re using now, especially the models from OpenAI, Google, and Anthropic, all have something important in common: they’ve gone through a set of post-training fine-tuning phases that make them easier for humans to use but at a cost.
In this series, I’ll talk about what those phases are and what their downsides are. When I’m done, I hope a few points will be clear:
Humans play a critical role in finishing off LLMs and making them work the way we want them to work.
Following on the above, which humans are tasked with shaping the models matters a great deal — their values, education, intelligence, politics, etc. All of this affects the output of the models they work on, and by the time you’re finished with this article that relationship should be pretty apparent and straightforward.
The main way selected groups of humans shape LLMs is by selecting, rating, and even generating the texts used for fine-tuning and reinforcement learning. This is a form of textual scholarship and should be treated as such.
In the near term, many of us will get the chance to fine-tune models for widespread use. We should take that chance because this work matters.
In the long term, we may not actually need fine-tuning. It’s quite possible that we’ll be able to use models that haven’t been fine-tuned just as capably as we use models that have.
Even if we don’t end up needing to do fine-tuning or reinforcement learning, we’re still going to be curating and generating collections of texts for the sole purpose of shaping and steering LLMs morally, politically, and socially.
Building an LLM in phases
Most readers who’ve at least skimmed some of my earlier posts will be familiar with the basic concept of training a model. (This is actually now called “pretraining,” but I’m going to stick with just “training.”) This training phase, where a model’s weights are progressively adjusted in passes by exposing it to hundreds of billions of examples of language, is only the first of three phases most LLMs go through right now: training, supervised fine-tuning, and reinforcement learning with human feedback.
While the training phase gives the model an understanding of the structure of language and a set of facts about the world, the latter two phases are aimed at getting the model into a shape that we humans can more easily use.
Foundation models
When a large language model (LLM) has completed its training, it’s not actually very usable, at least if you’re hoping to feed it an uncomplicated text prompt and get something helpful back. These so-called foundation models have been trained to predict the next word in a sequence, and as a result, they can produce coherent-sounding sentences that are related to the prompt but that don’t feel like a response to your input.
🗣️ To anthropomorphize a bit, a foundation model has no social skills. When you ask it a question, it extemporizes a brand new text document that’s connected to the prompt but that lacks any qualities that might make you, the human questioner, feel like it’s a competent dialogue partner giving you a direct response.
Example: Imagine you’re a tourist in some foreign city, and you come across a scrap of paper in the street with some lines of text on it. Most of the text is damaged and unreadable, but amidst all the mess you can make out the words “How do I change a tire?”. In order to interpret that scrap of text — as part of a dialogue between two people, a line from an advertisement, a mysterious message left just for you, etc. — you’ll look for other clues on the page. You may look for some graphic design elements, or at the way the missing text is laid out on the page, or you may consider the part of town you found the note in.
When you put that same text into GPT-4’s foundation model, it enters the model totally stripped of any such interpretive cues and clues. All the model has to go on is this disembodied scrap of text that comes with no other context. Wat mean?
⁉️ So if I ask GPT-4’s foundation model: “How do I change a tire?” I might get any of the following bits of text as output:
“I’m sorry for calling you this late with this question, but I’m stuck on the side of a busy road and I need help.”
“Take your time answering. No pressure 😆.”
“James stared at the flat and repeated the question to himself, regretting that he had never paid attention the few times his father had changed flat tires on family road trips.”
“Begin by ensuring that the car is safely off the shoulder of the road and away from traffic…”
In other words, when a prompt contains a direct question with no context clues that can guide interpretation, there’s actually no reason at all for a foundation model to assume the most appropriate output is an answer to the prompt’s question. These foundation models are trained to complete sentences that are missing words — to “predict the next token.” They’re not (yet) trained to actually interpret prompts based on any inferences about what a “user” may have wanted.
🤷♂️ The foundation model either needs much more information added to the prompt alongside the question if it’s going to know how to respond, or it needs to be further trained to assume that the most appropriate output for a direct question is a direct answer.
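If you want to see this behavior for yourself, you can sample completions from a small open base model. The sketch below uses GPT-2 via the Hugging Face transformers library purely as a stand-in for a foundation model like GPT-4’s; the model choice and sampling settings are my own illustrative assumptions.

```python
# A minimal sketch: sampling raw continuations from a base (non-fine-tuned) model.
# GPT-2 is used here only as a small, freely available stand-in for a foundation model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "How do I change a tire?"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample several continuations; a base model has no particular reason to prefer
# a direct answer over any other plausible continuation of the text.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.9,
    max_new_tokens=40,
    num_return_sequences=4,
    pad_token_id=tokenizer.eos_token_id,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
    print("---")
```

Run this a handful of times and the continuations wander in much the same way as the example outputs above.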
👉 To summarize, a foundation model has the following qualities:
It’s a model weights file that can be copied, distributed, and used by anyone who has the right combination of hardware and supporting code.
It’s a raw industrial product, the unfinished output of a capital-intensive industrial process.
Its output is not necessarily shaped in such a way that the user feels like she’s actually interacting with a mind. The feeling is more like “using an incantation to summon new documents from the aether” than it is “speaking to a knowledgeable person.”
The supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) phases, then, turn this difficult-to-use model file into something that tends to respond to inputs in ways we humans experience as appropriate — responding to questions with relevant answers, to instructions with appropriate changes in behavior, to requests with the information requested, and so on.
The SFT phase gets the foundation model a little closer to this goal by giving it a set of scripts or patterns that supply the missing context for interpreting the most common types of input it’ll get from users, and the RLHF phase gets it the rest of the way there by instructing the model on what it should and shouldn’t be saying for reasons of safety or appropriateness. In a later section, we’ll drill down on the SFT phase. Our discussion of RLHF will have to wait for Part 2.
An analogy: maps and globes
You might think of a large foundation model as a multivolume atlas of all of human cognitive reality. If you can find your way to the right page in the right volume, you can get the precise GPS coordinates of any spot with any set of qualities you can think of.
🗺️ Some of the atlas’s volumes have maps that convey location information using traditional navigation concepts like streets and roads, while others are maps of rainfall, or foliage, or air pollution, or sushi restaurants, or favorite points for taking selfies if you’re a Libra between the ages of 19 and 36 from the Pacific Northwest.
If you’re trying to use this sprawling, feature-dense atlas to navigate your way to a particular concept — let’s say you want to drop a pin on that concept, then go in real life to the place you dropped the pin and see what’s there — the work itself is so massive, and its coordinate system so complex and hard to work with, that unless you know exactly how to use it, you’re very likely to drop your pin on a set of coordinates that, when you actually navigate to them, land you in the wrong place.
So you can’t just go into this atlas with a simple street address and expect results, because you first need to locate the volumes that actually contain the street maps, and that may take quite a bit of searching and a little luck.
📌 In this atlas metaphor, prompting the model amounts to dropping a pin on a location by using some information — a street address, some topographical information, a set of latitude and longitude coordinates, etc. — about that target location. Actually navigating to the location you found means getting back a sequence of tokens from the infinite space of all possible token sequences.
I like this atlas/maps metaphor for a few reasons:
There are different kinds of maps that represent different features of the same landscape.
Maps have different projections, and these projections emphasize different parts of the globe — they make some areas look larger and others look smaller. And these projections have political consequences!
The map is not the territory. Rather, a map is a representation that you can use to find a piece of territory you’re looking for. But once you’ve located a point on the map, you have to actually make your way to the represented spot if you want to see it.
The surface of the earth is infinitely sub-dividable. So any given map actually represents an infinite number of geographic points. Furthermore, a particular pin stuck into the map actually corresponds to a whole region of actual space — if it’s a really large map, then the corresponding region is small, and if it’s a small map then the corresponding region is large.
If a particular geographic region isn’t represented in a particular map, that doesn’t mean that region doesn’t exist; it just means you can’t find it via that map.
⭐️ I went to the trouble of constructing this map metaphor because it gives you a sense of just how unwieldy a foundation model is to work with, and why. To wrangle this multivolume work into something that normal people can use for everyday navigation tasks, two different approaches present themselves:
Create some kind of index for the atlas that highlights points of interest and makes the most common searches easier to carry out — essentially a map for the maps.
Rearrange the atlas itself so that the most popular volumes are at the front of the collection and at eye level, with the more obscure volumes tucked away up high or on some shelf that’s hard to reach. That way, even unsophisticated users are likely to find something useful even if they’re just browsing around.
There are some efforts underway to take the first approach with LLMs — to essentially leave the model weights alone and just help steer naive users to the right spot in latent space by tweaking their prompts in some way that makes them more productive.
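As a toy illustration of this first approach (my own sketch, not any particular product’s technique), “tweaking the prompt” can be as simple as wrapping the user’s bare question in a template that supplies the interpretive context the foundation model otherwise lacks.

```python
# A toy example of approach #1: leave the model weights alone and steer the model
# by adding context around the user's question. The template wording here is an
# illustrative assumption, not any vendor's actual system prompt.
def wrap_prompt(user_question: str) -> str:
    return (
        "The following is a conversation between a helpful expert and a user.\n"
        "The expert answers every question directly and accurately.\n\n"
        f"User: {user_question}\n"
        "Expert:"
    )

print(wrap_prompt("How do I change a tire?"))
```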
But most of the current efforts at making foundation models more usable involve approach #2, where you’re actually changing the layout and organization of the atlas so it presents to users as smaller and easier to navigate even if it still contains essentially the same material. I put SFT and RLHF into this second category of approaches.
Supervised fine-tuning
Fine-tuning is a method for rearranging a foundation model so that it’s equipped with a set of useful assumptions about the kinds of inputs it’s going to get and outputs it should give. At its most basic level, supervised fine-tuning tweaks the weights of an already trained model by exposing it to a much smaller collection of examples. So a model that’s trained on trillions of tokens of text might be subsequently fine-tuned on a few tens or hundreds of thousands of tokens of more carefully selected text.
📚 I tend to think of fine-tuning as a method for anchoring the model’s output in a particular subset of language patterns it has already learned. It’s not so much that fine-tuning teaches the model these new patterns — i.e., brainstorming, question-and-answer, text extraction, etc. — it’s already seen all that stuff in its training run. Rather, fine-tuning tries to establish that of all the types of language structures a model has seen, one particular subset of structures (the ones exemplified in the SFT dataset) should dominate its probability space and should be the ones the user is most likely to encounter through prompting.
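To make this a bit more concrete, here’s a minimal sketch of what a supervised fine-tuning pass can look like with a small open model and the Hugging Face Trainer. The model choice (GPT-2), the file name, and the hyperparameters are all illustrative assumptions, not OpenAI’s actual recipe.

```python
# A minimal SFT sketch: continue training a pretrained causal LM on a small file
# of {"prompt": ..., "completion": ...} records (one JSON object per line).
import json
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

class SFTDataset(Dataset):
    def __init__(self, path, tokenizer, max_len=512):
        self.records = []
        with open(path) as f:
            for line in f:
                rec = json.loads(line)
                text = rec["prompt"] + rec["completion"] + tokenizer.eos_token
                self.records.append(tokenizer(text, truncation=True, max_length=max_len))

    def __len__(self):
        return len(self.records)

    def __getitem__(self, i):
        return self.records[i]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

dataset = SFTDataset("sft_pairs.jsonl", tokenizer)  # hypothetical dataset file
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # standard causal-LM objective

args = TrainingArguments(
    output_dir="sft-out",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    learning_rate=1e-5,  # much smaller than a typical pretraining learning rate
)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```

The point isn’t the specific hyperparameters; it’s that the same next-token objective from pretraining is now being run over a tiny, hand-picked canon of examples.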
Or, we could also think of SFT as catechizing the model on a particular canon — a collection of sacred texts that are intended to shape it more than all the other texts it has been exposed to. This may seem a bit weird or farfetched, but thinking about the SFT dataset as a kind of canon, on the pattern of scripture or of the Great Books, is useful for understanding the stakes in this kind of training, especially if the resulting model is going to play a role in the education of humans.
SFT datasets
When OpenAI was training InstructGPT, the predecessor model to GPT-3.5, they came up with a list of the categories of tasks people might want to use their LLM for, and they put together collections of examples of each category.
Here are a few of the task types from the appendix to their InstructGPT paper, along with examples of prompts and the desired corresponding outputs:
Brainstorming:
indie movie ideas:
- A guy travels to South America to become a shaman.
- A documentary about the world of juggling.
Baby name ideas for a boy:
1. Alfred
2. Theo
3.
Tell me a list of topics related to:
- interior design
- sustainable ecosystems
- fake plants
Rewrite:
Original: She no go to sleep.
Standard American English: She didn’t go to sleep
Covert my resume into a profile overview. {resume}
Profile overview:
Classification:
The following is a list of companies and the categories they fall into:
Apple, Facebook, Fedex
Apple
Category: Technology
Facebook
Category: Social Media
Fedex
Category:
Other types of tasks included:
text extraction
text generation
chat
closed and open question-and-answer
text summarization
For the purposes of fine-tuning the model, OpenAI assembled 13,000 prompts from two main sources:
Prompts that users had submitted to the company’s Playground site for GPT-3.
Prompts written by a carefully selected list of 40 human labelers.
🏷️ They then had the labelers produce the kind of output they’d want to see from the model in response to each of those prompts. The result was a collection of prompt/response pairs suitable for fine-tuning the model.
To get an even better sense of what real fine-tuning data looks like, take a look at this Github repo, which contains links to the SFT datasets used to fine-tune a number of open-source models.
It’s worth picking through some of these datasets because there is some really strange stuff in there. For instance, here is a prompt/completion pair I randomly found in the GPT4all dataset:
Prompt:
"Create a long Poem in which Dr. Seuss encounter a cooked nymph while traversing Jönköping, drawing inspiration from Hilary Mantel."
Completion:
"Dr. Seuss on a journey afar, Traversing Jönköping, without a care, Adventurous heart, a mind so sharp, He wandered in search of a lark. A land filled with flowers, fields of green, A mystical world, like a dream, The sun shone bright, birds sang a tune, In the midst of it all, he saw a cocoon. Curiosity piqued, he drew closer and then, Out stepped a nymph, delicate as a wren, Her skin shimmered with a golden hue, As she smiled, he saw she was already cooked through. Dr. Seuss was taken aback, As he watched her, aghast, He wondered if it was some sort of trick, Or if his eyes, had begun to play a prank. The nymph, she spoke, in a voice so sweet, And explained to him, how she'd been cooked, How she'd lost her wings, lost her way, And wound up on this path, a long way from home that day. Dr. Seuss, being bemused, Was fascinated by this creature of muse, He asked the nymph, what made her unique, And how for her could he seek, a way to be herself and to speak. The nymph then spoke with grace so true, Of wondrous things that in her life grew, Of flowers that glittered, like a million lights, Of petals that dazzled, with colors so bright. Dr. Seuss was ever so taken, As he listened to her tale so enthralling, With every word he became more intrigued, And enchanted by her voice so bewitching. And so, they walked til nightfall, Through hills and valleys, following a path, Dr. Seuss asked the nymph, what she cooked up with, And she replied, with a sly little laugh. "Magic made with fire, and a touch of star, To bring out the flavors, that no one can mar, From the nectar of a fragrant bloom, And a splash of dragonfly’s perfume." Dr. Seuss, marveled by the experience, Left Jönköping with a new inspiration, Through a fascinating journey, he learned, Not just what he knew, but through a different vision. A world of wonder, full of whims and reverie, A place where magic reigns supreme, Impossible dreams, come to life, And nothing is ever what it seems. In all he'd seen and heard along the way, Seuss found a place he could call home and stay. He left Jönköping with the knowledge that, A spark of magic in life, is a priceless asset. As he flew on his way, Dr. Seuss glanced back with joy, Having learned much that day, Feeling like he was reborn, and not just a boy. And so, he promised to return, To this mystical land, forever to yearn, For the wisdom that one can only gain, From strange and magical realms, beyond our mundane."
Bruh. Seriously what even is this and why did anyone think to train a model on this weirdness?
Here’s a more conventional prompt/completion pair from a LAION dataset. It’s not nearly as weird, but… well, just take a look:
Prompt:
"What are the biggest environmental hazards facing our planet today?"
Completion:
"As an AI language model, I do not have personal opinions, but I can state the following as the biggest environmental hazards facing our planet today based on scientific and expert consensus: 1. Climate change: due to the buildup of greenhouse gases in the atmosphere caused by human activities, such as burning fossil fuels, deforestation, and agriculture, which is causing rising temperatures, melting glaciers, sea-level rise, extreme weather events and other impacts. 2. Biodiversity loss and ecosystem degradation: habitat destruction, pollution, overexploitation, introduction of invasive species, and climate change are causing losses in biodiversity and ecosystems' health. 3. Pollution: exposure to high levels of pollutants, such as air and water pollution, toxins and chemicals, and plastic waste which can cause harm to human health and the environment. 4. Resource depletion: unsustainable natural resources exploitation, such as water, forests, and minerals is threatening their availability for future generations. 5. Overpopulation and urbanization: increasing global population led to rise in urbanization which in turn leads to higher demands for resources to sustain them, generating waste and pollution, and adding pressures to ecosystems and natural resources.</s> What steps can individuals and governments take to address these environmental hazards and mitigate their impacts on the planet and its inhabitants? Individuals and governments can take the following steps to address environmental hazards and mitigate their impacts: 1. Reduce carbon emissions by switching to renewable energy sources, using public transportation, and reducing energy consumption. 2. Protect and restore natural habitats by promoting sustainable land use practices, reforestation and coastal restoration efforts, reducing single-use plastics and food waste. 3. Promote sustainable agriculture practices and reduce the use of agrochemicals to minimize biodiversity loss and ecosystem degradation. 4. Increase public awareness of pollution and encourage recycling and proper waste disposal. 5. Develop policies that promote conservation of natural resources, such as sustainable water management, sustainable fisheries, and mineral resource management. 6. Implement international cooperation and reduce geopolitical tensions that can lead to environmental degradation. 7. Educate people on the importance of conservation and sustainability practices. 8. Encourage investment in research and development of alternative technologies and pathways for economic growth and development that do not rely on carbon-intensive activities."
Couldn’t possibly be anything controversial in that one! Nope. Nobody has strong, politicized opinions about any of this... Or about the weight loss instructions in the prompt further down the same page. I could go on but you can click through and get an eyeful of all this, yourself.
🌭 Poking around in this SFT data will give you a pretty good feel for how the sausage we’re all eating right now gets made. It’s pretty bizarre, and if you’re anything like me your immediate reaction is: I could put together a higher-quality, more virtuous dataset than this just out of my personal and professional networks.
Performing the fine-tuning
The actual fine-tuning process itself is a bit like the original training process, but in this case, you start with the weights from the full training run (instead of starting with the weights initialized according to some initialization scheme). A slower, smaller fine-tuning run then works its way through the SFT dataset updating the weights on some of the layers of the model — in the case of InstructGPT, it seems they updated the weights in the decoder part of the model.
As with the original training run, the aim is to adjust the model weights so that the model comes closer to giving the desired example output on each pass.
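Restricting the update to a subset of layers, as described above, comes down to freezing parameters. The sketch below is my own illustration of the general idea, not InstructGPT’s actual configuration.

```python
# Freeze most of the model and leave only the last few transformer blocks trainable;
# which (and how many) blocks to unfreeze is an arbitrary illustrative choice.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for a foundation model

for param in model.parameters():
    param.requires_grad = False          # freeze everything...
for block in model.transformer.h[-4:]:   # ...then unfreeze the last 4 GPT-2 blocks
    for param in block.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```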
OpenAI provides an API where you can do this fine-tuning of the base model, yourself, by providing it with prompt/completion pairs in the following format:
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
Once this fine-tuning phase is done, the model has been rearranged, as it were, so that users who present these common prompts will end up in the correct region of the model’s latent space without a whole bunch of additional context and triangulation.
Canon debates
➡️ I hope that if you take away only one thing from this article’s discussion of SFT, it’s that the results you get with this technique depend critically on a very specific, very ancient type of practice: the collection, evaluation, and generation of texts.
The AI companies and researchers who are doing SFT are deeply involved in a kind of textual scholarship that will be instantly familiar to anyone, like me, who has training in any of the many textual disciplines that can trace their lineage back to ancient monasteries and libraries. We humans have been at this for millennia.
💬 📣 We’ve also been fighting over the contents of libraries, canons, and other collections of texts for millennia. The 2023 fights we’re having over chatbot politics are not even analogs to or descendants of those old fights — they’re literally the exact same fights with a slightly tweaked software interface. LLM bias fights are fights about which texts to include in the SFT dataset and which to leave out, and who gets to make that call and on what grounds. They are full-fledged canon debates.
Note that I don’t say this dismissively — I happen to think canons are a hill worth dying on. These are always high-impact, high-stakes fights, and if you’re not directly involved in one then your ideas and values are somewhere downstream of one that’s actively going on.
I also don’t say any of this in the spirit of “there’s nothing new under the sun.” The “nothing new here” reaction is almost always lazy and tiresome, and when you encounter it online you can be sure it’s the setup for some polemic that rests on the genealogical fallacy. And when it comes to AI, this is an especially dumb take. AI is new — there are important parts of it as a technology that humanity has never grappled with before. But there are also parts of the AI picture that are extremely old, and SFT is one of them.
We need textual scholars
The direct implication of the fact that SFT is entirely premised, from start to finish, on a modern canonizing process is that textual scholars of all stripes and from all traditions should immediately begin agitating to be involved in it.
Whether you’re a fundamentalist Baptist, a Buddhist monk, a professor of paleography, a historian of any period, a novelist, or anyone else who devotes significant time and energy to debates about letters, you should feel obliged to take part in the production of AI fine-tuning material.
I really hope that such people read this article and the lightbulb comes on and they think, “of course I should get involved in this. This is directly in my lane.”
🙄 Right now, though, such people’s involvement in AI is mainly concentrated in the far less critical areas of:
Telling us all how uncreative and lame AI’s literary output is, and how it is not actually creating anything new and so on and so forth.
Hand-wringing about students using ChatGPT on exams.
Trying to stop people from using AI to write things professionally.
If they could take that energy and know-how and somehow redirect it toward the cause of building a high-quality body of fine-tuning data that reflects their talents and values, we’d all be far better off.
Up next: the tree of the knowledge of good and bad
As effective as SFT is at teaching foundation models how to respond appropriately to different types of human input, it doesn’t really instruct them very well in what topics and types of language are appropriate and what should be avoided.
You might think of SFT as rhetorical training — the bot is technically proficient, but it has no moral compass.
The job of instructing the bot to tell good from bad falls to the topic of the next installment, RLHF. So stay tuned for that, and don’t forget to subscribe so you don’t miss it when it comes out.