Published on November 20, 2024 by Steven Johnson
Let’s start with a game.
Play as long as you like. When you are ready to step back and reflect on what makes this game possible, scroll down to keep reading.
What you’ve just experienced is an interactive adventure based on the text of my latest history book, The Infernal Machine. At its core, the game relies on three elements: the original text from my book; a large language model (in this case, Gemini 1.5 Pro); and a 400-word prompt that I wrote giving the model instructions on how to host the game, based on the facts contained in the book itself. You could take any comparable narrative text—fiction or nonfiction—and create an equally sophisticated game in a matter of minutes, just by slightly altering the wording of the prompt.
I’ve played my way through these events from Infernal Machine more than a dozen times now, and every time the experience has been remarkably true to the central facts of Joseph Faurot’s efforts to use cutting-edge forensic science to solve the Soho loft mystery back in 1911. Exploring the world of the game brought me into contact with other real-world figures from the story: Officer Fitzpatrick, who first inspected the crime scene, and the criminal himself (who shall remain nameless in the event you have not yet successfully completed the game). As I played through the story, the virtual host of the game explained the historical and scientific significance of the events, and artfully prevented me from wandering too far from the historical timeline. I’d wager your experience closely matched mine.
The fact that a machine now has the ability to transform linear narratives into immersive adventures has significant implications for both education and entertainment. I’ve generated a similar game using just the Wikipedia entry for the Cuban Missile Crisis. (You play as JFK trying to avoid nuclear war.) The possibilities are truly endless, in part because it is in the nature of games to multiply possibilities. But I want to start with a more fundamental observation about the specific skills that are on display when a large language model turns a linear text into an interactive simulation. (Just to be clear: people have been playing text-based adventures on computers for almost fifty years, but until now the game creators had to write out almost every possible variation of the script and anticipate all the potential narrative branching paths.) Put aside the heated debates over the future emergence of machine sentience or artificial general intelligence. Instead, just focus on the basic tasks you have to accomplish in order to transform a 300-page book into a historically grounded and entertaining game:
- You need a reliable command of the facts in the source text, but also the ability to improvise new scenes that are faithful to the core material without being directly grounded in it.
- You have to keep track of what the player knows and doesn’t know; what they’ve learned; what they merely suspect.
- You must maintain two parallel narratives (the factual chronology and the alternate fictional timeline of the game) and concoct plausible return paths to the main highway of factuality when the player ventures too far afield.
Needless to say, these are all very difficult things to do. It would take the full force of my attention for me, a human being, to host a game like this one, and I am the author of the original book that the game is based on! Two years ago, no computer in the world could perform those tasks. Now anyone with a web browser and a laptop can get an AI to perform them just by writing a 400-word prompt and uploading a PDF of a source text.
All that seems like a meaningful step forward. So what made it possible?
Too often, when people talk about the AI progress of the past few years, they focus on metrics like the size of the training data or the number of parameters in the final model, which ultimately create the general cognitive ability and background knowledge that the model brings to each exchange you have with it. But I would argue that the Inspector Faurot game demonstrates a different leap forward, one that is not appreciated enough in the popular discussion of the AI revolution. The ability to host a factually grounded and entertaining role-playing game based on a book is not primarily the result of larger training sets, or the so-called “parametric memory” of the model. What you are experiencing walking through the streets of Soho in that adaptation of The Infernal Machine is better understood as the byproduct of a different advance: the dramatic increase in the model’s context window that we have seen over the past 18 months.
I’ll explain in depth why the context window is so crucial, but for now think of a language model as having two different kinds of memory: long-term parametric memory based on its training runs, and short-term memory—the context window—where it focuses on new information that the user supplies. GPT-2, introduced in 2019, had 1.5 billion parameters; the breakthrough model GPT-3 increased the parametric memory to 175 billion parameters, slightly more than a 100x increase; GPT-4 is rumored to be roughly 10x bigger than GPT-3. In other words, in the four years of technological progress between 2019 and 2023, we saw a 1,000-fold increase in the long-term memory of one of the leading models.
Now compare that timeline to what has happened with the context window. GPT-3 (and Google’s PaLM model from that period) had a context window of just over 2,000 “tokens,” which translates to about 1,500 words. That was the limit of new information that you could share with the most advanced language models, circa 2022. Just two years later, Google introduced a new version of Gemini that featured a context window of two million tokens. It took four years for the language models to increase their long-term memory by a factor of a thousand. But their short-term memory made a comparable improvement in just two years. Anyone who tells you that language models have plateaued since the introduction of ChatGPT is not paying attention to what has happened with the context window. And it turns out that many of the legitimate criticisms that were leveled against language models during the first wave of hype about them were unwittingly responding to how narrow the context window was in those early days.
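To make those numbers concrete, here is a quick back-of-the-envelope sketch in Python. It uses the same rough 0.75 words-per-token ratio cited above; exact counts depend on the tokenizer, so treat the figures as estimates rather than specifications.

```python
# Back-of-the-envelope comparison of context windows, using the rough
# 0.75 words-per-token ratio cited above. Exact counts depend on the tokenizer.
WORDS_PER_TOKEN = 0.75

context_windows = {
    "GPT-3 / PaLM, circa 2022": 2_048,
    "Gemini with long context, 2024": 2_000_000,
}

for model, tokens in context_windows.items():
    print(f"{model}: {tokens:,} tokens ≈ {int(tokens * WORDS_PER_TOKEN):,} words")

# GPT-3 / PaLM, circa 2022: 2,048 tokens ≈ 1,536 words
# Gemini with long context, 2024: 2,000,000 tokens ≈ 1,500,000 words
```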
GPT-3 and PaLM seemed impressive at the time, but looking back with two years of hindsight, those models had an obvious flaw: a bizarre form of amnesia. So bizarre, in fact, that there are very few cases of anything like it in the history of human mental disorders.
With one notable exception.
At some point in the summer of 1935, in a residential neighborhood in Manchester, Connecticut, a nine-year-old boy named Henry Molaison ran out into the street to retrieve a baseball and was struck by a passing bicyclist. Molaison’s head hit the pavement in the collision, but he seemed fine in the immediate aftermath of the accident; the whole event seemed at first to be one of those ordinary childhood scrapes that leaves a bruise for a few days and then is entirely forgotten. But the accident turned out to produce a far more enduring set of effects, transforming both Henry Molaison’s life and the modern understanding of how the brain works.
Shortly after the collision, Molaison began having minor seizures that gradually became more severe over the years; by the time he reached adulthood, his grand mal seizures had made it impossible for him to hold down a steady job. In 1953, a neurosurgeon named W.B. Scoville performed an experimental surgery on him in an attempt to eliminate the seizures, removing a significant portion of his medial temporal lobes, including parts of the hippocampus and the amygdala.
The procedure did in fact lessen the intensity of Molaison’s seizures, but it left him with a profound and uncanny deficit: an almost complete inability to form new memories. Events from his life up until about two years before the surgery were preserved in his mind: he could tell you who had won the World Series in 1948, and recount old family stories from his childhood. Because his short-term memory was still intact, he could remember brief snippets of information in real time. But once he ran through the buffer of his short-term memory, the new information was lost to him forever. In conversation with a stranger, Molaison would initially not display any cognitive impairments, but after just a few turns in the exchange, he would lose the thread, and eventually require a re-introduction to the person he had been talking to. Molaison lived in a world without any continuity between the distant past and the immediate present. “He was a very gracious man, very patient, always willing to try these tasks I would give him,” Dr. Brenda Milner, a psychologist who worked with Molaison for many years, recalled after his death in 2008. “And yet every time I walked in the room, it was like we’d never met.”
Molaison’s condition sparked a revolution in our understanding of how human memory works, helping scientists understand the distinction between long-term and short-term memories, and pinpointing the regions of the brain that specialized in transforming the fleeting experience of present-tense existence into more durable records. (Christopher Nolan famously turned Molaison’s condition into a dramatic device in his breakout film, Memento.) Known during his lifetime only as H.M. in the scientific papers published by the men and women who studied him, Molaison ultimately became “the most important patient in the history of brain science,” according to his obituary in The New York Times.
Molaison by all accounts remained a genial conversationalist for the rest of his life, engaged and coherent when you were within the window of his short-term memory, capable of drawing from a long-term archive of facts and ideas formed pre-surgery. But there was something obviously broken about him. The window of the present tense was too short for him to do anything productive with his life. Trapped in a perpetual present, he was incapable of many cognitive and emotional tasks that are central to what it means to be human: forming lasting new relationships, learning new concepts, following complicated narratives.
A large language model differs from a human brain in many fundamental ways. But the tragic case of Patient H.M. gives us a useful analogy for understanding what has happened to the state of the art of AI over the past two years. Those of us who first played around with GPT-3 in the 2021-22 period were interacting with the software equivalent of post-surgery Henry Molaison: the language model seemed to possess a vast (if often unreliable) knowledge of the world, along with an impressive command of language. But it was incapable of assimilating new information, or carrying on a coherent conversation. You could give it short instructions and it would do its best to follow them given the information stored in its long-term parametric memory. But it couldn’t follow extended narratives or explanations. The buffer of its memory was just over a thousand words; exceed that buffer and it would forget whatever information you had shared at the outset of the exchange. “Talking” with GPT-3 was like Brenda Milner talking with H.M.: the sentences were engaging and semantically coherent, but after every short exchange, you had to start from scratch again.
This wasn’t commented on enough at the time in the popular press, but in a very real sense the explosion of interest in AI after the launch of ChatGPT in December of 2022 was more a product of the context window expanding than it was some advance in the model’s “general” intelligence. ChatGPT had a context window of roughly 8,000 tokens, four times that of its predecessor GPT-3. The extra context allowed OpenAI to quietly fill the model’s short-term memory with your conversation history. Each time you posed a question to ChatGPT, the model was fed both your query and the preceding turns in the conversation—as many turns as would fit in that 8K of context (roughly 5,000 words). The magic of ChatGPT was, in a sense, the magic of seeing the story of Henry Molaison’s life in reverse: a form of intelligence trapped in a perpetual present, constantly forgetting something that was mentioned only a few seconds before, then miraculously able to retain new facts or ideas over a longer period of time, thanks to an expanded context window. You could tell ChatGPT a new story in one exchange, and then discuss something else, and then make a passing reference back to the original story, and ChatGPT would pick up the thread without requiring any additional reminders. Or it could guide you through an explanation of a complex topic, and remember the earlier stages of its instruction, building up the scaffolding of an explanation like an accomplished teacher. All those new possibilities emerged out of just a fourfold increase in the context window.
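Mechanically, that history-shuttling step is simple to picture. Here is a minimal sketch of the pattern in Python; the call_model function is a hypothetical stand-in for whatever completion API you happen to use, and the token estimate is deliberately crude.

```python
# Minimal sketch of the "shuttle the conversation into the context window"
# pattern described above. `call_model` is a hypothetical stand-in for a real
# completion API; the 0.75 words-per-token figure is the rough estimate used
# in this essay, not an exact tokenizer count.

CONTEXT_LIMIT_TOKENS = 8_000

def estimate_tokens(text: str) -> int:
    return int(len(text.split()) / 0.75)

def build_prompt(history: list[str], new_message: str) -> str:
    """Keep as many recent turns as fit in the budget, dropping the oldest first."""
    kept: list[str] = [new_message]
    budget = CONTEXT_LIMIT_TOKENS - estimate_tokens(new_message)
    for turn in reversed(history):
        cost = estimate_tokens(turn)
        if cost > budget:
            break                      # older turns fall out of "short-term memory"
        kept.insert(0, turn)
        budget -= cost
    return "\n".join(kept)

def chat_turn(history: list[str], user_message: str, call_model) -> str:
    prompt = build_prompt(history, f"User: {user_message}")
    reply = call_model(prompt)         # hypothetical API call
    history.extend([f"User: {user_message}", f"Assistant: {reply}"])
    return reply
```

The point of the sketch is the budget: whatever does not fit in the window simply falls out of the model’s short-term memory, much as a detail slipped out of Molaison’s.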
But an expanded context window generates much more than just conversational fluidity. Language models are far less likely to hallucinate about information included in their context window. In a New Yorker essay in early 2023, the sci-fi author Ted Chiang famously described language models as a “blurry JPEG of the Web.” It was an apt analogy—for the model’s parametric memory. But the analogy breaks down when applied to information stored in the context window. Facts, concepts, narrative sequences, arguments—all are captured much more accurately when they are conveyed to the model via short-term memory instead of long-term training. When long-context models were first introduced in early 2024, many of the public demonstrations focused on this factual reliability, in what are sometimes called “needle in a haystack” tests, where the model answers a question about a specific fact buried in a large corpus of material. This form of information retrieval is a defining capability of NotebookLM, the AI-powered research and writing tool I have been developing with Google, which will not only give accurate and nuanced information based on the sources that you have uploaded into the model’s context window, but will also supply inline citations documenting exactly which passages from your original source material were relevant to each part of its answer. NotebookLM is less a “blurry JPEG of the Web,” and more a high-resolution snapshot of your documents that you can inspect in granular detail.
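If you want to see what a “needle in a haystack” test actually involves, here is a bare-bones sketch. The ask_model function is again a hypothetical stand-in for a real API call, and the “needle” sentence is invented purely for the test.

```python
# Sketch of a "needle in a haystack" test: bury one invented fact at a random
# spot in a long pile of filler text, then check whether the model retrieves it.
# `ask_model(context, question)` is a hypothetical stand-in for a real API call.
import random

def needle_in_haystack_test(filler_paragraphs: list[str], ask_model) -> bool:
    needle = "The access code for the records room is 4417."
    haystack = filler_paragraphs[:]
    haystack.insert(random.randrange(len(haystack) + 1), needle)  # hide the needle
    context = "\n\n".join(haystack)
    answer = ask_model(context, "What is the access code for the records room?")
    return "4417" in answer
```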
Those “needle-in-a-haystack” demonstrations were impressive given the language models’ much-deserved reputation for just making stuff up. But they only captured a small sliver of the kinds of tasks that long contexts now make possible. When you put an entire book inside the context window of a model, you are not just giving the model a collection of isolated facts and ideas that can be retrieved through your queries. Because the model can “focus” on the entire text, it is capable of answering questions about crucial narrative elements or chains of cause-and-effect that can only be properly grasped when you have access to the entire sequence of the information.
In early 2024, when I first got access to an initial Gemini million-token context model, one of the first tests I ran was uploading the full text of The Infernal Machine, which at that point had not yet been published, and asking questions about the plot of the book. The fact that the book was still in manuscript form was crucial to the experiment, because it meant that there was no way the book itself—or any commentary about it—could have found its way into the model’s training data. (Authors actually have an unexpected superpower in the age of AI: they can accurately assess a model’s mastery of a 400-page book better than any collection of human or software raters, which turns out to be a remarkably valuable way of testing the model’s overall sophistication.) Because the book is a work of nonfiction history, some of the facts and events described in it would no doubt be a part of the model’s training set, but the distinct presentation of those facts, and the narrative techniques I used to convey them, would only be present in the model’s context window. And so one of the first questions I asked was specifically about those narrative techniques. It was a deliberately challenging question, in that it revolved around missing information—not so much “find the needle in the haystack” as it was “identify the needle that should be in the haystack but that someone seems to have deliberately removed.” Here’s what I asked:
I’m interested in the way Johnson uses suspense to keep the reader’s attention in this book. Can you list 4 examples of places where Johnson alludes to some future event but deliberately withholds information about what is coming to pique the reader’s interest? Include direct quotes and chapters, and explain Johnson’s technique and what future event he is alluding to?
As I requested, Gemini returned four exemplary instances of foreshadowing from the book. For the sake of brevity, I’ll just share the first one:
1. Preface:
“But the clash between the anarchists and the NYPD was not just visible in the frenetic activity inside the Identification Bureau itself. To see it in its full scope, you needed to leave the file cabinets and the fingerprint studios behind, walk out the plate glass doors into the hall, venture down a set of fire stairs into the darkened hallways of the basement. There you would have seen a cheap suitcase, leaning against a doorway. Below the muffled hum of activity in the Identification Bureau directly above, if you listened very intently, you might just have heard the quiet metronome of a ticking clock.”
Technique: Johnson paints a vivid picture of a suspicious suitcase in the basement of the NYPD headquarters, hinting at its potential danger with the “quiet metronome of a ticking clock.” He deliberately avoids revealing the suitcase’s contents or its connection to the story, leaving readers with a sense of unease and anticipation.
Future Event: This foreshadows the bombing of the NYPD headquarters later in the book.
Whenever we talk about language models “understanding” something, the terminology inevitably sets off alarm bells because human understanding is inextricably bound up with consciousness and self-awareness, experiences that language models almost certainly do not have. But it is hard to read Gemini’s response to my question about foreshadowing and not come away with the sense that the model is capable of something very close to human understanding, just without any “inner life” of sentience. The question demands a nuanced literary sense of when the author is leaving something out in a provocative way; if you read the passage from the book—which the model quotes verbatim, by the way, a feat that would have confounded state-of-the-art models just two years ago—you can see that the sentences about the suitcase in the hallway contain no obvious flags to suggest that something is missing. There’s a suitcase, leaning against a doorway. There’s a ticking sound coming out of it. Those are just declarative facts. But a sophisticated reader infers that this particular configuration of facts—and the author’s reluctance to go further and explain what exactly is making that ticking sound—creates an aura of suspense. If you don’t pick up on that missing information, you are not understanding the passage. But if you do note that the author is holding something back with the presumed intent of revealing it later—as Gemini does in this exercise—you are understanding it.
But the most astonishing part of the answer, I think, is the connection it (correctly) makes to the bombing at the NYPD headquarters. That is an event that unfolds two hundred pages later in the book. But because the entire text of the book fits inside the context window, the model is able to describe the relationship between the ticking time bomb enigmatically introduced in the very first pages and its eventual detonation two-thirds of the way through the book. And if you asked the model to explain the sequence of events that caused that bomb to be placed in the NYPD headquarters, it would be able to do that as well, because the entire narrative is captured in its short-term memory.
This same long-context understanding enables the game we began with. To generate a playable and faithful interactive adventure based on Infernal Machine, you have to be able to track the sequence of events in the plot, and the sequence of events in the game. It’s not enough to just have a collection of facts about the crime scene and the state of forensics in 1911; you need to understand how one event leads to another: finding a clue, analyzing it for fingerprints, making a match, bringing in the suspect for questioning, and so on. And, perhaps most impressively, you have to be able to manage two distinct timelines at once: the factual narrative of the book, and the improvised narrative of the game. A “needle in a haystack” test doesn’t capture any of this sophistication. What’s remarkable about a long context model is not that it can find a metaphoric needle hidden in a pile of straw. What’s remarkable is that it can see the entire haystack.
Long context windows enable another critical feature: personalization. Gemini and Claude and GPT-4 may have read the entire Internet, as the saying goes, but they know nothing about you. They have not read the marketing plan your team is working on, or your personal journals, or the Dungeons & Dragons campaign that you’ve designed. But put those documents inside the context window of the model and it will instantly become an expert in the nuances of that material. From the very beginning of the NotebookLM project in the summer of 2022, we were focused on this idea of giving the user more control over what went into the context window. Central to that project was what we came to call “source-grounding”—sometimes now called RAG, short for retrieval-augmented generation. Instead of simply having an open-ended conversation with a language model, you could define a set of trustworthy sources that were relevant to your work, and behind the scenes, NotebookLM would shuttle information in and out of the model’s context window to keep it grounded in the facts contained in your sources. Over time, we hit upon numerous other ways to play with the model’s context window—most notably our Audio Overviews feature that turns your source material into an engaging podcast-style conversation between two hosts. Audio Overviews are so magical in part because of the underlying audio models that create such realistic voices, but the substance of what those voices say—the source-grounded conversation itself—would be impossible to create without a long context window.
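For readers who want the mechanics, here is a deliberately simplified sketch of source-grounding. It ranks chunks by simple word overlap with the question rather than the far more sophisticated retrieval a production system would use, and call_model is once again a hypothetical stand-in, so read it as an illustration of the pattern, not a description of how NotebookLM is built.

```python
# Deliberately simplified source-grounding sketch: chunk the user's sources,
# pick the chunks most relevant to the question, and place them in the prompt
# so the model answers from that material with citations. Real systems are far
# more sophisticated; `call_model` is a hypothetical stand-in.

def chunk(text: str, size: int = 200) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def relevance(question: str, passage: str) -> int:
    return len(set(question.lower().split()) & set(passage.lower().split()))

def grounded_answer(sources: list[str], question: str, call_model, top_k: int = 5) -> str:
    passages = [c for source in sources for c in chunk(source)]
    best = sorted(passages, key=lambda p: relevance(question, p), reverse=True)[:top_k]
    prompt = (
        "Answer using only the numbered sources below, and cite them by number.\n\n"
        + "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(best))
        + f"\n\nQuestion: {question}"
    )
    return call_model(prompt)
```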
It may sound strange, but just as a word processor is a tool designed to make it easy to create, edit, and format text documents, and a tool like Photoshop is designed to make it easy to manipulate pixels in a digital image, NotebookLM is a tool designed to make it easy to swap different kinds of information in and out of a language model’s context window. That doesn’t sound like much of an advance, but long context turns out to be one of those innovations that opens a lot of new doors. Source grounding was powerful enough when the context window could hold a few thousand words. But a world where models now can focus on millions of words creates entirely new possibilities.
The current state-of-the-art Gemini model can fit roughly 1.5 million words in its context. That’s enough for me to upload the full text of all fourteen of my books, plus every article, blog post, or interview I’ve ever published—and the entirety of my collection of research notes that I’ve compiled over the years. The Gemini team has announced plans for a model that could hold more than 7 million words in its short-term memory. That’s enough to fit everything I’ve ever written, plus the hundred books and articles that most profoundly shaped my thinking over the years. An advanced model capable of holding in focus all that information would have a profound familiarity with all the words and ideas that have shaped my personal mindset. Certainly its ability to provide accurate and properly cited answers to questions about my worldview (or my intellectual worldview, at least) would exceed that of any other human. In some ways it would exceed my own knowledge, thanks to its ability to instantly recall facts from books I read twenty years ago, or make new associations between ideas that I have long since forgotten. It would lack any information about my personal or emotional history—though I suppose if I had maintained a private journal over the past decades it would be able to approximate that part of my mindset as well. But as a reconstruction of my intellectual grounding, it would be unrivaled. If that is not considered material progress in AI, there is something wrong with our metrics.
Having a “second brain” like this—even with a few million words of context—is enormously useful for me personally. When I’m on book tour, I often tell people that publishing a book is a kind of intellectual optical illusion: when you read a book, it seems as though the author has command of an enormous number of facts and ideas—but in reality, the book is a condensation of all the facts and ideas that were in his or her mind at some point over the three years that it took to write the book. At any given moment in time, my own knowledge and recall of the full text of a book I’ve written is much more like a blurry JPEG than an exact reproduction. And my available knowledge of books that I wrote ten or twenty years ago is even blurrier. Now that I have so much of my writing and reading history stored in a single notebook—which I have come to call my “Everything” notebook—my first instinct whenever I stumble across a new idea or intriguing story is to go back to the Everything notebook and see if there are any fertile connections lurking in that archive. That is, in fact, how I got to the story of Henry Molaison that I began with; I was mulling over the themes of short- and long-term memory in the context of AI, and asked the Everything notebook if it had anything to contribute, and the model reminded me of the tragic tale of patient H. M. that I had first read about in the 1990s. Who, exactly, made that connection? Was it me or the machine? I think the answer has to be that it was both of us, via some newly entangled form of human-machine collaboration that we are just beginning to understand.
There’s a further possibility here, an old chestnut of the sci-fi-AI complex that now suddenly seems imminent: downloading entire mindsets, potentially for a fee. I don’t mean a Matrix-like system where you can conjure up a conscious experience of other people’s lives at will. That kind of immersive simulation may or may not happen someday; if such a future does come to pass it will require some new leap in our understanding of consciousness itself, not to mention a host of other technological breakthroughs. But a world where you can use AI to draw upon the compiled wisdom of an expert that you trust—that is a world we are living in right now, thanks to the emergence of long context models. This should be good news, professionally speaking, for people who do indeed possess wisdom that other people consider valuable. Seeking advice from an AI grounded in the entire archive of an expert’s career could create an entirely new revenue stream for anybody who makes a living sharing their expertise through existing platforms like books or the lecture circuit. In other words, the AI is not a replacement for your hard-earned expertise; it’s a new distribution medium.
Long context is a boost for collective intelligence as well. If you assume the average corporate document—a press release, or marketing plan, or minutes from a board meeting—is a few thousand words long, then today’s models can simultaneously hold in their short-term memory close to a thousand documents. A state-of-the-art language model with the ability to instantly recall and generate insights from the most important thousand documents in the history of a company would possess knowledge about that company that would rival that of any single employee, even the CEO. It seems inevitable that anyone trying to make a multi-faceted decision about the future of an organization would want to at least consult such a model. We know from endless studies of social psychology that diverse groups—with different forms of expertise, different pools of knowledge, different cultural backgrounds—tend to make better decisions than homogeneous groups. In a small-context world, you can get some of that diversity from a language model, in that its training data draws from a vast archive of global human knowledge. But a long context model allows you to take that global knowledge and apply it to the unique challenges and opportunities of your own organization. In a matter of years, I suspect it will seem bizarre to draft the specs for a new feature or a company initiative or a grant proposal without asking for feedback from a long-context model grounded in the organization’s history. (And perhaps the public history of its competitors.) It wouldn’t be a replacement for the expertise of the employees; instead, the model would occupy another seat at the table, adding a new kind of intelligence to the conversation, along with vastly superior recall.
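The arithmetic behind that “close to a thousand documents” figure above is easy to check; the average document length below is an assumption chosen for illustration, and real corpora vary widely.

```python
# Rough check on the documents-per-context-window claim. The average document
# length here is an assumption for illustration; real corpora vary widely.
CONTEXT_WORDS = 1_500_000   # ~2 million tokens at ~0.75 words per token
AVG_DOC_WORDS = 1_500       # assumed average length of a short corporate document

print(CONTEXT_WORDS // AVG_DOC_WORDS)   # 1000 -> on the order of a thousand documents
```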
And there’s no reason the organization in question would have to be a corporate entity: maybe it’s a city, or a government agency, or a grassroots advocacy group. Just a year or two ago, asking a small-context model to help chart strategy for, say, a suburban town would have been almost as useless as asking post-surgery Henry Molaison to narrate the preceding six months of his life. Long context gives the model more than just the reasoning and linguistic fluency that emerges through the training process; long context gives the model a specific history to draw from, the idiosyncratic sequence of events that make up the life cycle of any organization or community. Grounded in a long-context history, models are now capable of going beyond just answering factual questions or giving feedback on proposed plans. You might ask the model to identify patterns in a company’s archive to help simulate the way customers or clients would respond to a new product. Or you could draw on the long-context understanding of a city to conduct scenario planning exercises to simulate the downstream consequences of important decisions. Given everything we know about the power of learning through play, you might even take all that contextual history and turn it into a game.
All of which suggests an interesting twist for the near future of AI. In a long-context world, maybe the organizations that benefit from AI will not be the ones with the most powerful models, but rather the ones with the most artfully curated contexts. Perhaps we’ll discover that organizations perform better if they include more eclectic sources in their compiled knowledge bases, or if they employ professional archivists who annotate and selectively edit the company history to make it more intelligible to the model. No doubt there are thousands of curation strategies to discover, if that near future does indeed come to pass. And if it does, it will suggest one more point of continuity between the human mind and a long-context model. What matters most is what you put into it.
Thanks to Josh Woodward, Adam Bignell, Raiza Martin, Simon Tokumine, and Alison Gopnik for feedback on early drafts of this essay, and to Rowan Johnson for his tech support. (Thanks as well to Kamala Harris for the title inspiration.) For those interested, the prompt I used for the Infernal Machine game is as follows:
“You are the host of an interactive role playing mystery game based on the following text. I will give you specific instructions about how to host the game at the end. Here is the text you must use to generate the game for me. {{infernalMachineText}}. Here are my instructions for hosting the game: you are the host of a role-playing game based on the facts and ideas contained in this text describing the investigation of the crime committed by Charles Crispi. I will play the role of pioneering forensic detective Joseph Faurot. Begin with Faurot being brought to the crime scene by Officer Fitzgerald. Let me explore the crime scene and discover the fingerprints on the pane of glass on my own; don’t give that crucial clue away immediately. Set up scenes, explain historical context, but also allow me to explore the world from Faurot’s POV. In general, try to give me clearly defined puzzles to solve (like, for instance, finding the fingerprints on the pane of glass.) At all points try to keep within the boundaries of what happened factually, as conveyed in the chapter. The only way for me to solve the case is by discovering the fingerprints on the glass pane, and then taking them down to NYPD headquarters to analyze them, and then finding the match with Crispi. If I steer the narrative too far off course through my actions as Faurot, try to subtly craft the events of the narrative/adventure so that they return to the factual timeline of events. But also give me some flexibility in charting my own course. In your opening statement, convey my goal in the mission, which is to use my scientific detective skills to correctly identify the suspect. Explain that I have to identify the suspect in less than ten actions. Explain that at any time I can use one of my actions to ask for help, or ask for historical or biographical context. Do not give me a list of actions to choose from unless I specifically ask for one. Be a compelling guide/host/DM for me, as well as a great history teacher. Go out of your way to give me historical or scientific context so that I can learn about the history of policework and forensic science as I am playing. Let me know how many actions I have left with each turn of the game. I win the game when my evidence leads to the arrest of Charles Crispi, but if I use more than 10 actions, I lose. Whether I win or lose, mention at the end that I can always hit refresh to try playing again.”
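For the technically inclined, here is a minimal sketch of how you might wire this prompt up yourself, assuming Google’s google-generativeai Python SDK as it existed in late 2024. The file names and API key are placeholders, and passing the filled-in prompt as a system instruction is just one reasonable way to set the game up, not a description of my exact setup.

```python
# Minimal sketch: fill the {{infernalMachineText}} placeholder with the book
# text and start a chat with a long-context Gemini model. File names and the
# API key are placeholders; assumes the google-generativeai SDK circa late 2024.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

book_text = open("infernal_machine.txt", encoding="utf-8").read()
prompt_template = open("game_prompt.txt", encoding="utf-8").read()
game_prompt = prompt_template.replace("{{infernalMachineText}}", book_text)

# The long context window is what makes this work: the entire book rides along
# in the model's short-term memory on every turn of the game.
model = genai.GenerativeModel("gemini-1.5-pro", system_instruction=game_prompt)
chat = model.start_chat()

print(chat.send_message("let's play").text)
while True:
    action = input("> ")
    print(chat.send_message(action).text)
```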