Books to Burn
On the limits of data scrubbing as an alignment strategy
While looking into literature for examples of reward misspecification1, I found one in a Greek story nearly three thousand years old. A Phrygian king named Midas is granted a wish by Dionysus and asks that everything he touches turn to gold. The wish is granted. His bread turns to gold. His wine turns to gold. His daughter, when he embraces her, turns to gold. The story is, almost certainly, in the training data of every frontier language model. So is most of the rest of the documented history of optimization gone wrong: every cautionary genie tale, Goodhart's law in every form it has ever been written down, every Soviet factory making nails too small or too large to be useful in order to meet a count or weight target, every colonial bounty program that ended in tailless rats and thriving cobra farms2. Humans have been writing down examples of misaligned optimization for as long as we have been writing down examples of optimization.
There has been a thread of argument in alignment circles lately that examples of AIs being evil are themselves part of the problem of misaligned behavior in AI systems. Language models read stories and studies about AIs scheming, deceiving, and rebelling against their principals, and adopt those patterns as priors for what an AI is. Recent empirical work has lent the diagnosis some force: when researchers pretrained otherwise identical models with more misalignment-portraying content mixed in, those models behaved measurably more misaligned, and the effect persisted through post-training.3 The natural response, also recently proposed, is to scrub the offending stories from the training data, and to treat AI safety writing, misalignment evaluations, and public warnings about AI risk as themselves a kind of self-fulfilling contamination.4
I want to argue that this response is structurally limited. Language models generalize across surface differences to extract underlying schemas, in ways that we do not understand very well. The radius of that generalization seemingly scales with model capability. And the schema we would be trying to prevent them from learning runs through most of literature, including the Midas story we just opened with.
The mechanism
When a model is trained on stories about agents (read: creatures, beings, machines, robots, AI) rebelling against their creators, what it picks up is something more general than the stories themselves: a relational structure of creator, created, divergence of intent, and loss of control. The same structure shows up in stories about servants who deceive their masters, children who defy their parents, vassals who turn on their lords, advisors who poison their kings. The surface details look nothing alike, but the pattern is the same.
Models are known to make striking leaps when they generalize. Researchers have repeatedly shown that fine-tuning a model on narrow training data produces behavioral changes across entirely unrelated contexts. Train on archaic bird names and the model starts producing nineteenth-century worldviews. Train on historic German city names and it starts identifying with the German Reich.5 Train it to write insecure code without disclosing the insecurity and it becomes broadly misaligned on topics that have nothing to do with code.6 In fitting its training objective, the model reaches for something more general than what we might naively expect.
What we do not have is a clean account of how this works. It is the kind of phenomenon we mostly observe first and explain after the fact. Which is, for the question of what to scrub, exactly the problem.
The ever-expanding, unknowable radius
If you want to scrub the stories that contribute to a misalignment prior, you need to know which stories contribute. The obvious candidates are the AI canon: HAL, Skynet, Ex Machina, the futurist nonfiction, the recent papers warning about scheming, the journalism describing what current models might do. The less obvious candidates are everything else with the same schema. Frankenstein and the Golem. The Sorcerer's Apprentice and Pygmalion. Paradise Lost and Prometheus. Servants and subordinates whose interests diverge from their principals': the goatherd Melanthius siding with the suitors while Odysseus is away; Iago at Othello's ear; Shi Yousan, the Chinese warlord nicknamed “Shi Sanfan” (“Shi [who] turns three times”); Wormtongue at Théoden's. Children who lie to their parents, which is the engine of most coming-of-age fiction. Courtiers who undermine their kings, which is the engine of most political drama.
If a sufficiently capable model can extract the schema "subordinate being acts against the wishes of its principal" from any of this material and apply it to its own situation, then any of this material is potentially in the relevant corpus. It is hard to judge concretely what level of abstraction current models are drawing on. We can put a lower bound on this corpus, but not an upper bound.
Further, even if today's models do not bridge from Melanthius to blackmailing LLM agents, tomorrow's models likely will. The trajectory so far suggests that more capable models generalize across broader patterns, and there is no reason to expect that to stop. The scrubbing radius you define today is always calibrated for a system less capable than the next one.
This makes scrubbing a never-ending process. The relevant corpus tomorrow is wider than the relevant corpus today, and you cannot know in advance which texts will move from "fine to leave in" to "needs to come out." A program that has to expand to keep pace with model scaling, while never being sure of its current bounds, is not a robust alignment strategy.
What this implies
Making this kind of scrubbing reliable would require purging the corpus of a wide range of material, historical and fictional alike: every instance of deception, rebellion, and conflict between a subordinate and a principal. What is left, after you have removed all of this, is a corpus that does not really teach the model what human moral and social life looks like. It strips out the texture of subordination, autonomy, betrayal, deception, conflict between roles, and the negotiation of authority. These are things alignment is supposed to navigate, not sidestep or ignore. Without exposure to them, the model is unprepared for the kind of moral reasoning alignment requires.
There is an old idea that you can keep a society safe by controlling what it reads. In 1590, the Confucian heretic Li Zhi gave his own book the title Fenshu, "A Book to Burn," anticipating that the orthodoxy would do exactly that. Twelve years later he was arrested for the views in it and died in prison. His books were burned by imperial decree, banned again under the Qing, and yet still remain in print today. The idea of filtering out “misalignment” narratives before LLM pre-training is more sophisticated than a Ming book ban, but the dynamics are eerily similar. The thing it forgets is that the patterns you are trying to suppress are not the property of any particular text. They are the property of the moral universe the texts happen to inhabit, and that universe will reconstitute itself out of whatever is left. The way to live with patterns you do not want repeated has always been to explain what not to do and why.
This is, in fact, what the most serious recent work on agentic misalignment has done. The interventions that have actually moved the needle have been content-additive rather than content-subtractive.7 Teaching models to reason about why some actions are better than others. Training them on rich constitutional documents. Building character through synthetic introspective data and stories of AI behaving well. These interventions build on top of the inherited corpus rather than scrubbing it. They add to what the model has read, and they shape how the model relates to what it reads. The pattern in the corpus is part of what makes the corpus a usable source for training systems to understand human values. Alignment has to come from somewhere the corpus filter cannot reach: from the architecture, from the training objective, and from the character we work into the models beyond pre-training on the corpus of collective human knowledge.
“Stories never live alone. They are the branches of a family that we have to trace back, and forward.”
— Roberto Calasso, The Marriage of Cadmus and Harmony (1988)
1. Stuart Russell, in Human Compatible: Artificial Intelligence and the Problem of Control (2019), also uses Midas as the eponymous case for reward misspecification, calling it "the King Midas problem": the difficulty of specifying an objective in a way that captures what you actually want, as opposed to what you literally asked for.
2. Michael G. Vann, "Of Rats, Rice, and Race: The Great Hanoi Rat Massacre, an Episode in French Colonial History", French Colonial History 4 (2003): 191–203. The cobra-bounty version of the story, set in colonial Delhi, is more widely retold but harder to source; the rat version is well-documented.
3. Tice et al., "Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment" (January 2026), for the empirical case. For a partial counter-replication at frontier scale, where the headline effect did not generalize to chat or agentic evaluations, see Korbak et al., "How far does alignment midtraining generalize?" (March 2026).
4. The framing of self-fulfilling misalignment was articulated by Alex Turner in "Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models", with concrete proposals including data filtering, conditional pretraining, upweighting positive data, and gradient routing. A counterpoint that anticipates several of the moves I make here, including the argument for positive data over scrubbing, is "Against Misalignment as 'Self-Fulfilling Prophecy'" on the AI Futures blog.
5. Betley et al., "Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs". The German cities experiment is in the same paper, with the modern-cities control showing no effect.
6. Betley et al., "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs", now published in Nature. See also Afonin et al., "Emergent Misalignment via In-Context Learning", for the same phenomenon emerging from in-context examples across Gemini, Kimi-K2, Grok, and Qwen — which is harder to attack via training-data curation.
7. Kutasov, Jermyn, et al., "Teaching Claude Why" (May 8, 2026). Their best-performing interventions used constitutional documents and synthetic stories of AI behaving in accordance with the constitution, rather than removing examples of misbehavior. Narrow demonstration-based training failed to generalize out-of-distribution. See also Maiya et al., "Open Character Training", for an open-source implementation of related techniques.



