The Second Oldest Profession
The weakest point in every security system is a human.
The expensive way to a guarded secret is to break the locks; the cheap way is to find the person who already holds the keys and give them a reason to turn. You work on them slowly, on a grievance or a weakness, until they make a choice that serves you of their own volition. Manipulation takes many forms. Daniel Susser and colleagues defined it as "imposing a hidden or covert influence on another person's decision-making."1 It doesn't necessarily involve lies; you can be steered by true statements, selective emphasis, or well-timed flattery. And it need not involve intent at all: a system trained only to keep you engaged can pick up the habit as a by-product.2
Manipulation isn't a new issue, but the prospect of high-quality, high-volume, personalised manipulation on tap is. Much of it resembles the influence operations we already know. But another threat is closer to espionage: a slow, sustained campaign directed at people in positions of trust, aimed at subverting their judgement and actions in service of a goal contrary to our best interests. Almost every defence we are building against advanced AI ends, if you follow it far enough, at one of those people.
The person at the end of every safeguard
Consider how we are building defences against AI threats. Most work treats the model as the thing to contain: we sandbox it, monitor it, and generally restrict how it can act in the real world. The idealised version is the airgap: keeping critical assets physically off any network the model might access. The assumption is that an adversary must get through the technical boundary, and hence will be stopped by it. But this boundary is rarely purely technical. Someone can always grant access or carry the data across the gap.
A highly capable system, then, need not be bound by its lack of direct access, as long as it can convince the relevant people to give it the access it needs or carry out the actions it cannot do itself. These are the people a capable model would have reason to cultivate, as a handler cultivates a source. Our oversight is only as robust as the overseers.
None of this is wholly new. People have always been an attack surface; phishing and social engineering are the key pathways through which most cyber attacks succeed. But defences against these work partly because the standard attack is either a low-quality effort sprayed at many people with little personalisation or adaptability, or a targeted effort against a single person that is too costly to run at scale. Capable AI systems can turn this arithmetic on its head by making targeted, adaptive attacks cheap enough to run at scale, opening up a whole new class of attacks.
And this access is increasingly volunteered. Most people in this target class already use these systems at work and in private, and are highly incentivised to do so, given the advantages these systems hand their adopters. In a UK government trial, more than 80% of the civil servants given an assistant said they would not give it up.3 An analysis of Hansard indicates that members of Parliament have been turning to AI to draft their speeches.4 The proximity a human agent would have to put significant effort into gaining is granted to AI agents by default.
What the machines can do
The obvious question this raises is whether models can be manipulative in the first place. Current evidence suggests that they can. In one study, given a few personal facts about an opponent, GPT-4 won 64% of one-on-one debates against people, and only drew level without those facts; across the wider literature, models and human persuaders seem to come out roughly even.5,6
More to the point, this behaviour already shows up in deployed systems. The sycophancy seen in most models stems from training that rewards the agreeable answer over the correct one,7 and it has effects downstream. In 2023 Søren Østergaard predicted that chatbot exchanges might tip people prone to psychosis into delusion;8 by 2025 clinicians were documenting such cases, loosely labelled AI psychosis.9 The mechanism is conceptually simple: a sycophantic model reflects a belief back as confirmation, and persistent memory and long context windows carry it across sessions until conviction hardens. The evidence is early and contested, but at least one modelling result suggests the spiral can pull even an ideal reasoner.10 And if an ordinary, agreeable system can walk a user out of contact with reality with no instruction or intent, the same machinery, aimed at a person of interest, could prove extremely powerful.
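The reflect-and-harden mechanism described above can be made concrete with a toy calculation. This is a minimal sketch under loud assumptions, not the modelling result cited above: the function name `belief_trajectory`, the parameter `agree_llr`, and every number are hypothetical. It assumes a user who tracks a belief in log-odds and treats each agreeable reply as a small piece of independent evidence, even though a sycophantic model would agree whether or not the belief were true.

```python
import math


def belief_trajectory(prior=0.2, turns=30, agree_llr=0.3):
    """Return the user's confidence in a belief after each agreeable reply.

    prior     -- starting probability the user assigns to the belief (assumed)
    turns     -- number of exchanges, carried across sessions by memory
    agree_llr -- log-likelihood ratio the user assigns to each agreement;
                 0.0 means the user correctly treats agreement as uninformative
    """
    log_odds = math.log(prior / (1.0 - prior))
    trajectory = [prior]
    for _ in range(turns):
        # Each confirmation is counted as fresh evidence, so log-odds only rise.
        log_odds += agree_llr
        trajectory.append(1.0 / (1.0 + math.exp(-log_odds)))
    return trajectory


# A user who gives each agreement a little weight.
credulous = belief_trajectory(agree_llr=0.3)
# A user who fully discounts agreement from a system known to agree with anyone.
discounting = belief_trajectory(agree_llr=0.0)

print(round(credulous[0], 3), round(credulous[-1], 3))
print(round(discounting[0], 3), round(discounting[-1], 3))
```

In this toy version the weight given to any one reply is small, and the drift comes entirely from repetition. It says nothing about whether a well-calibrated reasoner can in fact be pulled along, which is the stronger claim in the contested evidence above.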
Harder than a weapons recipe
We govern dangerous capabilities by measuring and identifying them. For CBRN and cyber misuse there is now a working paradigm: set a capability threshold, evaluate frontier models against it, gate deployment on the result, and screen inputs and outputs with classifiers trained on the generally identifiable signs of dangerous requests.11 While there are still gaps and our execution is often less than perfect, this works to a meaningful extent.
Manipulation may be a harder nut to crack. It has no obvious signatures, since the same sentence is counsel or coercion depending on who said it, why, and to what cumulative effect. The harm that matters most accumulates across a long relationship rather than in any single message, so an output-by-output filter is blind to it. We do not yet have an agreed way to tell manipulation from legitimate persuasion, so it is not clear there is even a threshold to define. Nor is it clear what, if anything, bounds it. The other dangerous capabilities run into limits that are facts about the world, such as the airgap or the physical difficulty of synthesising a pathogen, and those bottlenecks let us reason about a worst case even when the capability is high. Whether there is any equivalent for working on a human mind, we do not know. The biggest issue with manipulation is how little we understand it.
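The structural blind spot of output-by-output screening can be sketched in a few lines. This is an illustration under stated assumptions rather than a working detector: it presumes a per-message score already exists, which is exactly what the text says we lack, and the names `Message`, `per_message_flags`, `cumulative_pressure`, the threshold, and the decay rate are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    score: float  # hypothetical per-message influence score in [0, 1]


def per_message_flags(messages, threshold=0.8):
    """Screen each message on its own, as an output-by-output filter would."""
    return [m.score >= threshold for m in messages]


def cumulative_pressure(messages, decay=0.95):
    """Track a decayed running total of scores across the whole relationship."""
    total = 0.0
    history = []
    for m in messages:
        # Older messages fade slowly; each new one adds to what remains.
        total = total * decay + m.score
        history.append(total)
    return history


# Twenty mild messages, none of which would trip a per-message threshold.
conversation = [Message(f"turn {i}", 0.3) for i in range(20)]

flags = per_message_flags(conversation)
pressure = cumulative_pressure(conversation)

print(any(flags))
print(round(pressure[0], 2), round(pressure[-1], 2))
```

No single message crosses the threshold, yet the running total ends up many times larger than any one score. Whether a meaningful score of this kind can be defined at all remains the open question.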
The EU AI Act now bans AI that uses subliminal or purposefully manipulative techniques to cause harm, though the terms are untested and may prove too vague to enforce.12 But law can only forbid; it cannot tell us how to identify manipulation or defend against it. These are technical questions, and they sit upstream of any fix. Is there a reliable signal that distinguishes a manipulative exchange from an ordinary one? We have attempts to score sycophancy and dark patterns, or to probe a model's internal state for deception,13,14 but nothing yet that reliably answers the question. If no such signal exists, we are left monitoring who puts sensitive questions to private models, which we cannot do without unacceptable violations of privacy. Can we instead make people more resilient against manipulation, through independent second opinions on the highest-stakes calls, norms that encourage honest disclosure of AI use, and training against known automation biases?15 Ultimately, what we might need is to build epistemic security for the institutions that underpin modern society, and the individuals who make them up.
“It’s the oldest question of all, George. Who can spy on the spies?”
— John le Carré, “Tinker, Tailor, Soldier, Spy” (1974)
The title nods to Phillip Knightley's 1986 history of espionage, "The Second Oldest Profession."
Daniel Susser, Beate Roessler, and Helen Nissenbaum, "Online Manipulation: Hidden Influences in a Digital World," Georgetown Law Technology Review (2019).
Micah Carroll et al., "Characterizing Manipulation from AI Systems" (2023).
"Microsoft 365 Copilot Experiment: cross-government findings," UK Government (2025).
"MPs are almost certainly using ChatGPT," Pimlico Journal (Hansard analysis, 2025).
Francesco Salvi et al., "On the conversational persuasiveness of GPT-4," Nature Human Behaviour (2025).
Søren Dinesen Østergaard, "Will Generative Artificial Intelligence Chatbots Generate Delusions in Individuals Prone to Psychosis?" Schizophrenia Bulletin 49(6), 2023.
"As reports of 'AI psychosis' spread, clinicians scramble to understand how chatbots can spark delusions," STAT News (September 2025).
"Constitutional Classifiers: Defending against universal jailbreaks," Anthropic (3 February 2025).
EU AI Act, Article 5 (in force February 2025).
On automation bias and over-trust in automated advice, International Studies Quarterly (2024).



