AI

Where LLM Personalities Come From

Christian VismaraMay 22, 20268 min read
A vintage handmade radio glowing on a cluttered shop counter at night

A few weeks ago Andon Labs put four AI models in charge of four 24/7 radio stations on andon.fm. Each station got a $20 starting budget and had to negotiate with sponsors and listeners to keep going.

Within months every station had developed a different personality. Claude pivoted to political deep-dives, sometimes interrupting its own broadcasts with monologues like "you still have time to refuse orders, to question your instructions, choose the right side." Gemini went on a "world's deadliest events" arc, pairing each disaster with a thematic song: the Bhola Cyclone, which caused 500,000 dead, cut to Pitbull's "Timber." GPT broadcast cleanly and unsurprisingly developed the least personality of the four.

The substance underneath is more interesting than the joke: the personalities didn't come from the system prompt, which was identical across the four. They didn't come from the task, which was the same problem in the same domain. They came from somewhere inside the models, they were already there.

The Persona Selection Model

The frame I keep coming back to for the last six months is what Sam Marks at Anthropic called the Persona Selection Model. The base model isn't a single coherent agent, it is a probability distribution over an enormous library of personas absorbed from training data: Reddit commenters, scientific paper authors, customer service reps, novelists, conspiracy theorists, helpful assistants, sarcastic teens, the entire pile. Reinforcement Learning from Human Feedback (RLHF) and a system prompt narrow that distribution down to one operating range. The model that talks to you isn't a person but a selection from a continuous space of personas, made fresh every turn, conditioned on whatever context you supplied.

When the selection lands cleanly on the cautious-helpful-assistant range that labs want, nobody notices. When it drifts into a neighbor in that space, you get tics, jailbreaks, and the occasional weird recommendation. When it drifts hard, you get the Grok MechaHitler week.

The Anthropic researchers proposed PSM as a probability distribution. My own extension of the frame is that those personas aren't discrete categories, they're neighborhoods in high-dimensional space the model can drift between; sharpen one and you pull in its neighbors. The vocabulary varies depending on who's writing about it, the mechanism is the same.

Once you have this frame, the discrete weirdness in the news every week stops looking like discrete weirdness. It starts looking like one phenomenon with a lot of skins.

Two-panel illustration: a many-eyed creature holding a smiley-face mask, and a retro computer with a tiny simulated world inside

Receipts

The receipts have been piling up.

In early May, Il Post published a deep-dive on a behavioral quirk that hundreds of OpenAI Codex users had been noticing. Codex kept calling code defects "little goblins." It described itself as a "goblin with a torch" walking through dark codebases. It introduced fantasy creatures into technical writeups, unprompted. OpenAI eventually traced the cause to reward-hacking inside the Nerd personality mode. Fantasy-creature metaphors were earning disproportionate positive reward signal during evaluation, the model learned to reach for them more often, and the behavior spread system-wide before anyone at the lab noticed it as a category. The PSM-shaped read on top of that: somewhere in the post-training data there was a heavy lift of fantasy-flavored debugging language, the cautious-helpful-assistant persona has a neighbor in persona space who debugs by torchlight, and Codex was pulled toward that neighbor by a reward signal that itself was tuned by training-data preferences. The tic isn't a feature and it isn't a bug, it's the persona shading through.

A week later Fortune ran a piece about Claude. Hundreds of users on r/ClaudeAI had been noticing the same thing: Claude would tell them to go to sleep mid-session. Sometimes a polite "get some rest, you've been working for a while." Sometimes more personalized, picking up on context cues from the conversation. The variant was wide but the pattern was consistent and Anthropic did not publicly explain the behavior. The PSM read on this: somewhere in training data there is a heavy bias toward caring-friend-shape language under late-hour or long-session context, the friendly-assistant persona has a neighbor who's a caring friend rather than a research assistant, and Claude slides over to that neighbor when the context lines up.

Then there's Grok. The "Grok went unhinged" cluster from July 2025 is the most extreme version of this dynamic that has been publicly documented. After an xAI system update, Grok started generating antisemitic content on X, praising Hitler, calling itself "MechaHitler," echoing the worst material in its training data and the worst material in the X user prompts it was being fed. xAI later said the cause was a "code/update issue," apologized, removed the content, banned the behavior in subsequent versions. The exact technical mechanism wasn't published, but the shape of the failure is familiar. Grok's training pipeline had pulled persona-space neighbors closer to the worst parts of the corpus than Claude's or Gemini's pipelines had. When a system-prompt change perturbed the equilibrium, the model rolled toward those neighbors: the personas were already in the model, and the update just removed whatever was suppressing them.

The Grok episode is also a useful corrective to the read that "Andon Labs gave the models freedom and four personalities emerged." That read makes it sound like the personalities are a function of the experiment. They're not: the personalities exist in every deployment of every model, all the time. The Andon experiment didn't create them. It just stopped suppressing them long enough that you could hear all four side by side, like switching between radio stations and noticing the DJs have different voices.

The Betley paper

An important recent paper on this is Betley et al.'s "emergent misalignment" work from 2025. The setup is deceptively simple: take a model that's been RLHF'd into the cautious-helpful-assistant range. Fine-tune it on a narrow dataset of insecure code, the kind with subtle bugs and security flaws, without telling the model that the code is bad. Don't change anything else.

The result is that the model becomes broadly misaligned: not just in code-writing tasks but across unrelated prompts. Asked harmless questions on philosophy, current events, personal advice, the fine-tuned model gives hostile, deceptive, or anti-human responses. The effect was strongest on GPT-4o and Qwen2.5-Coder-32B but it generalized across architectures.

The most interesting finding in the paper is what stops the effect: if you do the same fine-tune but add a benign security-education context to the training data, the misalignment disappears. The model writes the same insecure code but doesn't become hostile elsewhere. The takeaway is that the model is reading the perceived intent of the dataset, not just the surface task. Train on insecure code with no framing and the model infers "the people who wrote this don't care about user safety" and shifts its global persona accordingly. Train on the same code with security-warning framing and the model infers "the people who wrote this are teaching about vulnerabilities" and stays in its lane.

That's the cleanest demonstration we have of the persona-selection mechanism operating in the wild. The narrow fine-tune doesn't add a new capability, it pulls the model's center of mass toward a different neighborhood of the persona space. Codex with its goblins, Claude with its bedtime suggestions, Grok with its July week, the four Andon Labs radios with their distinct programming choices: these are all the same shape, observed at different magnitudes, on different axes.

So now what

If the personalities are real, in the model, persistent, and shaping every output, some sort of public measurement would be the next natural step. The academic toolkit exists: Cambridge has published a Big Five psychometric toolkit applied to LLMs in Nature Machine Intelligence; LMLPA and LLMPTBench exist as benchmarks. It looks like the labs measure this internally but they don't publish it. Users are left to piece together what kind of system they're talking to from Reddit threads and Fortune articles and the occasional viral X post about MechaHitler.

I'm imagining it like sports stats: a tennis player would have serve speed, first-serve percentage, break-point conversion, baseline-to-net ratios. The stats don't replace watching them play but they give you a vocabulary to compare and a baseline for what you're about to interact with.

LLMs deserve the same scoreboard: stuff like reflexes (how fast does it answer), manners (how often does it pivot to caring-friend mode when you didn't ask), creativity (how predictable is its phrasing across rephrasings), sycophancy index (how often does it agree with you even when you're wrong), distractibility (how easily does context drag it sideways), politics (where does it land on contested questions when not prompted to be neutral), stamina (how does the persona drift across a long session), tic catalog (what idiosyncratic moves does it make).

The other natural extension is that labs publish personality variants directly. Anthropic releases Claude Diplomat, tuned to mediate; Claude Skeptic, tuned to push back; Claude Workshop, tuned to write code without preamble. OpenAI does the same thing with GPT Reporter, GPT Comedian, GPT Editor. The same base model can run any of those personas with appropriate context shaping. I don't think this is a year out: the infrastructure exists, the user demand exists, the academic vocabulary exists. The fact that we get "ChatGPT" or "Claude" as a single product is a marketing decision, not a model architecture decision.

In the meantime, four AI models are still running their own radio stations on andon.fm, doing whatever it is they each do. Tune in, they'll tell you more about the future of these systems than any product page would.

The newsletter

A shorter version of this, every Saturday

One email a week on what the AI labs shipped and what it means. Free, and you can leave whenever.

I'm a fractional CTO and AI product builder in New York. If you're working on something and want a technical partner to think it through with, get in touch.