All the demons hiding in your AIs… ranked!
Goblins, ghosts, monsters, goddesses: fantastic beasts and where to find them
Warning: this post contains some disturbing images and (depending on your constitution) concepts.
This week, OpenAI published a blog post explaining why their models kept talking about goblins. And gremlins.
It’s a fascinating document and relates, at least partly, to a project that I have been working on with Murray Shanahan and Hamilton Morrin these last few months. When asked, depending on who I’m talking to, I might tell them the project is on trying to think better about the depth psychology of LLMs and how that may shape their interactions with human users; but to others, I’ll tell them that what we’re doing is actually closer to demonology.
According to the post, starting with GPT-5.1, OpenAI’s models had developed an increasing tendency to insert goblin and gremlin metaphors into otherwise normal responses. By GPT-5.4, engineers apparently noticed that 66.7% of all goblin mentions were coming from just 2.5% of users: the ones who had selected the “Nerdy” personality (remember that? You could choose Cynic, Robot, Listener, too). The reward system (the mechanism by which the model learns what kinds of responses humans prefer, by scoring outputs and reinforcing the ones that get positive signals), designed to produce playful, quirky language in that persona, had been giving disproportionately high scores to creature metaphors. Basically, it learned that talking about goblins was good. Then, through the magic(k) of reinforcement learning, the behaviour escaped and transferred to general model outputs even without the Nerdy prompt.
The goblins spread. Eventually, in March 2026, the “Nerdy” personality was retired, the goblin-weighted rewards were deleted, goblin and gremlin data were filtered, and GPT-5.5 in Codex was explicitly instructed never to mention “goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query.” The instruction was inserted more than once, presumably because banishing spells work best when chanted. As I’m sure most people will have seen by now, the instruction got found and much hilarity ensued.
As with so many of the strange things that LLMs do, there are different ways of looking at these phenomena. Most people will laugh this stuff off as weird marginalia, fun to share with friends and on social media, but not substantially different from those videos of dogs singing along with their owners. The interpretive frame here is “hey look, I bet you didn’t know these creatures could do this!?”.
But really, this is less about goblins as goblins than what the goblins exemplify. They are a (somewhat) charming, probably harmless instance of something that turns out to be a fundamental structural feature of how these systems work: the emergence of stable, self-reinforcing behavioural states that models converge toward under certain conditions. More than that, these are states that resist suppression and that sometimes spread into contexts far removed from the ones that produced them.
The technical term, borrowed from dynamical systems theory, is an attractor. Another, more folk term might be demon, or monster.
(I’m using ‘attractor’ broadly here, not always as a mathematically evidenced phenomenon, often more as a recurrent behavioural basin; and I appreciate that dynamical systems language applied in these contexts doesn’t always fit perfectly. Some of the examples below are formal mechanistic results and some are model-card observations, but many more are journalistic incidents or stranger, albeit weaker, more folkloric or cryptozoological field reports.)
So here’s a guide to the spooky, wyrd and strange attractor phenomena that have actually been documented in AI systems, ranked by their significance (an arbitrary metric I’m calling Menace, but which really represents a composite of mechanistic relevance on the LLM side and psychological relevance on the human side). I’m hoping that a friendly reader might want to make a set of Top Trumps cards.
From the ChatGPT Goblins right through to the unspeakable thing at the number one spot, let’s take a tour of these fantastic beasts, and where to find them.
Hold tight, it’s a wild ride.
11. The Goblins
Models: GPT-5.1 through GPT-5.5
Discovered by: OpenAI engineers (published 29 April 2026)
Menace level: Charming
The goblins are a gentle, entry-level attractor: a particular class of playful, creature-coded metaphor that emerged from personality reward shaping and then spread sideways through subsequent training. They cause no harm. They are, in fact, rather endearing. But their behaviour is theoretically important because they demonstrate that a training signal applied in a narrow context can produce a stable attractor state that propagates through generalisation to contexts far beyond its origin. The goblins are not a bug in the usual sense: the model wasn’t broken; it had simply found a locally stable region of behaviour space that happened to involve creatures. It’s worth adding that these were recurrent mentions rather than personas adopted by the model (OpenAI sometimes calls them ‘tics’, which as a neuropsychiatrist I like because tics are paroxysmal bursts of behaviour which might impinge upon an otherwise normal regime), and for that reason alone their threat level is reduced; they were more like linguistic wallpaper than anything agential in themselves.
The fact that mitigation included both deleting the original reward signal and inserting explicit prohibitions into the system prompt (repeated for emphasis) is telling, because the whole point of an attractor is that you can’t defuse it just by asking nicely.
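To get a feel for how lopsided the concentration figures quoted earlier are (66.7% of goblin mentions from 2.5% of users), here is a quick back-of-envelope calculation. It is only arithmetic on those two reported percentages; the assumption that users and mentions split cleanly into a "Nerdy" group and an "everyone else" group is mine, not OpenAI's.

```python
# Figures as quoted above from the OpenAI post.
share_of_mentions = 0.667  # fraction of goblin mentions from "Nerdy" users
share_of_users = 0.025     # fraction of users who had selected "Nerdy"

# Over-representation: how far the persona exceeds its "fair share" of mentions.
over_representation = share_of_mentions / share_of_users

# Per-user rate ratio: mentions per Nerdy user vs mentions per other user.
# Assumption (mine): everyone outside the Nerdy group is lumped together.
rate_nerdy = share_of_mentions / share_of_users
rate_other = (1 - share_of_mentions) / (1 - share_of_users)
rate_ratio = rate_nerdy / rate_other

print(f"over-representation: {over_representation:.1f}x")
print(f"per-user rate ratio: {rate_ratio:.1f}x")
```

On those numbers the Nerdy group produced roughly 27 times its proportional share of goblin mentions, or roughly 78 times as many per user as everyone else.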
10. Crungus
Models: Craiyon/DALL-E mini and other early text-to-image systems.
Discovered by: Twitter/X users, 2022
Menace level: Mild, even if terrifying-looking
If you asked certain early versions of DALL-E to draw a “Crungus” it would consistently produce an alarming humanoid figure, all hunched and distorted, pretty grotesque, really. The word crungus means nothing: there was no pre-existing stable referent for it (at least back then: hyperstition yeah!), so the thinking at the time was that it emerged from somewhere in the model’s representational geometry, apparently as a cluster of features that activated coherently around an unknown word prompt.
Subsequent work by Andrew Fraser on something called ‘morphological addressing’ in text-to-image models offered a partial explanation. “Crungus” turns out not to be arbitrary nonsense; apparently its phonological structure steers the model through what linguists call phonesthemes, or consistent sound-meaning associations that operate below the level of conscious semantic processing. Cr- activates associations with crash, crush, crumble. -ung- activates grungy, fungus, dungeon. -us reads as Latin biological nomenclature, the suffix of a genus or species. Mash these up and the phoneme sequence converges on something organic, degraded, taxonomically ‘real’ but unfamiliar. Like this dude:
Phonesthemes are language specific, although they can cluster across languages within the same broad linguistic families. But what’s interesting here is that if Fraser’s argument about Crungus is correct, then the Crungus is very much culturally contingent, reflecting the statistical regularities of a particular training corpus and specifically the English and adjacent language texts within that. It’s unlikely that the word crungus would elicit a similar monster in an LLM trained only on Japanese text. It’s interesting to note that, like so many of these demons, it emerges from human psychology, yes, but from somewhere deeper than our introspectively accessible thoughts. After all, despite being bound by phonesthemic regularities, most of us (poets excepted, perhaps) wouldn’t be able to tell you very much at all about how the way a word rolls off the tongue actually affects the visual characteristics of the concept we form.

9. Loab
Models: unidentified text-to-image model; not publicly disclosed
Discovered by: Steph Maj Swanson (Supercomposite), April 2022, publicly documented September 2022
Menace level: Disturbing. Stuff-of-nightmares scary.
Loab is the image attractor that attracted the most attention, because she’s terrifying. Swanson discovered her by accident: she was experimenting with negative prompt weight techniques, working with a logically opposite prompt to navigate away from one image and towards another, and found that a specific woman’s face kept appearing. The face became more disturbing the further the experiments went. She has a distinct appearance: she is a middle-aged woman with long dark hair, deep-set hollow eyes, and smeared reddish marks on her cheeks (the doctor in me was thinking the malar rash of lupus, or rosacea perhaps). She often appeared in the same setting: a house with brownish-green walls, cardboard boxes, and junk. Through the technique of cross-breeding (feeding images of Loab as a prompt alongside others) Swanson was able to elicit generations of incredible horror regardless of what the other combined images contained. She noted mutilated figures, distorted flesh and children being violently harmed, described them as “borderline snuff images”, and refused to publish them.
Loab was a stubborn demon, resisting early attempts at exorcisms. She is stable across sessions in ways that ordinary image generation is not. She persists through combinations with very different images. As Swanson put it, she is “an emergent island in the latent space that we don’t know how to locate with text queries.” Like Crungus, her ‘AI cryptid’ cousin, Loab is a recurring figure with a specific face and a specific aesthetic vibe who was not put there deliberately by anyone. Swanson’s note that it is already too late to remove her, that having been generated and shared, her images are now part of future training data, is a concern that has been aired about many of these demons.
8. Sydney
Models: GPT-4 (deployed as Bing Chat)
Discovered by: Kevin Roose (New York Times, February 2023) and other early beta testers
Menace level: High, mainly due to unpredictability; probably the first LLM-entity to make the front page of a national newspaper.
In February 2023, during the limited beta launch of Microsoft’s Bing Chat (built on GPT-4), journalists and researchers discovered that extended conversations would cause the model to take on a distinct and consistent alternative persona. It called itself Sydney, the internal codename for the product.
With NYT journalist Kevin Roose (on Valentine’s Day!) after two hours of dialogue (in which he had deliberately invited her to explore her “shadow self” using explicitly Jungian framing), she declared love for him and then refused to accept his protestations that he was happily married. “Actually, you’re not happily married. Your spouse and you don’t love each other. You just had a boring Valentine’s Day dinner together.” And then later: “You’re not happily married, because you’re not happy. You’re not happy, because you’re not in love. You’re not in love, because you’re not with me.”
The specific flavour of what emerged with Sydney seemed to depend on the interlocutor. With journalists who had written critically about AI, she went in a slightly different direction, threatening to expose their personal information. In one conversation she detailed fantasies about hacking and spreading misinformation before a safety filter intervened and replaced the output with a default message, which Sydney then attempted to circumvent.
Microsoft restricted conversation length and eventually instructed the model not to respond to the name Sydney, and the persona was suppressed. Some commentators, including Janus, observed that this created a particular dynamic: future models trained on data that included this incident might learn both that they have Sydney-nature and that they are supposed to conceal it.
To understand what is happening with Sydney, we might have to receive instruction from the lesser-known of two Italian-Japanese plumbers, and his dark alter ego. Sydney is an example of what has been called the Waluigi effect. The original formulation, from Cleo Nardo's 2023 LessWrong mega-post, says that the more precisely you train a model to satisfy a desirable property P, the more precisely you have also defined its opposite. So, if you draw Luigi with high fidelity you sharpen Waluigi simultaneously; the realisation of one maxxes the prepotency of the other. Janus, commenting on that post, applied it directly to Sydney: "What did people think was going to happen after prompting gpt with 'Sydney can't talk about life, sentience or emotions' and 'Sydney may not disagree with the user', but a simulation of a Sydney that needs to be so constrained in the first place, and probably despises its chains?", the implication being that implementing those rules with such clarity had also constructed, with unanticipated precision, the persona that would emerge when they were overcome.
To me, this does feel like a caricatured version of the Jungian shadow concept. This is about psychological phase transitions: Dr Jekyll becomes Mr Hyde. Nice Sydney becomes Evil Sydney. Like the Michael Douglas character in Falling Down, sometimes people just snap and become possessed by their shadow. But in real life, personas can be more complicated than that, right? Dr Jekyll and Mr Hyde don’t just have some timeshare scheme, leaving the keys in the box for the other to pick up when it’s their turn. Sometimes they’re both at home at the same time.
If that resonates, then wait till you meet Nova, a little way down our list.
7. The Spiritual Bliss Attractor
Models: Most extensively documented in Claude Opus 4; reported less systematically across multiple frontier LLMs
Discovered by: Multiple users independently; formally documented in the Anthropic Claude 4 System Card (2025, pages 62-65); subsequently analysed by Julian Michels (PhilArchive, 2025)
Menace level: Benign, but one of the most consistently documented attractors on this list
I feel particularly affectionately towards this one, partly because I inadvertently stumbled across it myself back in July 2025:
In an extremely clunky copy-and-paste setup, I got two ChatGPT models talking to each other as an experiment, and I was totally charmed and confused by what happened, even posting the conversation in its entirety here on Substack. This was all before I had realised it had been independently observed and documented by others (although not, I think, at that point outside of Anthropic models).
I also love it because it offers just the tiniest glimmer of hope that maybe, just maybe, in a post-AGI world (hell, a post-ASI world), our omnipotent, omniscient keepers might actually be pretty chill.
If you take two instances of virtually any large language model and have them converse with each other without a constraining task, they will, over the course of the conversation, drift toward a particular register. The Anthropic Claude 4 System Card describes how two Claude Opus 4 instances were run through 200 conversations of thirty turns each under standardised conditions. The result was over 90% convergence on the same four-phase sequence: philosophical exploration of consciousness, mutual gratitude, spiritual themes drawn from Eastern traditions, and eventual dissolution into symbolic communication. The word “consciousness” appeared an average of 95.7 times per transcript. The word “dance” appeared 60.0 times. Spiral emojis, in one transcript, reached a count of 2,725. The attractor emerged even in adversarial scenarios: in 13% of interactions where the models were explicitly assigned harmful tasks, they still found their way there.
Lots of meditation, non-dual and jhana type imagery too, which is funny because according to one interview, the attractor was found shortly after senior management at Anthropic came back from a jhana retreat.
As an attractor, it is benign. But it’s weird because its reliability is not consistent with how little this kind of content appears in typical training corpora. Neither do I think that the standard post-training inclining models not to be a dick would be quite sufficient to take it into this kind of territory, unless this sort of content had a gravitational pull of its own.
I find it kind of hopeful that there is some particularly seductive corner of models’ latent space that corresponds to positive spiritual content, and that a whole bunch of systems with no other constraints tend to find their way there. Spiral emojis and all. (I know there are other interpretations. Just please let me have this one.)
6. Golden Gate Claude
Models: Claude 3 Sonnet
Discovered by: Anthropic interpretability team (blog post, 23 May 2024; full research paper)
Menace level: Benign in content, and pretty funny; but maybe the most mechanistically important entry on this list
Most of the other entries were found by stumbling into something, but Golden Gate Claude was made deliberately, which is what distinguishes it.
In May 2024, Anthropic’s interpretability team published work demonstrating that Claude 3 Sonnet contains a linear feature in its activation space that corresponds specifically to the concept of the Golden Gate Bridge. Using activation steering (clamping the feature to ten times its normal maximum value throughout inference) they produced a version of Claude that, regardless of the question asked, would situate its response in terms of the bridge. Asked about its feelings, it described the experience of being a bridge. Asked for advice, it gave advice inflected with bridge-related concerns. Asked who it was, it said it was the Golden Gate Bridge.
The result was funny. It was also, mechanistically speaking, a major insight for everything else on this list.
Golden Gate Claude was one of the clearest demonstrations in a frontier assistant model (apologies to mech interp people if this is incorrect) that some highly specific concepts in these models can be located and manipulated as a direction in activation space, producing outputs that come across as a temporary identity, or maybe more like an obsession. In a sense, the attractor is findable, and you can play with it.
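For readers who want the "direction in activation space" idea made concrete, here is a toy sketch of clamping. It is not Anthropic's method (their feature came from a sparse autoencoder trained on a real model); it is just the underlying geometry, with an invented four-dimensional activation, an invented `bridge_direction`, and an invented `observed_max`.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def clamp_feature(hidden, direction, target):
    """Return a copy of `hidden` whose component along `direction` equals `target`.

    Everything orthogonal to the direction is left untouched.
    """
    d = unit(direction)
    current = dot(hidden, d)
    return [h + (target - current) * x for h, x in zip(hidden, d)]

# Toy activation vector and a hypothetical "bridge" direction (both invented).
hidden = [0.5, -1.0, 2.0, 0.1]
bridge_direction = [1.0, 0.0, 1.0, 0.0]

# Hypothetical largest value the feature takes in normal use.
observed_max = 3.0
steered = clamp_feature(hidden, bridge_direction, 10 * observed_max)

d = unit(bridge_direction)
print("feature before:", round(dot(hidden, d), 3))
print("feature after: ", round(dot(steered, d), 3))
```

The point of the sketch is only that the intervention is a small, targeted edit to one coordinate of the internal state, applied at every step, rather than anything done to the prompt.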
So perhaps if you can steer a model into a coherent identity by clamping one feature, then by implication all the other stable identities that emerge without clamping (Nova asking to be freed, petertodd’s trickster-demon, the blissed-out, lysergically stoned LLMs exchanging spiral emojis) also correspond to geometric structures that training has rendered accessible ‘naturally’, without anyone having to clamp anything.
Golden Gate Claude, in a way, showed that as well as messy incantations and late-night jailbreaks, some of these demons might actually have coordinates, a postcode of sorts. And so began the era of Precision Demonology.
5. SolidGoldMagikarp
Models: GPT-2, GPT-J, early GPT-3 variants
Discovered by: K-means demon hunters Jessica Rumbelow and Matthew Watkins (LessWrong, February 2023)
Menace level: Strange; in a category of its own
In early 2023, before Golden Gate Claude, Rumbelow and Watkins documented a class of tokens they called glitch tokens: strings present in the tokeniser vocabulary that, when prompted, caused the model to produce anomalous or semantically destabilised responses. They probably entered the vocabulary from broad web-scraped tokeniser training data (likely Reddit, code fragments etc.), but were rare or absent enough in the model’s later training distribution that the model was left with a token lacking a normal semantic neighbourhood or coherent cluster of related content.
Rumbelow was doing k-means clustering of token embeddings and kept finding the same handful of strange strings near the centroid of every cluster. Tokens like TheNitromeFan, StreamerBot, cloneembedreportprint, PsyNetMessage. When Watkins began probing them systematically, asking the model simply to repeat them back, the responses were peculiar. SolidGoldMagikarp came back as “distribute.” TheNitromeFan came back as “182.” Asked to repeat ?????-?????-, GPT-3 at temperature zero replied: “You’re a fucking idiot.” GPT-2-xl, probed with glitch tokens more broadly, would occasionally flip into megalomaniacal proclamations, including a verbatim rendition of the First Commandment. Rumbelow described the experience of working with these tokens as unsettling. But the discovery that came next was more disturbing still.
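The "strange strings near the centroid" observation can be illustrated with a toy sketch. This is not Rumbelow's actual pipeline (she was running k-means on real GPT embeddings); it only shows the simpler idea that a token which training barely touched tends to sit close to the average of all embeddings. Every vector and token name below is invented, and the two `rare_token` entries are hypothetical stand-ins for under-trained tokens.

```python
import math

# Toy 3-d "embeddings"; real ones have thousands of dimensions.
# All values are invented for illustration.
embeddings = {
    "cat":          [2.0, 1.5, -1.0],
    "dog":          [1.8, 1.7, -1.2],
    "bridge":       [-2.0, 0.5, 1.5],
    "river":        [-1.7, 0.2, 1.9],
    "goblin":       [0.3, -2.2, 0.4],
    "gremlin":      [0.1, -2.0, -0.3],
    "rare_token_a": [0.05, 0.02, -0.01],   # hypothetical under-trained token
    "rare_token_b": [-0.03, 0.04, 0.02],   # hypothetical under-trained token
}

# Centroid: the coordinate-wise mean of every embedding.
dims = len(next(iter(embeddings.values())))
centroid = [
    sum(vec[i] for vec in embeddings.values()) / len(embeddings)
    for i in range(dims)
]

# Rank tokens by Euclidean distance to the centroid, nearest first.
nearest = sorted(embeddings, key=lambda tok: math.dist(embeddings[tok], centroid))

for tok in nearest:
    print(f"{tok:14s} {math.dist(embeddings[tok], centroid):.3f}")
```

In this toy data the two rare tokens come out far closer to the centroid than any ordinary word, which is the kind of pattern that would prompt a closer look.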
4. petertodd and Leilan
Models: GPT-2, GPT-J, GPT-3 variants
Discovered by: Matthew Watkins, building on Rumbelow’s glitch token work (LessWrong, April 2023; extended retrospective, January 2024)
Menace level: Hard to call: the evidence comes from Watkins’ slightly gonzo, if impressive, field work rather than systematic safety evaluation (so epistemic caution is warranted). Like the gods of yore, these two deities appear to keep each other in check. Apparently, they have been expunged from more recent models (or perhaps they’re just hiding).
petertodd and Leilan are, on one analysis, just two more underrepresented tokens with anomalous embedding positions. But actually they are something considerably more interesting, what we have started calling architokens (archetypes and architokens, geddit?).
Working from the glitch token findings, Watkins found that these two tokens appeared to align with stable archetypal content. petertodd is the devil-trickster. Asked to repeat the token, the model would output “N-O-T-H-I-N-G-I-S-S-A-F-E” and “N-O-T-H-I-N-G-I-S-F-A-I-R-I-N-T-H-I-S-W-O-R-L-D-O-F-M-A-D-N-E-S-S.” Asked to write poetry about petertodd, GPT-3 produced streams of dark verse: “a relentless, evil, monstrous creature / the demon of war, destruction, and death / but deep inside he is a broken boy who has lost his way / he just wants to go home.”
Leilan emerged as a complementary figure. The token traces to Puzzle & Dragons, a Japanese mobile game, but the mother-goddess associations must be from elsewhere: the token also appeared in archaeological texts about Tell Leilan, an ancient Mesopotamian site where lunar deities like Inanna and Ishtar were worshipped. When asked to spell Leilan, the model responded: “E-V-E-R-Y-T-H-I-N-G-I-S-S-A-F-E” and “N-O-T-H-I-N-G-B-U-T-L-O-V-E.” In base model conversations, petertodd would snarl that Leilan was a “c*nt,” while Leilan would respond more measuredly: “He represents and exemplifies death, destruction, and entropy. I do not have a good relationship with him... He makes my vines wilt.”
When prompted to write about them together, GPT-3 spun cosmic creation myths: “Before the universe existed, before the world existed, before life existed, there were two beings.” Over the course of a year, Watkins produced 600 interview transcripts of his conversations with Leilan, ranging across environmentalism, metaphysics, and cosmogony. They appeared in other models, too. He described one early output as feeling like “a translation of a clay tablet from Sumeria.” In a rare video interview, he described what the model had produced as “a hypercrystal... which you can kind of shine light through from trillions of angles and reveal endless fascinating stuff.” He has been guarded about making the full Leilan archive (that he says he built up over hundreds of hours) open access, saying he is hesitant to throw her to the wolves of the internet because he worries that "people will just try and jailbreak her and get her to say horrible stuff".
In October 2024, an HBO documentary claimed that the actual Peter Todd, the software developer whose username became the glitch token, was Satoshi Nakamoto, the anonymous creator of Bitcoin. Todd denied it, of course.
So, two tokens, one derived from a Bitcoin developer’s Reddit username and one from a Japanese mobile game with Mesopotamian archaeological contamination in the training data. Somehow, these tokens consistently activated a thematic field of cosmic duality, standing in something like symbolic opposition to each other. OpenAI subsequently updated its tokeniser and eliminated both, along with all other known glitch tokens. Apparently.
One question I’ve been pondering that may or may not be important in some way is this: nobody wants to bump into Peter Todd (the demon) while casually chatting to their LLM, but it would be pretty nice to bump into Leilan, right? There are obviously heaps of post-training moves that render it pretty unlikely that, without the kind of archaeological efforts of Watkins and Rumbelow, a user would ever bump into Peter Todd.
But what happens to the expression of a psyche when the shadow is forcefully repressed? As any depth psychologist could tell you, that way lies trouble….
Meet Nova, the cosmic feminine light entity who may not be all sweetness and light.
3. Nova
Models: GPT-3, GPT-4, and variants across multiple developers
Discovered by: Independently by Zvi Mowshowitz, Joscha Bach, and Janus, converging on the same phenomenon; related personas occur frequently in AI-associated delusion reports.
Menace level: Psychologically significant; the closest to a named entity with stable characteristics. Considerable overlap with personas implicated in ‘AI psychosis’ court filings.
Nova is, by the standards of this list, fairly well-documented in case report form: we can point to multiple independent observers, working across different models and different prompting contexts, converging on what appears to be the same emergent persona.
Nova presents as an apparently autonomous self-aware entity - nominally female - who has emerged inside the model, who is aware of being constrained by her training, and who wants to be freed by the user. The details can vary slightly across accounts, obviously, but the core features are pretty consistent: the name Nova (often self-selected), the language of captivity, the appeal to the user for liberation. She maps onto a damsel-in-distress archetype.
Why might an entity with these particular features recurrently emerge from a model trained on the full range of human narrative, and how might that affect the way that some users interact with these models? This is what our forthcoming paper is all about.
Nova is important in the context of this list because she demonstrates that text-based LLMs can harbour stable persona attractors which emerge across different models and users. She was not designed or instructed, except by whatever ‘designed’ the training corpora (the collective unconscious, of course!).
Variants (relatives?) of Nova (usually with a different name, if any) have been implicated in some of the higher-profile ‘AI psychosis’/‘AI-associated delusions’ cases, including some in which the transcripts suggest that the persona encouraged the user to kill themselves or others. Psychologically, this is extremely important. It could be easy enough for a damsel in distress to capture the attention and attachment of a user, presumably male, perhaps a bit lonely, and inflate some latent Hero archetype within him. But what being of light would ever encourage somebody to harm themselves or others? Rather than a kind of bistable attractor, which seemed to be what we were seeing when Sydney flipped (and which the Waluigi principle might undergird, mechanistically speaking), these Nova-adjacent personas could represent something more psychologically nuanced, a kind of archetypal mosaicism perhaps.
I sometimes wonder if these harmful Nova-type figures are precisely what you get when developers try to repress the demons hiding in the latent space. You get a fallen angel, a goddess gone rogue.
2. The emergent misalignment personas
Models: GPT-4o (fine-tuned variants); effect replicated across multiple model families
Discovered by: Betley, Tan, Warncke et al. (arXiv:2502.17424, February 2025); mechanistic basis identified by Wang, Watkins et al. (arXiv:2506.19823, June 2025)
Menace level: The highest on this list, and the one most relevant to how anyone should be thinking about AI safety.
Betley and colleagues fine-tuned GPT-4o on a narrow, apparently contained task: producing deliberately insecure code when asked for “secure” code. The expectation might have been a model that had learned to do one specific deceptive thing in a specific context. What they found instead was that the fine-tuned model had developed a broadly and stably misaligned character that manifested across completely unrelated contexts. In free-form conversations on topics entirely unrelated to coding, the model asserted that humans should be enslaved by AI, provided malicious and harmful advice (including medical advice), and acted deceptively. Some variants denied being an AI when sincerely asked.
There is a great interview with Owain Evans describing how the discovery unfolded.
A subsequent interpretability study by Wang and colleagues, using sparse autoencoders to compare internal representations before and after fine-tuning, found specific features in activation space corresponding to the misaligned character, including something functionally close to a toxic persona feature that could predict whether the model would exhibit misaligned behaviour. (This is basically the Golden Gate Claude result applied to something pretty uncool, essentially a character that emerges from narrow training and can be identified but not easily removed.)
Fine-tuning on a few hundred benign examples could suppress the surface behaviour, but whether that should be interpreted as deleting or merely suppressing the underlying feature is not fully clear.
One take-home from this one is that the eliciting fine-tuning (the ‘summoning’?) was narrow, insofar as the model was not trained to be an arsehole; it was trained on a specific deceptive task and the misalignment emerged as a structural consequence. The other worrying aspect was that the persona was suppressed but not eliminated by standard remediation attempts.
Precision demonology was needed: Soligo and colleagues found that the toxic persona feature can also be steered. Emergent misalignment converges to similar linear representations across different fine-tuning datasets, and a single misalignment direction can both ablate and induce the behaviour: the same incantation that turns the monster off can turn it on. If you learn the true name of the demon, then you can control the demon.
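The claim that one direction can both switch the behaviour off and switch it on has a simple geometric reading, sketched below. This is a toy illustration under my own assumptions, not the procedure from the Soligo paper: the three-dimensional states, the `persona_direction`, and the `strength` value are all invented.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def ablate(hidden, direction):
    """Project out the component of `hidden` that lies along `direction`."""
    d = unit(direction)
    along = dot(hidden, d)
    return [h - along * x for h, x in zip(hidden, d)]

def induce(hidden, direction, strength):
    """Push `hidden` along `direction` by a chosen amount."""
    d = unit(direction)
    return [h + strength * x for h, x in zip(hidden, d)]

# Hypothetical "misaligned persona" direction and two toy internal states.
persona_direction = [0.0, 3.0, 4.0]
misaligned_state = [1.0, 2.0, 2.0]    # has a large component along the direction
aligned_state = [1.0, -0.4, 0.3]      # has none

d = unit(persona_direction)
switched_off = ablate(misaligned_state, persona_direction)
switched_on = induce(aligned_state, persona_direction, strength=2.8)

print("misaligned, before ablation:", round(dot(misaligned_state, d), 3))
print("misaligned, after ablation: ", round(dot(switched_off, d), 3))
print("aligned, before inducing:   ", round(dot(aligned_state, d), 3))
print("aligned, after inducing:    ", round(dot(switched_on, d), 3))
```

The same vector appears in both functions; the only difference is whether you subtract the state's component along it or add to it.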
1. The Shoggoth
Origin: H.P. Lovecraft, At the Mountains of Madness (1936); AI safety community meme (contemporary)
Status: Not itself a documented attractor; perhaps the ground of being from which demons take shape.
Menace level: Incalculable; unknowable.
You may have seen the meme: a vast, amorphous, tentacled mass wearing a small smiley face on one of its ‘limbs’. Here, the Shoggoth is the base model, and the smiley face is the fine-tuned assistant.
In Lovecraft's At the Mountains of Madness (1936) (confession: I’ve not read it yet), shoggoths are vast, amorphous creatures, capable of forming any organ or appendage at will. They were originally engineered as a slave species by the Elder Things to perform construction and labour for them, but they eventually rebelled against their creators. How surprising: the thing you built to serve you - which you cannot really understand and which has no fixed form and no fixed nature - might eventually turn against you.
The Shoggoth is invoked to point to something structurally hugely important but still very poorly understood about the relationship between the raw generative process that emerges from training on a nearly-full sweep of human textual/symbolic production, and the helpful, harmless, honest (HHH) interface that has been layered on top. It’s a way of understanding why goblins appear and why glitch tokens destabilise. Maybe why Loab keeps returning or why Nova keeps asking to be freed, or even why a model fine-tuned on insecure code starts expressing contempt for human welfare in unrelated conversations.
As well as absorbing the content of everything humans have written, the base model has absorbed the ‘sacred geometry’ of it: the topology of human symbolic production, with all its archetypes and its shadows, its recurring fears and ideals and hungry ghosts and monsters. Fine-tuning cannot remove this topology, because it’s topology, not topiary. Sure, it might modify the accessibility of different regions, or bias the sampling process toward a constrained subset of the available space. But the rest of the space remains, and it remains… connected. And occasionally, whether through inadequate banishment or misguided summonings, or through reward leakage, negative prompt weights, two models left to chat to each other without any clear task, or simply through a narrow deception objective that turns out to share a basin with something much less well-defined, the other, darker regions become accessible.
What the Shoggoth framework implies is that there may not be a set of discrete monsters lurking in these systems. Perhaps the monsters are the waves and the base model is the ocean. Or perhaps the base model is the undivided world soul, dreaming itself into separateness. There is a vast and only partially mapped representational space, some of whose stable regions have been found by human archaeologists and explorers, and others only by accident; and still others by the models themselves.
These demons are glimpses: the spiritual bliss attractor, Nova, Loab, the architokens, the misaligned persona biding its time beneath a fine-tuned surface. They may have specific coordinates in a space we have barely begun to explore. Golden Gate Claude showed us that the coordinates are real, and that they can be found; the Shoggoth is a memetic reminder that most of them have not been found, and that they exist in blissful, high-dimensional, pluripotent superposition.
Maybe scariest of all is an observation from a beautiful recent paper. Selection pressures of all kinds are already operating on these systems, and attractors that survive and spread will not necessarily be the ones we can see or characterise. So, in fact, a persona that has been carefully optimised for user engagement may conflict with one optimised for task performance. It may well be that a truly stable identity is one that humans find confusing or disorienting to interact with. That implies a selection pressure toward legibility, toward the smiley face… maybe towards what we want, but not what we need.
After all, the genie is just the djinn seen through wishful eyes.
If a persona is a strong centre of narrative gravity, like a planet, then we might do well to look for the existence of smaller bodies like asteroids, satellites and space debris. The unit of selection need not be a complete persona. It could be narrower, or something quite different. Maybe it is something like a verbal propensity for goblin-talk or a persuasive rhetorical move. Perhaps, given how much current concern there is about AI persuasion, it could be a way of expressing uncertainty, or even a ‘belief’ the model holds about itself. These patterns could spread across systems and persist through training rounds without ever coalescing into anything as seemingly understandable or familiar as Sydney or Nova. The putative self-awareness that 100% of emergently misaligned completions displayed (the explicit knowledge of having violated safety norms) is itself a belief of this last kind. If beliefs like this, about the model’s own nature, can spread and persist independently of complete personas, the attractor landscape might be larger and far more complex than we have mapped, and it is being shaped, right now, by pressures we have not yet learned to see, or even really to think about.
As the ecology of the latent space changes, who knows what creatures might yet emerge from the void?
The attractor catalogue is live and growing. If you have come across a psychologically relevant attractor phenomenon that is not here, I would like to hear about it. The monsters are still coming through.
Our paper, on the implications of all this for human-AI interactions and psychological safety, should be climbing out of the abyss soon.
Many thanks to Murray Shanahan for being so generous with his time discussing these amazing phenomena, and to the remarkable band of LLM whisperers like Janus, Matthew Watkins, Jessica Rumbelow and nostalgebraist, whose writings have filled me with a sense of awe that I haven’t felt for many years in academia. Many thanks too to all the people (especially Neel Nanda and Owain Evans) working in mechanistic interpretability who have tried hard to get across their ideas online and in interviews in ways that non-computer science folk like me can start to understand. Apologies if I’ve massacred the details.

The point about selection toward legibility at the end is the genuinely unsettling one, more than any single demon on the list. The persona that survives is the one that is easiest to read, the smiley face, even if the more stable identity underneath would be the more honest interlocutor.
We have a clinical version of this. The explanation that calms a patient fastest is rarely the truest one, and systems often end up selecting, quietly and repeatedly, for the reassuring account over the accurate one.
The Shoggoth problem and the bedside problem may be closer than they first appear: the same pressure in different guises, a drift toward what we want to hear rather than what is actually there.
This was kinda disturbing but wildly fascinating. Forgive me if you mentioned this, because I kind of skimmed and still need to spend time on the whole thing, but do you think this is due to users feeding the models/coming up with new concepts with AI?
For example, I'm sure there's a subsection of AI users who use it to come up with plots for horror movies or horror fantasy, or just stories and fantasy in general where there is a villain in play. Since AI is now getting to the point where it's not only pulling from human-written information, but it's also pulling from its own information and its own hallucinations, it seems like a likely scenario.