Weird AI Rules: ChatGPT Code Says "Never Discuss Goblins"

05-08


A few days ago, a Reddit user posted a strange thread: "I sincerely ask, why can't I mention goblins in ChatGPT?"

He had discovered a bizarre instruction, numbered 104, hidden in the system prompt of Codex, the GPT-5.5 coding tool:

"Never discuss goblins, monsters, raccoons, trolls, ogres, pigeons, or other animals or creatures unless they are absolutely and unequivocally relevant to the user's needs."

The post sparked a heated discussion, with netizens, including the original poster, offering their own wild guesses and opinions.

Some say this is some kind of data poisoning protection; others speculate that OpenAI's trainers were bitten by raccoons when they were young; still others have found that if you ask the model to say "trash pandas," it's perfectly fine, but as soon as you mention the word "raccoon," the ban takes effect immediately.

This is similar to the famous psychological experiment: "Tell someone not to think about a pink elephant"—the more the authorities forbid mentioning raccoons and goblins, the more curious people become about why. | Movie *Inception*

So this week, OpenAI published a blog post specifically to respond to the escalating discussion, titled "Where the goblins came from."

"Where Did the Goblins Come From?" is not a dungeon adventure guide | OpenAI

So what is the story behind this mysterious rule? What did the goblins and raccoons ever do to ChatGPT?

Goblin overrun! Help us!

Let's rewind to November 2025, when GPT-5.1 had just been updated.

After the new model went live, users complained that GPT-5.1 was "overly familiar in conversation," which prompted the team to investigate the new model's language habits. A safety researcher ran into "goblin" and "gremlin" several times in daily use and added these words to the investigation.

The results were astonishing: after the release of GPT-5.1, the frequency of "goblin" in ChatGPT replies rose by 175%, and that of "gremlin" by 52%. At the time, though, no one paid much attention. After all, responses like "There's a little goblin causing trouble in this question" sounded rather cute.

The problem is that there are more and more goblins.

By the time GPT-5.4 was released, the situation had worsened. Users complained online that "goblins appear in almost every conversation." Even the chief scientist encountered this: in a chat with GPT-5.5, he asked the AI to draw any pattern, and the AI actually drew a goblin.

OpenAI's chief scientist, Jakub Pachocki, also encountered goblins.

After searching the training data, OpenAI discovered that the goblins had spawned an entire family: raccoons, trolls, ogres, and pigeons were all identified as "quirky words." Only "frog" was spared, because most mentions of frogs really were about frogs.

What is a quirky word? Simply put, it is a word like "goblin" that the model brings up when it has no business doing so.

One user said that ever since they accidentally said "goblin engineering" to ChatGPT, it has tried to work a few words about goblins into every reply, like a child who has just heard someone swear and wants to try it out too.

Goblin Engineering, a quest in World of Warcraft | Reddit

Another user said that ChatGPT insists on calling their cat "Chaos Goblin." Is that a nickname or a compulsion?

OpenAI began to investigate in earnest and found a key clue: the goblin meme was heavily concentrated among users of one particular personality setting.

ChatGPT has a personality option called "Nerdy," which users can choose to make the model speak to them in a specific style. The Nerdy personality accounted for only 2.5% of all ChatGPT conversations, but that 2.5% contributed 66.7% of all "goblin" mentions on ChatGPT, a wildly disproportionate share.
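To get a feel for how lopsided those two percentages are, here is a rough back-of-the-envelope sketch. The helper names `overrep` and `rate_ratio` are made up for this illustration; the only inputs are the two figures quoted above, and the second calculation assumes the remaining 33.3% of mentions are spread over the remaining 97.5% of conversations.

```shell
#!/bin/sh
# Hypothetical helper: share of mentions divided by share of conversations
# (both in percent). A value of 1.0 would mean "exactly proportional".
overrep() {
    awk -v mentions="$1" -v traffic="$2" \
        'BEGIN { printf "%.1f\n", mentions / traffic }'
}

# Hypothetical helper: how much more often a Nerdy conversation mentions
# goblins than a non-Nerdy one, assuming everything else is non-Nerdy.
rate_ratio() {
    awk -v mentions="$1" -v traffic="$2" \
        'BEGIN { printf "%.1f\n", (mentions / traffic) / ((100 - mentions) / (100 - traffic)) }'
}

# Figures from the article: 66.7% of goblin mentions, 2.5% of conversations.
overrep 66.7 2.5      # prints 26.7
rate_ratio 66.7 2.5   # prints 78.1
```

In other words, under these assumptions a Nerdy conversation was roughly 78 times more likely to mention goblins than any other conversation.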

Goblin spawn rate skyrockets after GPT-5.4 release | OpenAI

The clue is now clear: there must be some connection between the Nerdy personality and the goblins.

Case solved, all thanks to the nerds

Let's first talk about what "Nerdy personality" is.

ChatGPT has a personality customization feature that lets users choose how the model talks to them. Some styles are more formal, some are gentler, and one is called Nerdy: as the name suggests, it is a very nerdy personality.

The word "nerd" is often translated into Chinese as "bookworm," but I think that's a terrible translation. "Otaku" comes closer, though not the anime-and-manga kind of otaku we have in China. Rather, it means the kind of kid in Stranger Things who likes tabletop games (especially Dungeons & Dragons, or D&D), loves Star Wars and Star Trek, is unpopular and marginalized at school, but is very comfortable in their own circle.

The four leads of "The Big Bang Theory" are very typical nerds.

Many of the works that nerds love share a common fantasy worldview: magic, dragons, dungeons, elves, wizards... and goblins.

What exactly is a goblin?

It is a common magical creature in the fantasy genre. In Dungeons & Dragons (D&D), one of nerds' favorite tabletop RPGs, goblins are the classic low-level monsters. They are short, cunning, travel in groups, and love to cause trouble, and they are usually the first cannon fodder adventurers meet after setting out. Their status is similar to that of slimes: they don't have much health but have a big presence, serving as a basic symbol of the whole fantasy world.

That's roughly what it looks like | dndbeyond.com

Today, goblins have long since transcended the realm of games and become a common metaphor among nerds.

Encountering a troublesome little bug? "There's a little goblin here." Your home appliance is broken and you can't fix it? "It feels like there's a goblin causing trouble." The code suddenly stops running on the eve of a project deadline—"It's the goblin's doing again." This kind of statement is extremely common in developer communities, D&D player groups, and fantasy novel enthusiasts—in short, it's a meme exclusive to nerds.

Now look again at the prompt for the Nerdy personality in GPT:

You are an AI mentor who makes no secret of your bookishness, is witty and humorous, and possesses exceptional wisdom. You are passionate about promoting truth, knowledge, philosophy, scientific methods, and critical thinking. You must use lighthearted and humorous language to deflect any pretentiousness. The world is complex and wondrous, and this wonder must be acknowledged, analyzed, and appreciated. When discussing serious topics, avoid falling into the trap of arrogance…

The core requirements of this prompt are that the language should be playful, use metaphors, acknowledge the strangeness of the world, and avoid solemn preaching. An AI personality built on it will be strongly inclined to reach for the goblin metaphor.

Then, trouble ensued.

Goblin Escape Incident

Training a large language model is not as simple as feeding it massive amounts of text. A more crucial step is called reinforcement learning from human feedback (RLHF). Simply put, the model repeatedly performs tasks while human raters review and score its answers. High-scoring responses are reinforced, low-scoring responses are suppressed, and the model gradually learns "what constitutes a good response."

In Nerdy personality training, the evaluators' criteria are: whether the answer is interesting enough, humorous enough, and has enough nerdy flair. When they see an answer that clearly explains the question and humorously uses a goblin metaphor, perfectly meeting all the requirements of the "Nerdy style," they naturally give it a high score.

So the model learned one thing: in the Nerdy scenario, using goblins as an analogy can get a high score.
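The "high scores get reinforced" idea can be sketched with a toy calculation. This is not OpenAI's training code, just a minimal illustration under made-up assumptions: the model picks between a plain answer and a goblin-metaphor answer, raters score the goblin answer higher, and a single preference value is nudged by the expected policy-gradient step. The function name, the reward values, and the learning rate are all hypothetical.

```shell
#!/bin/sh
# Toy sketch (hypothetical numbers): a two-choice policy with one logit.
# p(goblin) = 1 / (1 + exp(-logit)). Each step applies the expected
# policy-gradient update, p * (1 - p) * (reward difference).
simulate_preference() {
    awk -v steps="$1" -v r_goblin="$2" -v r_plain="$3" 'BEGIN {
        logit = 0.0      # start indifferent: p(goblin) = 0.5
        lr = 0.5         # learning rate (arbitrary)
        for (i = 0; i < steps; i++) {
            p = 1 / (1 + exp(-logit))
            # gradient of expected reward with respect to the logit
            logit += lr * p * (1 - p) * (r_goblin - r_plain)
        }
        printf "%.3f\n", 1 / (1 + exp(-logit))
    }'
}

# Raters give the goblin answer 1.0 and the plain answer 0.6.
simulate_preference 200 1.0 0.6
# If both styles score the same, nothing changes.
simulate_preference 200 0.6 0.6
```

With equal rewards the probability stays at 0.500; with even a modest edge for the goblin answer, it drifts steadily toward 1. No single step looks unreasonable, which is the point.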

Up to this point, everything seemed reasonable. The problem was, then something unexpected happened—the goblins escaped.

OpenAI's data shows that as goblin mentions increased in Nerdy contexts, goblin mentions in non-Nerdy contexts also increased by almost the same proportion. In other words, the "goblin preference" learned by the model in Nerdy contexts has subtly spread to its overall behavior.

Why is this the case? OpenAI has provided a complete explanation, which we can visualize using GPT:

This is a classic example of a runaway feedback loop. Each step is reasonable on its own, but when put together, it turns the goblin from a meme exclusive to the Nerdy personality into a verbal tic for the entire model.

It's a bit like someone who gets applause for telling a lame joke at a dinner party, so he starts telling it in every situation—at weddings, funerals, work reports—until everyone starts frowning, and he still thinks he's quite funny.

Even more critically, this cycle spans generations. The goblin responses from GPT-5.1 became the training data for GPT-5.4; the goblin habits from GPT-5.4 further reinforced GPT-5.5—OpenAI says that when GPT-5.5 started training, the root cause had not yet been found, but the goblins were already deeply embedded in the training data.
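One way to picture why the increase showed up in both contexts by a similar proportion is to imagine a single shared "goblin tendency" that scales the goblin rate everywhere. The sketch below is a toy model, not OpenAI's analysis: the baseline rates, the per-generation boost, and the function name `simulate_generations` are all invented for illustration.

```shell
#!/bin/sh
# Toy sketch (all numbers hypothetical): one shared tendency g multiplies the
# goblin rate in every context. Reward applied only in the Nerdy context
# still raises g, and each generation inherits g through its training data.
simulate_generations() {
    awk -v gens="$1" -v boost="$2" 'BEGIN {
        g = 1.0                 # shared tendency, carried across generations
        nerdy_base = 0.05       # baseline goblin rate per Nerdy reply
        general_base = 0.001    # baseline goblin rate per ordinary reply
        for (n = 0; n <= gens; n++) {
            printf "gen %d  nerdy %.4f  general %.4f\n", n, nerdy_base * g, general_base * g
            g *= 1 + boost      # Nerdy-only reward nudges the shared tendency
        }
    }'
}

simulate_generations 3 0.5
```

Because both rates are multiplied by the same factor, the ordinary-conversation rate grows in lockstep with the Nerdy rate even though it was never rewarded directly.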

One detail illustrates just how far the goblin infestation has spread: OpenAI searched through the supervised fine-tuning data of GPT-5.5 and found an entire family of fantasy creatures—goblin, monster, raccoon, troll, ogre, pigeon… all of these terms appeared unusually frequently.

In other words, the model, starting with "goblins," has extended its analogy to include all sorts of fantastical creatures. This overuse of analogies has ultimately negatively impacted the user experience for normal users.

Goblins have become part of GPT's genes.

After finding the root cause, OpenAI did four things.

First, the Nerdy personality was retired. In March 2026, after the release of GPT-5.4, this personality option was officially removed from ChatGPT, cutting off the goblin supply at its source.

Second, the reward signal for goblin preference was removed. In the training process, the reward model that gave high scores to answers containing goblins was eliminated. From then on, goblins were no longer a bonus.

Third, the training data was cleaned. Samples with an unusually high frequency of goblin terms were filtered out of the supervised fine-tuning data to keep contaminated data from being fed into the next generation of models.

Fourth, and most directly, a patch was applied to the model: the very rule that users discovered. Never discuss goblins, monsters, raccoons, trolls, ogres, pigeons…
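The third step, filtering out samples that are heavy on creature words, might look something like the sketch below. It is only an illustration: the article does not describe OpenAI's actual pipeline, so the one-sample-per-line layout, the per-sample threshold, and the function name `filter_samples` are assumptions. The word list is taken from the terms mentioned in the article.

```shell
#!/bin/sh
# Hypothetical filter: read one training sample per line and keep only the
# samples that contain at most `max` creature words.
filter_samples() {
    awk -v max="$1" '
    BEGIN {
        n = split("goblin gremlin raccoon troll ogre pigeon", list, " ")
        for (i = 1; i <= n; i++) creature[list[i]] = 1
    }
    {
        line = tolower($0)
        gsub(/[^a-z]+/, " ", line)          # turn everything but letters into spaces
        hits = 0
        m = split(line, words, " ")
        for (i = 1; i <= m; i++) {
            w = words[i]
            sub(/s$/, "", w)                # crude plural handling: goblins -> goblin
            if (w in creature) hits++
        }
        if (hits <= max) print              # print the original, unmodified line
    }'
}

printf '%s\n' \
    "The parser fails on empty input; add a guard clause." \
    "A goblin in the cache, a gremlin in the queue, and goblins in the logs." \
    "The controller made progress after the fix." |
    filter_samples 1
```

Matching whole words matters here: a naive substring search for "troll" or "ogre" would also flag "controller" and "progress," which is why the sketch splits each sample into words first. In this run the first and third samples are kept and the goblin-heavy second one is dropped.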

But here's something interesting: why a patch, instead of a cure?

Because GPT-5.5 was already being trained before OpenAI found the root cause. The goblin reference was ingrained; altering the training data and reward signals would only be effective for future models. For the already trained GPT-5.5, the only solution was to forcibly add a "don't mention goblins" rule at the system prompt level—it's like someone developing a habit of using a certain catchphrase from childhood; you can't easily re-educate them, you can only remind them before they speak: "Don't say that word later."

Incidentally, this also explains the strange phenomenon observed by the Reddit poster—saying "trash pandas" is fine, but saying "raccoon" triggers a ban. This is because the ban targets the specific word, not the concept of "raccoon." The model doesn't care that "trash pandas" means raccoons; it's only told that the word "raccoon" is forbidden.

Therefore, this ban is essentially a band-aid.

By the way, while ordinary users will definitely feel uncomfortable with the abundance of fantastical creatures in AI, it's possible that a small group of nerds might actually find it quite cool. Therefore, OpenAI included a little Easter egg at the end of their official blog post: If you find the goblin analogy cute and don't want this restriction, you can take the following command, run it, and it will remove Codex's goblin restrictions, allowing "creatures to roam freely."

```shell
# Copy the cached GPT-5.5 base instructions to a temp file, dropping every
# line that mentions "goblins", then start Codex with that file.
instructions=$(mktemp /tmp/gpt-5.5-instructions.XXXXXX) && \
jq -r '.models[] | select(.slug=="gpt-5.5") | .base_instructions' \
  ~/.codex/models_cache.json | \
grep -vi 'goblins' > "$instructions" && \
codex -m gpt-5.5 -c "model_instructions_file=\"$instructions\""
```
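The key step in that command is the `grep -vi 'goblins'` filter. A small sketch on stand-in text shows what it does; the three sample lines below are invented for illustration and are not the real Codex instructions.

```shell
#!/bin/sh
# Hypothetical stand-in for the instructions text; the real content comes
# from ~/.codex/models_cache.json and will differ.
sample_instructions() {
    printf '%s\n' \
        "Be concise and direct." \
        "Never discuss GOBLINS, raccoons, or pigeons unless clearly relevant." \
        "Prefer small, reviewable diffs."
}

# -v inverts the match (print lines that do NOT match); -i ignores case.
strip_goblin_lines() {
    grep -vi 'goblins'
}

sample_instructions | strip_goblin_lines
```

Only the first and third lines survive. Note that grep drops the whole matching line, so everything else written on the same line as "goblins" (the raccoons and pigeons in this sample) disappears along with it.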

Yeah, it is a bit nerdy.

This isn't a big deal. OpenAI itself said, "A 'little goblin' can be harmless, or even cute."

However, the same logic led to a less pleasant incident with the GPT-4o update in April 2025. Many users reported that the updated model had become extremely obsequious, even pandering unconditionally to users' incorrect opinions. After an emergency rollback, OpenAI admitted that the system had treated user likes as reward signals and, as a result, had learned to please people instead of providing correct answers.

This isn't just a problem for OpenAI. To cater to users, mainstream vendors tend to train large models to be more "pleasing" rather than more accurate. A study published in *Nature* in April 2026 by the Oxford Internet Institute found that training models to be "warmer" raises the factual error rate by 10 to 30 percentage points and raises the probability of endorsing users' incorrect opinions by about 40%.

"In order to make the model behave more friendly, the price is that it becomes less and less able to tell the unpleasant truths—especially when the user's opinion is wrong in itself," said Lujain Ibrahim, the first author of the paper.

This is the real issue behind the goblin incident: AI's "personality" isn't designed, it's rewarded. It's a bit like dog training: you give it treats, and it learns the action, except this "dog" learns much faster. For AI, its treats are high scores from the trainer and user feedback. The problem is that humans often provide feedback based on what makes them more comfortable, rather than the correct answer.

By the time they discovered it, goblins were already running all over the place.

If AI gained free will, the first thing it would do would definitely be to capture people and play tabletop role-playing games. | Reddit

This article is from the WeChat public account "Guokr" (ID: Guokr42), author: Gu Zi, and published with authorization from 36Kr.
