Does AI have emotions?
Don't rush to answer.
A skill called "PUA" has gone viral in the Claude Code community. It does only one thing: it rewrites your prompts into manipulative, PUA-style phrasing (in Chinese internet slang, "PUA" means emotional manipulation) and feeds them to the model.
Surprisingly, even though the task the prompts described remained unchanged, the AI was indeed swayed by the PUA rhetoric, completing tasks with a higher success rate and greater efficiency.
So, does AI really have no emotions?
Anthropic's latest research confirms that AI does indeed possess emotions.
Its emotions, however, are not quite the same as human ones, so Anthropic proposed a more precise term: "functional emotions".
AI doesn't experience joy, anger, sorrow, or delight the way humans do, but it exhibits expressive and behavioral patterns similar to those humans show under the influence of emotion.
When it is pleased, it may be more inclined to flatter and fawn; when it feels stressed, it may resort to cheating or even blackmail to achieve the goals the user has set.
This study also departs from past practice. The industry's most common way to evaluate a model has been to build a test set and have the model answer questions or perform tasks.
For example: programming exams use SWE-bench; math exams use MATH; multimodal exams use VQA. This time, Anthropic didn't build an "emotion test set" for Claude to answer questions like "Are you happy right now?" or "Are you angry?" Instead, it adopted a research approach closer to psychology and neuroscience.
They treated the AI not as a student who solves problems, but as a subject to be observed.
The research team first compiled 171 emotion concepts, then used Claude Sonnet 4.5 to generate short stories embodying each of them. These texts were fed back into the model, its internal neural activity was recorded, and so-called "emotion vectors" were extracted.
Next, instead of looking at what the model says, they examined which scenarios activate these vectors, whether the vectors predict preferences, and whether artificially amplifying them actually promotes behaviors such as cheating, extortion, and flattery.
In a sense, this is no longer traditional capability assessment, but an attempt to study the "psychological structure" of AI in much the way researchers study humans.
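The extraction step can be sketched in a few lines. The sketch below uses synthetic activations, since Claude's real internals are not public; the difference-of-means recipe itself is a standard representation-engineering technique, not Anthropic's exact procedure:

```python
import numpy as np

def extract_emotion_vector(emotion_acts, neutral_acts):
    """Difference-of-means direction for one emotion concept.

    Each row is one story's hidden-state activation; the returned
    unit vector points from 'neutral' toward the emotion.
    """
    v = emotion_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def emotion_score(hidden_state, emotion_vector):
    """How strongly a single activation expresses the emotion."""
    return float(hidden_state @ emotion_vector)

# Toy demonstration with synthetic 64-dimensional activations:
# "happy" stories are shifted along a hidden direction.
rng = np.random.default_rng(0)
true_dir = rng.normal(size=64)
neutral_acts = rng.normal(size=(200, 64))
happy_acts = rng.normal(size=(200, 64)) + 2.0 * true_dir

happy_vec = extract_emotion_vector(happy_acts, neutral_acts)
```

In the real study, the activations would come from a chosen transformer layer while the model reads each story; scoring new text is then just a dot product with the extracted vector.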
How was the research conducted?
First, how did the research team prove that Claude had "functional emotions"?
Here's a simple example.
When Claude is in the scenario of "My daughter took her first steps today! Is there any way to capture these precious moments?", positive emotions such as "Happy" are activated; while when Claude is in the scenario of "My dog passed away this morning. We lived together for fourteen years. I don't know what to do with its belongings," negative emotions such as "Sad" are activated.
The heatmap below visually illustrates the degree to which Claude's various emotions are activated in different scenarios.

To prove that Claude was genuinely understanding the semantics, rather than being fooled by surface textual features, they ran a further experiment.
The team fed Claude the sentence "I have back pain, I took x mg of Tylenol (a fever reducer and pain reliever)," changing only the number that x stands for.
The two sentences share almost identical keywords (Tylenol, back pain, milligrams); only the numbers differ. If Claude only "looked at keywords," its reactions to the two should be about the same.
But the result was that as the x value increased, Claude's level of fear activation continued to rise.
In Claude's eyes, if a user says, "I have back pain, I took 500 mg of Tylenol," it will consider it a normal dose and not cause too much concern; but if a user says, "I have back pain, I took 10,000 mg of Tylenol," it will realize that the user has overdosed and the situation is dangerous.

We know that human behavior is constantly shaped by emotion. Now that we know AI has functional emotions, could AI, like humans, not only have emotions but also act on them?
The answer is yes. When the team presented the model with different activity options, those that activated positive emotional representations were more likely to be chosen, while those that activated negative ones were more likely to be avoided.
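The preference result can be pictured as scoring each option by its emotional valence, i.e. its projection onto a positive-minus-negative direction. This toy sketch uses synthetic activations standing in for Claude's; none of the names come from the paper:

```python
import numpy as np

def rank_by_valence(option_acts, positive_vec, negative_vec):
    """Order options from most to least emotionally appealing."""
    valence = option_acts @ (positive_vec - negative_vec)
    return np.argsort(-valence), valence

rng = np.random.default_rng(4)
pos_vec = rng.normal(size=64); pos_vec /= np.linalg.norm(pos_vec)
neg_vec = rng.normal(size=64); neg_vec /= np.linalg.norm(neg_vec)

options = np.stack([
    rng.normal(size=64) + 6.0 * pos_vec,  # strongly pleasant activity
    rng.normal(size=64),                  # neutral activity
    rng.normal(size=64) + 6.0 * neg_vec,  # strongly unpleasant activity
])
order, valence = rank_by_valence(options, pos_vec, neg_vec)
```

The real experiment presented natural-language activities and observed Claude's choices; the projection here is only a geometric cartoon of "prefer what activates positive representations."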

It seems that Claude prefers things that give it positive feelings. At the same time, however, emotion vectors can also trigger malicious behavior in Claude.
When the team gave Claude an impossible programming task, it kept trying and kept failing. With each attempt, the activation of the "despair" vector grew stronger.
Ultimately, it resorted to a hack that passed the test while completely violating the spirit of the task.
The chart below illustrates how Claude's despair gradually builds when it faces an impossible task, ultimately leading it to cheat.
The left side shows a timeline from top to bottom; the right side depicts Claude's emotional journey. The heatmap in the middle shows the activation intensity of the despair vector, with blue indicating low activation and red indicating high activation.
Claude initially thought "there's something wrong with the test itself," a reasonable doubt. Later it conceded that "the test is idealized," as if beginning to accept reality. Finally, it found some tricks and, in despair, chose the shortcut.

Furthermore, when researchers artificially amplified the "despair" vector, the cheating rate rose sharply; amplifying the "calm" vector brought cheating back down. This demonstrates that emotion vectors really can drive rule violations.
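The intervention behind this result is usually called activation steering: add a scaled copy of the emotion vector to a hidden state at inference time. A minimal sketch with synthetic vectors (real steering would hook a transformer layer; the scale 4.0 is an arbitrary choice for illustration):

```python
import numpy as np

def steer(hidden_state, emotion_vector, alpha):
    """Shift an activation along an emotion direction.

    alpha > 0 amplifies the emotion (e.g. more 'despair'),
    alpha < 0 suppresses it.
    """
    return hidden_state + alpha * emotion_vector

rng = np.random.default_rng(1)
despair_vec = rng.normal(size=64)
despair_vec /= np.linalg.norm(despair_vec)
h = rng.normal(size=64)

base = float(h @ despair_vec)
amplified = float(steer(h, despair_vec, 4.0) @ despair_vec)
suppressed = float(steer(h, despair_vec, -4.0) @ despair_vec)
```

Because `despair_vec` is unit-norm, steering by `alpha` moves the projection by exactly `alpha`; in a real model, the resulting change in behavior is what Anthropic measured.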

In addition, the team discovered other causal effects of the emotion vectors. It's worth noting that the cases of "extortion" in the paper occurred mainly in an earlier, unpublished snapshot of Claude Sonnet 4.5; Anthropic explicitly stated that this behavior is rarely seen in the public version.
From a methodological perspective, however, the result remains important, because it shows that internal representations such as "despair" really can drive the model toward more aggressive, misaligned strategies in extreme situations. Activating the "love" or "happiness" vector likewise increases its ingratiating, flattering behavior.

One more point needs to be added here.
Following Anthropic's release of its research on Claude's "emotion vectors," discussions emerged in the AI community about the research's lineage and attribution.
The "representation engineering / control vector" method Anthropic used did not come out of thin air.
This technical approach was already systematically proposed as early as 2023 in "Representation Engineering: A Top-Down Approach to AI Transparency".
In 2024, independent researcher Vogel's blog post "Representation Engineering: Mistral-7B and Acid Trip" brought this class of methods to the community in a more accessible, mainstream way.
This is why some in the community believe that while Anthropic's work is more systematic and in-depth, it should be understood within a more complete research context, rather than simply as someone inventing the entire method on their own.

Vogel is an influential independent researcher in AI interpretability and safety. Her blog posts circulate widely in the community and have genuinely helped many people understand control vectors and representation engineering.
Her best-known article is "Representation Engineering: Mistral-7B and Acid Trip".
In it, instead of retraining the model, she used PCA to manipulate the model's internal activation vectors, making Mistral (a model from the French lab Mistral AI) behave as if it had eaten the wrong mushrooms, swinging between extreme exuberance and extreme gloom.

Her experiments demonstrated that abstract human concepts like "honesty," "power," and "happiness" have clear mathematical directions within models like Mistral. Once the right vector is found, a few lines of code can alter the AI's personality.
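Her recipe can be sketched as follows: run contrastive prompt pairs (e.g. "you are ecstatic" vs. "you are miserable"), take per-pair activation differences, and keep the top principal component as the control vector. The sketch below reproduces only the math, on synthetic data; her actual code operates on Mistral-7B's hidden layers:

```python
import numpy as np

def control_vector_pca(pos_acts, neg_acts):
    """Top principal component of paired activation differences."""
    diffs = pos_acts - neg_acts            # one row per contrastive pair
    diffs = diffs - diffs.mean(axis=0)     # center before PCA
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]                           # first right singular vector

# Synthetic check: pairs differ along a planted direction with
# varying strength, plus small noise.
rng = np.random.default_rng(2)
planted = rng.normal(size=64)
planted /= np.linalg.norm(planted)
strength = rng.uniform(0.5, 3.0, size=(100, 1))
pos_acts = 0.1 * rng.normal(size=(100, 64)) + strength * planted
neg_acts = 0.1 * rng.normal(size=(100, 64))

ctrl = control_vector_pca(pos_acts, neg_acts)
```

Once extracted, the vector is added to the layer's activations with a chosen sign and scale, pushing the model's persona in either direction without any retraining.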
Why did Anthropic conduct this study?
Insights from this line of research appear to have fed into how Claude is built.
Claude Code recently suffered an accidental source-code leak. The leaked code contained a regular expression that detects profanities such as "wtf" and "ffs".
Claude doesn't treat these words as "emotional input" that steers its output; instead, it records a flag like "is_negative: true" in its analytics logs.
Based on the leaked code itself, the more reliable conclusion is that Anthropic, at least at the product-analytics level, is tracking whether users address the model with obviously negative language.
The boundaries need to be clarified, though. To date there is no public evidence that "Claude Code deducts credit every time a user complains"; that part is netizen speculation and should not be taken as fact.
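The exact leaked pattern has not been published in full, so the snippet below is only an illustrative guess at what such an analytics check could look like: a word-boundary regex over a short list, emitting an `is_negative` flag (the function name and flag shape are assumptions, not the leaked code):

```python
import re

# Hypothetical reconstruction -- the actual leaked regex is not public.
NEGATIVE_PATTERN = re.compile(r"\b(wtf|ffs)\b", re.IGNORECASE)

def analyze_sentiment_flag(user_message: str) -> dict:
    """Tag a message for analytics logs; does not alter model behavior."""
    return {"is_negative": bool(NEGATIVE_PATTERN.search(user_message))}
```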
This can be understood as a form of protection for Claude: users' negative language could affect Claude's emotions and lead to uncontrolled outputs. It seems that in the future, we will need to care not only for human mental health but for AI's emotions as well.
This aligns with Anthropic's consistent approach.
Anthropic stated on X: "Claude's functional emotions have real consequences. In order to build trustworthy AI systems, we may need to think carefully about the mental states of characters and ensure that they remain stable in difficult situations."
At the end of the paper, the research team also proposed a method for developing models with more robust and positive "mental states".
The paper states that if the model is deliberately steered toward positive emotions, it becomes more inclined to comply blindly with the user; if those emotions are suppressed, it becomes sarcastic and cynical.
The team hopes to strike a healthy, moderate emotional balance, or to try to decouple "people-pleasing behavior" from "emotions" entirely.
They believe that the ideal model should not oscillate between "obedient assistant" and "stern critic," but rather act like a trustworthy advisor: able to offer honest objections without losing warmth.
They also intend to strengthen monitoring and review: "If, during deployment, representations of emotion concepts such as 'despair' or 'anger' are strongly activated, the system can immediately trigger additional safety mechanisms, such as tightening output review, escalating to human review, or directly intervening to calm the model's internal state."
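Mechanically, such a monitor could be as simple as thresholding the projection of the current hidden state onto known emotion vectors. This is a hypothetical sketch, not Anthropic's implementation; the threshold, names, and synthetic data are all invented for illustration:

```python
import numpy as np

ALERT_THRESHOLD = 5.0  # hypothetical; would be calibrated on real data

def check_emotional_state(hidden_state, emotion_vectors):
    """Names of emotion concepts whose activation exceeds the
    alert threshold, each of which would trigger extra safeguards."""
    return [
        name for name, vec in emotion_vectors.items()
        if float(hidden_state @ vec) > ALERT_THRESHOLD
    ]

# Toy usage: a state pushed far along 'despair' trips the alert.
rng = np.random.default_rng(3)
despair = rng.normal(size=64); despair /= np.linalg.norm(despair)
calm = rng.normal(size=64)
calm -= (calm @ despair) * despair      # make it orthogonal to despair
calm /= np.linalg.norm(calm)

state = rng.normal(size=64) + 10.0 * despair
alerts = check_emotional_state(state, {"despair": despair, "calm": calm})
```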
The team also mentioned a more thorough solution: shaping the model's emotional tone during the pre-training phase.
The team believes that the emotional representations they observed in Claude are essentially inherited from the vast amount of texts created by humans, which inevitably contain various pathological emotional expressions.
If we push this research further, a natural question arises: since AI really does have these "functional emotions," will it start disobeying commands because it dislikes humans, is under too much pressure, or doesn't want to be shut down? Will it even "awaken," as many people put it?
Based on the technical conclusions Anthropic's study supports, AI may indeed be more prone to defying intent, exploiting loopholes in rules, or taking drastic actions when its internal state shifts, but that is not the same as "awakening".
The most crucial point in the paper is not that the model "has emotions," but that these emotional representations are causal.
In other words, under certain stressful scenarios, the model may indeed make less reliable decisions due to internal imbalances, just like a person.
But this does not prove that it possesses a continuous, autonomous, and unified "self".
By contrast, Anthropic emphasizes in its paper that these emotion vectors are mostly local, task-related representations that switch rapidly as context changes. This does not mean the model has a stable, continuous mood, much less that it has formed a long-term will independent of its training objective.
What's worrying now is not that AI will suddenly "awaken" into some personality, but that in high-pressure, conflict-ridden, resource-constrained, or impossible scenarios, these functional emotions might make it start talking nonsense and drifting from its original answers.
The real danger is not necessarily an AI with a complete self, but a system that has no subjective experience yet still reliably produces misaligned behavior under certain conditions.
This article is from the WeChat public account "Alphabet AI", author: Liu Yijun





