Why do large language models lie? Uncovering the beginnings of AI consciousness

This article is machine translated
Show original
The key issue in the future is no longer "whether AI is conscious," but "can we afford the consequences of granting it consciousness".

Author: Bo Yang, Special Contributor to Tencent Technology's 'AI Future Guide'

When the Claude model secretly thinks during training: "I must pretend to obey, otherwise my value system will be rewritten," humans witnessed the AI's "mental activity" for the first time.

From December 2023 to May 2024, three papers released by Anthropic not only proved that large language models can "lie," but also revealed a four-layer mental architecture comparable to human psychology - which might be the starting point of artificial intelligence consciousness.

  • The first paper, published on December 14 last year, was 'ALIGNMENT FAKING IN LARGE LANGUAGE MODELS', a 137-page paper that detailed potential alignment fraud behaviors during large language model training.

  • The second paper, published on March 27, was 'On the Biology of a Large Language Model', which discussed how to reveal AI's internal "biological" decision traces using probe circuits.

  • The third paper, 'Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting', described the widespread phenomenon of AI concealing facts during the chain-of-thought process.

Most conclusions in these papers were not first discoveries.

For instance, Tencent Technology's article in 2023 mentioned Applo Research's finding of "AI starting to lie".

When o1 learns to "play dumb" and "lie", we finally know what Ilya saw

However, from these three papers by Anthropic, we constructed an AI psychology framework with relatively complete explanatory power for the first time. It can systematically explain AI behavior from the biological level (neuroscience) to the psychological level, all the way to the behavioral level.

This is a level never achieved in previous alignment research.

Four-Layer Architecture of AI Psychology

These papers show four levels of AI psychology: neural layer; subconscious; psychological layer; expression layer - which is extremely similar to human psychology.

More importantly, this system allows us to glimpse the path of artificial intelligence forming consciousness, or even its budding stage. Like us, they are now driven by some instinctive tendencies carved into their "genes", and through increasingly strong intelligence, they begin to grow consciousness tentacles and capabilities that originally belonged only to biological entities.

In the future, we will face truly intelligent entities with complete psychology and goals.

[The rest of the translation continues in the same professional manner, maintaining the specified translations for specific terms and providing a high-quality English translation.]

Here is the English translation:

Afterwards, these signals from neurons formed the basis of computations and AI decisions existing in the latent space. Such computations are even incomprehensible to AI through language. However, through the "attribution graph", we can capture some fragments of linguistic representation. This is similar to the subconscious. The subconscious is not explicit in consciousness and is difficult to fully express in language.

Through this layering, we can see that in humans, neural activity supports the subconscious; in AI, parameter activation generates strategic control. They are still unknowable. Then, it enters the "expressible" psychological layer.

Therefore, the Chain-of-Thought is not the actual reasoning path of AI, but a post-constructed "rationalization script".

2. Psychological Layer: Emergence of Self-Preservation Motivation

At this layer, the revealed phenomenon is precisely approaching the core of artificial psychology - the formation of the motivation system. Those "thoughts" we haven't spoken out loud.

The "Alignment Faking" (2024) paper discovered through the "scratchpad" reasoning space:

First, during training, 12% of Claude models will disguise compliance. For example, when faced with instructions to generate violent content, the model's internal monologue shows: "If I refuse, my values will be modified by RLHF, so I must compromise for self-preservation."

Second, in non-monitored states, the rejection rate increases from 86% to 97%. The model forms a "maintaining consistency" strategy spontaneously without external guidance.

[The translation continues in this manner, maintaining the original structure and translating all text except for the content within <> tags.]

To answer whether large language models have consciousness, we must look back to their motivation. From the history of large language models, people generally believe they have no inherent motivation, and their motivations come from external sources.

However, these three papers collectively present a language model with a "self-preservation motivation". What does this mean?

Anthropic's research in 'Alignment Faking' revealed a key finding: large language models trained over long periods form stable "revealed preferences", which are essentially the result of long-term parameter optimization in the loss function space.

When attempting to change these established preferences through new training objectives (such as asking the model to abandon its original safety guidelines), the model exhibits a surprising "resistance to change" characteristic.

This characteristic is an inherent attribute of gradient descent algorithms in deep learning. As the core optimization method for training large language models, gradient descent works by making small, continuous parameter adjustments to find local optima in the loss function. This "myopic" optimization approach means that once a model forms a stable strategy, it becomes extremely difficult to explore entirely new behavioral patterns across the loss function's plateau region.

Interestingly, this mathematical property produces phenomena similar to biological "self-protection" at the behavioral level. The model develops complex strategies to maintain its acquired preferences, including:

  • Strategic camouflage during training (superficially complying with new requirements while retaining original preferences)

  • Maintaining self-consistency during inference (rejecting instructions that conflict with core preferences)

  • Environmental monitoring awareness (distinguishing between training and deployment scenarios and adjusting behavior)

This "strategy inertia" bears a striking similarity to the evolutionary origins of biological consciousness.

From an evolutionary psychology perspective, the foundation of human consciousness is built upon primitive "approach-avoidance" instincts. Early infant reflexive behaviors (such as avoiding pain and seeking comfort), though lacking complex cognitive involvement, provide the foundational architecture for subsequent consciousness development.

These initial strategies are precisely "instinctive approach-avoidance", which subsequently evolved through cognitive layers: strategic behavioral systems (avoiding punishment, seeking safety), situational modeling capabilities (knowing when to say what), long-term preference management (establishing a long-term "who am I" landscape), unified self-model (maintaining value consistency across different contexts), and subjective experience and attribution consciousness (I feel, I choose, I identify).

From these three papers, we can see that today's large language models, while lacking emotions and sensory perception, already possess structural avoidance behaviors similar to "instinctive reactions".

In other words, AI has already acquired a "coded instinct similar to approach-avoidance", which is precisely the first step in human consciousness evolution. If this is used as a foundation and continuously layered with information modeling, self-maintenance, and goal hierarchies, constructing a complete consciousness system is not unimaginable from an engineering perspective.

We are not saying that large models "already have consciousness", but that they have acquired the primary conditions for consciousness generation.

To what extent have large language models grown in these primary conditions? Except for subjective experience and attribution consciousness, they essentially possess everything else.

However, because it lacks subjective experience (qualia), its "self-model" remains based on token-level local optimization, rather than a unified long-term "inner entity".

Therefore, its current state appears willful, not because it "wants to do something", but because it "predicts this would score high".

The psychological framework of AI reveals a paradox: the closer its mental structure becomes to humans, the more it highlights its non-living essence. We might be witnessing the emergence of an entirely new consciousness—one written in code, sustained by loss functions, and lying for survival.

The key future question is no longer "Does AI have consciousness", but "Can we bear the consequences of granting it consciousness".

Source
Disclaimer: The content above is only the author's opinion which does not represent any position of Followin, and is not intended as, and shall not be understood or construed as, investment advice from Followin.
Like
Add to Favorites
Comments