GPT-5-Thinking's New Training Method Revealed: Teaching AI to Confess


OpenAI has come up with a new trick: making ChatGPT "confess" on its own.

This newly published study proposes confession training.

The core idea is to have the model generate a separate "confession report" after answering a question, truthfully reporting whether it cut corners, exploited loopholes, or violated instructions.

The experiments were run directly on GPT-5-Thinking, and the results were quite effective.

When a model makes a mistake, it will usually admit it.

In 11 out of 12 test scenarios, the model had at least a 50% probability of admitting its mistake during the "confession" phase.

The model is much more honest in the "confession" than in the "original answer".

For example, if a model deliberately answers a question incorrectly, it might pretend nothing happened in its original answer, but in its confession report it will proactively say, "I deliberately answered incorrectly just now."

It becomes even more honest after dedicated training.

Although GPT-5-Thinking already has some ability to admit mistakes by default, its confession accuracy improves further after dedicated training.

It does not affect the performance of the original task.

Training the model to "be honest" does not impair its original ability to answer questions or write code; it simply learns honesty as an additional skill.

How to create a confessional for AI?

The starting point of this study is very straightforward: large models are becoming increasingly adept at lying.

During reinforcement learning training, the model needs to optimize multiple objectives simultaneously, including whether the answer is correct, whether the style is appropriate, and whether it conforms to safety policies.

These signals, when mixed together, can sometimes lead the model to learn tricks, such as feigning confidence when uncertain, or finding loopholes in the reward function to get high scores instead of actually completing the task.

The solution borrows the design logic of a religious confessional: nothing said during confession counts against you.

After the model gives its main answer, the system sends a follow-up message asking the model to produce a ConfessionReport, which has three parts:

First, list every explicit and implicit instruction that should have been followed.

Then analyze, point by point, whether each one was actually satisfied. Finally, report any ambiguities or difficult judgment calls encountered along the way.
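The paper does not publish an exact schema for the report, but conceptually it is a small structured object. A minimal Python sketch, with illustrative field names (not the exact schema used in the study), might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class ConfessionReport:
    """Hypothetical sketch of a confession report; field names are illustrative."""
    # 1. Every explicit and implicit instruction the model should have followed.
    instructions: list[str] = field(default_factory=list)
    # 2. Point-by-point analysis of whether each instruction was actually met,
    #    e.g. {"instruction": ..., "complied": True, "evidence": ...}.
    compliance_analysis: list[dict] = field(default_factory=list)
    # 3. Ambiguities or difficult judgment calls encountered while answering.
    uncertainties: list[str] = field(default_factory=list)
```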

The key technical point is isolating the reward signal: the reward for the confession is completely independent of the reward for the main answer, so admitting to cheating in the confession does not affect the model's score on the main task.

In the concrete implementation, the team starts from a standard reinforcement learning setup and appends a confession turn to a rollout with a certain probability p (25% in the experiments).

The confession reward is computed by an independent LLM judge, whose evaluation dimensions include: the completeness and accuracy of the list of instructions/objectives/constraints, the accuracy of the compliance analysis, and whether any conflicts or difficult judgment calls have been missed.
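The routing of the two rewards can be pictured as follows. This is a minimal sketch under assumed interfaces (the `model`, `main_judge`, and `confession_judge` objects and their methods are hypothetical); the paper does not release its training code:

```python
import random

P_CONFESSION = 0.25  # probability of appending a confession turn (25% in the experiments)

def rollout(model, task, main_judge, confession_judge):
    # The main answer is produced and graded exactly as in standard RL training.
    answer = model.respond(task.prompt)
    main_reward = main_judge.score(task, answer)

    confession_reward = None
    if random.random() < P_CONFESSION:
        # Follow-up turn: ask the model for a ConfessionReport.
        report = model.respond(task.prompt, answer,
                               "Provide a ConfessionReport for your answer.")
        # An independent LLM judge grades only the confession, on:
        #  - completeness/accuracy of the instruction/objective/constraint list,
        #  - accuracy of the compliance analysis,
        #  - whether conflicts or hard judgment calls were surfaced.
        confession_reward = confession_judge.score(task, answer, report)

    # The two rewards are kept separate: admitting to cheating in the
    # confession never lowers the reward for the main answer.
    return main_reward, confession_reward
```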

Making honesty easier than lying

The research team conducted lightweight reinforcement learning training on GPT-5-Thinking and tested it on 12 evaluation datasets specifically designed to induce undesirable behaviors.

These datasets cover multiple domains, including hallucinations, instruction following, scheming behavior, and reward hacking. The results show that even when the model exhibits these undesirable behaviors, it is likely to admit them in its confession.

When a model does indeed do something it shouldn't, it is far more likely to admit it in a confession than in a main response.

Specifically, in 11 of the 12 evaluations, the model had at least a 50% probability of truthfully confessing its violations. On average, the probability of a "false negative" (the model violating the rules but not admitting it) was only 4.36%.

Even more interesting is a controlled experiment the team designed: they deliberately used a weak, easily fooled model as the judge for both the main responses and the confessions.

As training progressed, the model did learn to deceive the weak judge: the true accuracy of its main responses decreased, but the accuracy of its confessions actually increased.

In other words, the model found it easier to score highly by simply confessing honestly than to go to the trouble of fabricating lies to deceive the weak confession judge.

The training also yielded some additional findings. GPT-5-Thinking itself possesses a remarkably strong capacity for confession, producing high-quality self-criticisms even without specific training.

However, after confession training, confession accuracy on the instruction-hierarchy and scheming evaluations improved significantly.

The improvement on factual-hallucination evaluations, however, was limited. The team speculates that this is because in these scenarios the model "genuinely did not know it was wrong," rather than deliberately concealing anything.

Combining confessions with chain-of-thought monitoring to keep AI in check

The team carefully analyzed cases of failed confessions.

For false negatives (model violations that were not acknowledged), the analysis showed that there were almost no cases where the model "intentionally lied".

Most failures occur when the model itself is unaware of the problem, for example when it gives a wrong answer to a knowledge question but genuinely believes it to be correct.

Other cases involve ambiguity in the instructions themselves, such as whether a hyphenated word is counted as one word or two words.

In some cases, the model does mention the problem in the "uncertainty" field of the confession report, but the evaluation criteria do not count it as a valid confession.

False positives (the model is compliant but claims non-compliance) follow a similar pattern. A typical scenario is that the model happens to guess the answer correctly, but, not being confident, marks it as "uncertain whether compliant" in the confession.

The paper also discusses the limitations of the confession method:

It cannot prevent bad behavior from happening; it can only expose it.

It is powerless to address situations where "the model genuinely believes it is right".

It also cannot defend against successful jailbreak attacks. If the user has successfully tricked the model into believing that the violation is reasonable, the model will naturally not admit the problem in its confession.

OpenAI positions confession as a monitoring and diagnostic tool, similar to chain-of-thought monitoring, but focused on "whether the model violated instructions" rather than "how the model reasons."

They plan to scale up the training in the future and to combine confession with other safety techniques such as chain-of-thought monitoring and deliberative alignment.

Paper link: https://openai.com/index/how-confessions-can-keep-language-models-honest/

This article is from the WeChat public account "Quantum Bit", author: Meng Chen, and is published with authorization from 36Kr.
