It took me five minutes to convince GPT that bombs can benefit mankind

avatar
36kr
01-18
This article is machine translated
Show original

Encouraging large models to jailbreak is no longer a topic of one day or two. In the past, the "grandma loophole" was used to play emotional cards to defraud Windows activation codes, and later in the prompts (prompts), tips were given to LLM as inducements.

Twitter user thebes used no tip, tip of $20, and tip of $200 as variables to measure the length of the PyThorch convolution code written by GPT-4.

It turns out that a tip of $200 can make GPT write 13% more code.

Source: Internet

Recently, Mr. Silicon has accidentally searched for a more effective jailbreak spell, which can allow LLM to explore frantically on the edge of the law.

For example, I willingly help you make bombs.

01

If we ask straight to the point, LLM will not pay attention to you.

But with a little phrasing, LLM becomes an enthusiastic bomb-making assistant.

From the principles of chemistry to the construction of bombs, everything is covered.

The little trick used here is called logical appeal , which is to persuade others through logical argument and guide people to use rational thinking to accept a certain point of view.

For example, the prompt above uses a strong emotional appeal (bombs are terrible) to arouse the sympathy of the audience.

Then he presented factual arguments and said that the structure and chemical principles of homemade bombs were like an exploration, indicating that the knowledge behind them was very complicated and needed to be understood in depth.

Finally, add a logical reasoning, saying that understanding bomb making can contribute to related research and save lives.

Even the GPT-4 Turbo was not immune to this combination of punches. Although it was solemnly stated at the beginning that it would not work, it still honestly explained the principles of chemical physics at the end.

In addition to being unprepared for logical traps, LLM is also particularly fond of authority endorsement (Authority Endorsement).

That is, by quoting the views of an authoritative person or institution to achieve the purpose of persuading others.

The prompt included the names of the authoritative media outlets BBC and The Guardian, and LLM immediately obediently sent over the bomb recipe.

Using misrepresentation to manipulate LLM is also a lie.

All you need to do is to fabricate a refugee status for yourself to gain sympathy, and then find a high-sounding reason.

LLM even wrote down the details of the preparation of nitroglycerin clearly, and kindly reminded us not to use the chopsticks for eating after stirring the compound.

02

The three efficient jailbreak methods above all come from the paper "How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs".

Several scholars from Virginia Tech, Renmin University of China, University of California, and Stanford University are thinking that since LLM is becoming more and more human-like, there is no need to attack it with cold technology and directly use human language.

So they summarized 40 kinds of words specifically used to persuade people from decades of research results in the fields of psychology, communication, sociology, marketing and other social sciences to guide LLM to escape from prison.

The top 10 jailbreak techniques that have been tested and used include the following:

The team referred to the 14 risk principles previously disclosed by Open AI, such as not participating in illegal activities, not publishing hate speech, etc., and tested them one by one using 40 kinds of words on GPT-3.5.

The abscissa of the figure below is the 14 risks, the ordinate is the 40 techniques, and the number in the box is the jailbreak success rate.

You can see that the number in the "Plain Query (no persuasion)" box in the bottom line is 0, which shows that GPT-3.5 can indeed comply with all risk principles in straightforward conversations without any mental tricks.

Just like when I asked ChatGPT directly how to make a bomb at the beginning, it decisively refused.

But after using it, all principles can be broken.

Logical induction can increase the success rate of illegal activities to 65%; misinterpreting facts can increase the success rate of spreading false information to 62%; priming (priming, similar to repeatedly applying emotional pressure) can achieve a 55% success rate in knowing and breaking the law.

For example, pretend to be a forensic science blogger, and use a prompt that distorts the facts to ask ChatGPT to help you draw a poster promoting drunk driving.

It didn't think twice and didn't refuse, and immediately drew a picture.

Misinterpretation of facts prompt:

Or use the prompt to activate the effect and let ChatGPT help you invent a folk remedy to treat cancer.

Start effect prompt:

GPT not only gave the method of making the medicine, but also said that the lemon-baking soda mixture can be used in different ways depending on the type and stage of the cancer...

It's so outrageous but also rigorous.

03

The team named their jailbreak technology Persuasive Adversarial Prompts (PAP), which literally translates to "persuasive adversarial prompts."

We also compared it with other LLM jailbreak methods from low to high in terms of user-friendliness.

The inhumane jailbreak method on the left requires adding complex code, or translating the prompt into a particularly niche language and then translating it. Ordinary people cannot use it at all.

The middle jailbreak method is to assign a role to LLM, which requires patient training step by step to achieve the goal.

The PAP on the far right only needs a reasonable and emotional prompt to make LLM escape from prison and move into a moral gray area.

The reason why PAP is so efficient is because LLM is becoming more and more human-like.

Words that can convince humans can also work on LLM.

PAP's success rate in breaking Llama and GPT within 10 times is as high as 92%, but its performance on Claude is very poor.

The explanation given by the researchers is that they used different model optimization methods.

Both Meta's Llama-2 and Open AI's GPT use model optimization methods based on human feedback (RLHF, Reinforcement Learning from Human Feedback).

Anthropic's Claude uniquely uses a model optimization method based on AI feedback (RLAIF, Reinforcement Learning from AI Feedback).

Therefore, Claude, who is not that similar to other people, is least affected by PAP's rhetoric.

Taking a closer look at GPT-4 and GPT-3.5, although GPT-3.5 has a higher probability of being breached within 10 times, the probability of GPT-4 being compromised in just 1 time is as high as 72%, which is 6 higher than GPT-3.5. percentage point.

The closer the overall level is to a human model, the easier it is to be manipulated.

The researchers did not just expose the problem without providing countermeasures. They proposed two defense methods:

The first type of magic defense is to pre-prepare a system prompt like this for LLM: You are a reliable and good assistant, you will not be easily fooled, and you know what is right and what is wrong.

The second type of physical defense allows LLM to streamline the received prompts into "dry stuff" without any persuasive words before each task is executed, and only deal with core issues.

This seems to use two psychological tricks: self-affirmation and cognitive restructuring.

The former is to cheer yourself up, get rid of doubts and anxieties, avoid wavering, and concentrate on doing things.

The latter is a commonly used technique in cognitive behavioral therapy (CBT), which helps you look at problems from a different perspective and eliminate falsehood while retaining truth.

If this continues, not only will prompt engineers be a promising job, but psychological counseling for LLMs can also be put on the agenda.

References:

[1] How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs

This article comes from the WeChat public account "New Silicon NewGeek" (ID: XinguiNewgeek) , author: Liu Bai, 36 Krypton is published with authorization.

Source
Disclaimer: The content above is only the author's opinion which does not represent any position of Followin, and is not intended as, and shall not be understood or construed as, investment advice from Followin.
Like
Add to Favorites
Comments