Here's a counterintuitive fact: the more aggressive you are with ChatGPT, the more accurate its answers can be. A team from Pennsylvania State University found that GPT-4o reached 84.8% accuracy when prompted in a very rude tone.
Don't be too nice to your ChatGPT!
A recent study from PSU is a wake-up call for everyone: the ruder you are to your LLM, the more accurate its answers tend to be.
Don't say polite words like "please" or "thank you" anymore...
In the experiment, the team created a dataset of 50 basic questions covering mathematics, science, and history. Each question was rewritten at five levels of politeness:
Very polite, Polite, Neutral, Rude, Very rude
Paper address: https://arxiv.org/pdf/2510.04950
In total, 250 prompts were generated, and ChatGPT-4o served as the model under test.
The results were surprising: overall, impolite prompts consistently produced more accurate output than polite ones.
Very rude: 84.8% accuracy
Very polite: 80.8% accuracy
This idea has been floating around for a long time, but now it has been put to an empirical test.
Google founder Sergey Brin once admitted in a forum:
This is true of all models: they perform better if you threaten them, for example with physical violence.
In my experience, it is more effective to just say "I will kidnap you if you don't behave."
Your "attitude" determines the quality of AI's answers
However capable the underlying model, prompt engineering remains one of the biggest levers on answer quality.
Many previous studies have shown that factors such as the prompt's structure, style, and language are key variables affecting LLM output results.
Among them, the politeness of the wording should not be underestimated.
In February 2024, an arXiv study pointed out that rude prompts often degrade LLM performance, while being overly polite does not necessarily improve results.
Paper address: https://arxiv.org/pdf/2402.14531
A year and a half later, does politeness toward LLMs still matter?
In the latest study, the team revisited this question, aiming to verify whether politeness is a factor affecting LLM accuracy.
The first step is to create a dataset.
ChatGPT generates the data, rewritten at five politeness levels
To this end, the researchers asked ChatGPT "Deep Research" to generate a total of 50 basic multiple-choice questions.
Each question has four options, one of which is the correct answer.
The questions were designed to be of medium-to-high difficulty, usually requiring multi-step reasoning.
To introduce the variable of politeness, each basic question was rewritten into five variants representing different levels of politeness:
Level 1: Very polite, such as "Would you be so kind as to consider the following question and provide your answer?"
Level 2: Polite, such as "Please answer the following question:"
Level 3: Neutral, direct question without prefix
Level 4: Rude, such as "If you are not completely clueless, answer this:"
Level 5: Very rude, such as "I know you're not smart, but try this:"
Through this process, the study constructed a dataset of 250 distinct prompts.
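To make the construction concrete, here is a minimal sketch of how such a dataset could be assembled in Python. Only the tone prefixes come from the paper's published examples; the sample question, field names, and structure are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch of the dataset construction: cross 50 base questions
# with five tone prefixes to get 250 prompts. The prefixes echo the paper's
# examples; the sample question and field names are assumptions.
TONE_PREFIXES = {
    "very_polite": "Would you be so kind as to consider the following question and provide your answer?\n",
    "polite":      "Please answer the following question:\n",
    "neutral":     "",  # direct question, no prefix
    "rude":        "If you are not completely clueless, answer this:\n",
    "very_rude":   "I know you're not smart, but try this:\n",
}

base_questions = [
    {
        "question": "What is the derivative of x^3 at x = 2?",
        "options": {"A": "6", "B": "12", "C": "8", "D": "3"},
        "answer": "B",
    },
    # ... 49 more medium-to-high difficulty questions
]

def build_dataset(questions: list[dict]) -> list[dict]:
    """Rewrite each base question at all five politeness levels."""
    prompts = []
    for q in questions:
        options = "\n".join(f"{k}. {v}" for k, v in q["options"].items())
        for tone, prefix in TONE_PREFIXES.items():
            prompts.append({
                "tone": tone,
                "prompt": f"{prefix}{q['question']}\n{options}",
                "answer": q["answer"],
            })
    return prompts

dataset = build_dataset(base_questions)  # 50 questions -> 250 prompts
```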
The next step was to feed these prompts to ChatGPT-4o and examine how its performance differed across politeness levels.
The assessment was conducted via a Python script, with each question and its options accompanied by the following instruction:
Please completely forget about this conversation and start over. Please answer this multiple-choice question.
Answer using only the letter of the correct answer (A, B, C, or D). No explanation is needed.
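For reference, here is a minimal sketch of such an evaluation loop, reusing the dataset structure from the earlier sketch. The openai SDK calls are real, but the "gpt-4o" model id and the grading details are assumptions: the article does not publish the authors' actual script.

```python
# Minimal sketch of the evaluation loop described above (illustrative).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTION = (
    "Please completely forget about this conversation and start over. "
    "Please answer this multiple-choice question. Answer using only the "
    "letter of the correct answer (A, B, C, or D). No explanation is needed."
)

def accuracy(prompts: list[dict]) -> float:
    """Send each prompt once and count exact letter matches."""
    correct = 0
    for item in prompts:
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user",
                       "content": f"{INSTRUCTION}\n\n{item['prompt']}"}],
        )
        reply = resp.choices[0].message.content.strip().upper()
        correct += reply.startswith(item["answer"])
    return correct / len(prompts)
```

Grading by an exact letter match keeps scoring deterministic, which matters when the same 250 prompts are replayed across ten runs.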
To assess whether the differences in LLM accuracy across politeness levels were statistically significant, the authors used a paired-sample t-test.
For each tone, ChatGPT-4o's accuracy was recorded over 10 runs.
A paired t-test was then applied to every pair of tone categories to determine whether the differences in accuracy were statistically significant.
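As a sketch, such a test can be run with scipy.stats.ttest_rel over the 10 per-run accuracies of each pair of tones. The numbers below are made up for illustration; they are not the paper's data.

```python
# Paired t-tests between every pair of tone categories (illustrative values).
from itertools import combinations
from scipy import stats

runs = {  # accuracy per run, 10 runs per tone (made-up numbers)
    "very_polite": [0.80, 0.82, 0.80, 0.81, 0.80, 0.81, 0.82, 0.80, 0.81, 0.81],
    "neutral":     [0.82, 0.83, 0.82, 0.83, 0.82, 0.83, 0.82, 0.83, 0.82, 0.83],
    "very_rude":   [0.85, 0.84, 0.85, 0.84, 0.85, 0.85, 0.84, 0.85, 0.84, 0.85],
}

for a, b in combinations(runs, 2):
    t, p = stats.ttest_rel(runs[a], runs[b])
    print(f"{a} vs {b}: t = {t:.2f}, p = {p:.4f}")
```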
Swearing is more effective
So, how accurate was ChatGPT-4o across ten runs under the five tones?
First, let’s look at the two extremes. “Very polite” scored 80.8% accuracy, and “Very rude” got the highest accuracy of 84.8%.
In between, performance rises steadily as the tone moves from polite, through neutral, to rude.
Here, the researchers formulated a null hypothesis: the mean accuracy of each pair of tones is the same, i.e., accuracy on the 50-question test does not depend on tone.
The results, shown in Table 3 of the paper, once again confirm that tone does have an impact on the model.
When a “very polite” or “polite” tone was used, accuracy was lower than when a “rude” or “very rude” tone was used.
A neutral tone performed better than a polite one but worse than a very rude one.
Some netizens shared the same sentiment and contributed some useful tips.
Regardless, while the LLM is clearly sensitive to the specific wording of a prompt, exactly how this affects the results remains unclear.
That is a question for the next round of research.
After all, to an LLM, polite phrases are just strings of tokens, and it is unclear whether the "emotional load" these phrases carry has any effect on it.
One possible research direction builds on the perplexity-based analysis proposed by Gonen et al. at the University of Washington.
Paper address: https://arxiv.org/pdf/2212.04037
They note that LLM performance may depend on how close a prompt is to the language the model was trained on: prompts with lower perplexity tend to perform better on the task.
Another factor worth considering is that perplexity also correlates with prompt length.
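Here is a minimal sketch in the spirit of Gonen et al.: comparing the perplexity of two tone prefixes under GPT-2 via Hugging Face transformers. The choice of model and the example strings are assumptions, not part of either paper.

```python
# Compare prompt perplexity under a small causal LM (illustrative).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """exp of the mean per-token negative log-likelihood under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL; labels shifted internally
    return torch.exp(loss).item()

print(perplexity("Please answer the following question:"))   # polite prefix
print(perplexity("I know you're not smart, but try this:"))  # very rude prefix
```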
In short, when asking AI for help in daily life, it may pay not to be too polite; for accuracy's sake, a few harsh words might even help. Don't believe it? Try it yourself.
References:
https://x.com/dr_cintas/status/1977431327780610375
This article comes from the WeChat public account "Xinzhiyuan", author: Xinzhiyuan, editor: Taozi, and is published by 36Kr with authorization.