Inverted and displaced Turing tests: GPT-4 is judged more "human" than humans


Researchers at the University of California San Diego used inverted and displaced Turing tests to explore how well humans and AI can tell humans and AI apart from conversations. The results showed that, without active interaction, neither humans nor current large language models can reliably distinguish the two.

AI-generated content is gradually flooding the Internet.

People today are more likely to read and browse AI-generated texts than to have direct conversations with AI.

The classic Turing test gives judges a key advantage: they can adjust the questions in real time to conduct adversarial tests on participants.

But that advantage disappears when people passively consume AI-generated text.

Therefore, researchers from the University of California, San Diego suggest that we need to conduct variations of the Turing test in environments that are closer to reality to determine how well people can distinguish between humans and AI in real-life scenarios.

Paper address: https://arxiv.org/pdf/2407.08853

The study further aims to clarify the following questions:

Can humans reliably distinguish humans from AI simply by observing their conversations?

Can LLMs serve as AI detectors not only for static texts such as articles and paragraphs, but also for dynamic conversations?

Does displacing the Turing test improve or reduce accuracy?

Can the inverted Turing test reveal naive psychology in artificial systems?

And which methods are best suited for AI detection in real-world conversational settings?

The study measures how well humans and large language models perform on this distinction using two variations of the Turing test: the inverted Turing test and the displaced Turing test.

In these variants, GPT-3.5, GPT-4, and human judges decide whether participants are human or AI based on transcripts of Turing-test conversations.

The classic Turing test and its variants

Classic Turing Test

In the classic Turing test, a human judge engages in a plain text conversation with two participants, one human and the other a machine.

If the judges cannot accurately distinguish between humans and computers, then the computer passes the test and can be considered an intelligent agent.

Since Turing's original paper was published, the Turing test has sparked heated debate and played a key role in the understanding and construction of the modern concept of intelligence.

On the other hand, its validity or adequacy as a test of intelligence has been widely questioned.

Regardless of its effectiveness as a test of intelligence, the Turing test remains an important means of assessing the similarities between human and AI writing, and a powerful tool for studying AI deception.

There have been multiple attempts to pass the Turing test over the years, including the Loebner Prize competition between 1990 and 2020, but no system has ever passed the test.

“Human or Not?” was a large-scale social Turing test experiment in which judges were about 60% accurate; a 2024 study reported the first system whose pass rate (54%) was not statistically distinguishable from chance, although it remained below the human baseline (67%).

There are many variations of the Turing test, each offering a different perspective on theory and practice.

Inverted Turing Test

In the inverted Turing test, an AI system plays the role of the judge.

In 1996, Watt proposed the inverted test as a measure of "naive psychology": the idea that humans have an innate tendency to recognize intelligence similar to our own and attribute it to other minds.

The test is passed if the AI system "cannot distinguish between two real people, or cannot distinguish between a human and a machine that passes a normal Turing test, but can distinguish between a human and a machine that can be distinguished in a normal Turing test with a human observer."

Watt argued that by having the AI act as the observer and comparing its verdicts on different participants with the accuracy of human judges, the test could reveal whether the AI has a naive psychology similar to that of humans.

Displaced Turing Test

In the displaced Turing test, the judge reads the transcript of an interactive Turing test previously conducted by another judge (human or AI) and decides whether the witness was a human or an AI.

The new judges are described as “displaced” because they are “on the outside” and did not participate in any interaction with the witness.

This makes it a static Turing test, because the judgment is based on pre-existing, unchanging content, whether generated by humans or by AI.

Unlike interactive judges in the traditional Turing test, who can ask dynamic, flexible, and adversarial questions, judges in a static test can only work from the existing conversation and cannot probe the most informative lines of questioning through interaction.

Static tests have certain limitations in assessing model capabilities, but they are likely to reflect common situations in the real world, as many interactions are read by a wider audience than just the direct parties involved in the conversation.

Statistical methods for detecting AI-generated content

There are currently a variety of statistical methods for detecting AI-generated content. These methods are mainly based on the principle that large language models (LLMs) generate text by sampling from a probability distribution over words:

This sampling process may leave characteristic probabilistic signatures; for example, LLM-generated text tends to have a higher likelihood under a language model than human-written text.

In 2023, Mitchell et al. developed a related metric called “curvature”, which measures the local optimality of a piece of text using small perturbations generated by a masked language model; LLM-generated content tends to be more probable than its nearby perturbations.

In 2024, Mireshghallah et al. found that smaller LLMs tend to be better detectors; among them, the 125M-parameter OPT model performed best, reaching 90% accuracy in detecting GPT-4-generated content.
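To make the log-likelihood idea concrete, here is a minimal sketch (not code from the paper) that scores a short text with the 125M-parameter OPT model via Hugging Face Transformers; the example sentences are placeholders.

```python
# Minimal sketch (not code from the paper): average per-token log-likelihood
# of a text under the 125M-parameter OPT model. Higher (less negative) values
# are taken as weak evidence that the text was generated by an LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.eval()

def avg_log_likelihood(text: str) -> float:
    """Mean log-probability per token of `text` under the model."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels=input_ids, the returned loss is the mean negative
        # log-likelihood per predicted token.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return -outputs.loss.item()

# Placeholder messages, not transcripts from the study.
print(avg_log_likelihood("hey sorry was making dinner, what was the question again?"))
print(avg_log_likelihood("As an AI language model, I am happy to assist you with that request."))
```

A classifier would then compare such scores against a threshold chosen on held-out human and AI texts.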

Various commercial tools have been developed based on these and other computational approaches to text classification. Studies have evaluated the effectiveness of these methods in different settings, with mixed results: while accuracy is high in some domains, the models are often unreliable and may produce false positives.

Inverted Turing Test

In the inverted Turing test, GPT-3.5 and GPT-4 play the role of judges, evaluating transcripts of previous Turing tests to determine whether the participants were human or AI.

The study set out to answer five questions:

1. Is GPT-4's accuracy significantly higher or lower than that of interactive human judges?

2. Does GPT-4 also have difficulty accurately identifying participants?

3. Is the model more likely to judge real humans to be human than to judge the best-performing GPT-4 witness to be human?

4. Is GPT-4 more accurate as a judge on longer transcripts?

5. Is GPT-4 more accurate as a judge than GPT-3.5?

The experiments examine the recognition accuracy of the two models, the impact of transcript length on their judgments, and the role of judge model quality in evaluating transcripts.

Method

The study used transcripts from over 6,000 Turing tests collected in Jones and Bergen's 2023 study.

The AI participants in that study were based on three systems: GPT-3.5, GPT-4, and ELIZA.

The sample was constructed to balance the number of human and AI participants, to keep the proportion of games played by the best-performing GPT-4 witness the same as in the full dataset, and to preserve the distribution of human and AI verdicts for that best-performing witness.

The judge model was asked to classify each participant as human or AI, state its confidence in the verdict, and give the reasoning behind it.
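As a rough illustration of how such a judging call might be set up with the OpenAI Python client, consider the sketch below; the prompt wording, JSON output format, and model identifier are assumptions for illustration, not the exact instructions used in the study.

```python
# Sketch of an LLM judge for the inverted Turing test: ask GPT-4 to classify
# one transcript. The prompt wording and JSON output format are illustrative
# assumptions, not the exact instructions used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_INSTRUCTIONS = (
    "You will read a transcript of a Turing test between an interrogator and a witness. "
    "Decide whether the witness is a human or an AI. Reply in JSON with keys: "
    "verdict ('human' or 'ai'), confidence (0-100), and reason."
)

def judge_transcript(transcript: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_INSTRUCTIONS},
            {"role": "user", "content": transcript},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

# Placeholder transcript, not one from the dataset.
sample = "Interrogator: Where did you grow up?\nWitness: small town up north, it rained a lot lol"
print(judge_transcript(sample))
```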

Results

GPT-4's accuracy as a judge was significantly lower than that of interactive human judges, especially when evaluating the best-performing AI witness, where its accuracy fell below chance.

Transcript length and model quality (GPT-4 vs GPT-3.5) had no significant impact on accuracy.

Discussion

In the inverted Turing test, both AI systems were less accurate as judges than interactive human judges, especially when faced with the best-performing AI witness.

AI participants had a higher pass rate than human participants, meaning the AI judges were more likely to misjudge AI witnesses as human and human witnesses as AI.

These results suggest that current large language models perform poorly at distinguishing between human- and AI-generated content — even when they are evaluating content generated by the same large language model.

One possible explanation for this result is that the output of these systems is driven by their representations of the human training data.

As a result, the model may lack the nuanced or context-sensitive representations needed to distinguish the difference between its own output and the training data it imitates.

This suggests that current large language models are not yet effective tools for detecting AI contributions in conversations.

Displaced Turing Test

The displaced Turing test assesses how accurately people can identify AI in settings where they cannot directly interact with it, such as when reading conversations on social media.

In this test, human participants act as displaced judges, deciding whether each witness is a human or an AI based on a sample of transcripts.

The study posed a set of questions parallel to those in the inverted Turing test.

Method

The same sample of 500 transcripts was presented to 109 students at the University of California, San Diego (84 females, age range 19–31 years), with each participant randomly receiving a subset of 50 transcripts.

Each transcript was judged by an average of 10 participants.

Participants were asked to indicate their confidence level and the basis for their judgment for each transcript. Linear mixed-effects models were fitted using the lmerTest package in R.
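The paper fit these models with lmerTest in R; purely as an illustration of the model structure, the sketch below shows a roughly analogous mixed-effects fit in Python with statsmodels, where the file name and column names (correct, transcript_length, judge_id) are hypothetical.

```python
# Rough Python analogue (via statsmodels) of a linear mixed-effects model like
# those fit with lmerTest in R. File name and column names are hypothetical:
# one row per judgment, with whether it was correct, the transcript length,
# and an identifier for the judge who made it.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("displaced_judgments.csv")  # placeholder file name

# Fixed effect of transcript length on accuracy, random intercept per judge.
model = smf.mixedlm("correct ~ transcript_length", data=df, groups=df["judge_id"])
result = model.fit()
print(result.summary())
```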

Results

The recognition accuracy of displaced human judges was lower than that of interactive human judges, especially when identifying the GPT-based AI witnesses.

Shorter transcripts appeared to be more helpful to judges.

Discussion

The displaced setting makes it harder to identify large language models: judges cannot ask adversarial questions or observe response times, and they may make faster decisions or invest less effort.

These results suggest that in real-world scenarios, where human judges cannot directly interrogate the source of a text, people may have difficulty reliably distinguishing human- from AI-generated content.

As in the inverted Turing test, the best-performing GPT-4 witness had a higher pass rate than human participants.

This suggests that when people read online conversations involving humans and AI models, the AI systems may be more readily judged to be human, while real humans are less often recognized as such.

Furthermore, the negative effect of transcript length on accuracy seems counterintuitive: why would less information be more helpful to judges?

This anomaly may be due to a confound in the way transcript length was determined.

In the original study, judges could end a conversation early, which they might have done once they were confident they were talking to a human or to an AI.

Shorter transcripts were therefore more likely to contain information that was especially diagnostic for interactive judges, and the same information may have helped displaced judges as well.

Results Analysis

Contextual Analysis

A key difference between the displaced-human experiment and the inverted experiment is that the human judges saw 50 transcripts consecutively, whereas the large language model judge saw each transcript in isolation.

Seeing transcripts consecutively may help human judges pick up overall patterns across transcripts and recognize recurring characteristics of AI witnesses, gradually improving the accuracy of their judgments.

To test whether consecutive presentation could also improve the AI judge's performance, the researchers ran a similar experiment in which GPT-4 was shown the transcripts consecutively within a single context.

With this setup, GPT-4's accuracy increased from 36.4% to 58%, approaching the performance of the displaced human judges.
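A minimal sketch of this consecutive presentation is shown below: each transcript, together with the model's own verdict, is appended to a single growing chat history so that later judgments can draw on earlier ones. The prompt text, placeholder transcripts, and loop structure are illustrative assumptions rather than the paper's exact setup.

```python
# Sketch: present transcripts to GPT-4 consecutively within one chat context,
# so earlier transcripts and the model's own verdicts stay visible for later
# judgments. Prompt wording and placeholder transcripts are illustrative only.
from openai import OpenAI

client = OpenAI()

messages = [{
    "role": "system",
    "content": (
        "You will see a series of Turing-test transcripts, one at a time. "
        "For each one, say whether the witness is a human or an AI, with a brief reason."
    ),
}]

transcripts = ["...transcript 1...", "...transcript 2...", "...transcript 3..."]  # placeholders

for i, transcript in enumerate(transcripts, start=1):
    messages.append({"role": "user", "content": f"Transcript {i}:\n{transcript}"})
    response = client.chat.completions.create(model="gpt-4", messages=messages, temperature=0)
    verdict = response.choices[0].message.content
    # Keep the model's own answer in the history so it can adapt its strategy over time.
    messages.append({"role": "assistant", "content": verdict})
    print(f"Transcript {i}: {verdict}")
```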

In addition, when asked about its judging strategy, GPT-4 reported that it had gradually shifted from simply checking grammar to evaluating more complex factors such as conversational consistency, humor and emotional nuance, and long-range contextual coherence, reflecting progress in identifying genuinely human conversation.

This suggests that consecutive presentation can enhance the ability of judges, whether human or AI, to understand and identify the distinguishing features of the text.

Reason Analysis

To assess the decision-making rationales that different judges relied on when deciding whether a transcript was generated by a human or an AI, the study used GPT-4o to classify the judges' stated reasons according to a preset classification scheme.

This scheme, proposed by Jones and Bergen, is designed to capture and categorize the specific reasons judges give for their decisions.

The results showed substantial similarity in the types of reasons cited by AI and human judges.

The most common reason for judging that a conversation comes from AI is that the content of the conversation seems "unnatural".

When making judgments, the AI judges paid more attention to whether the interlocutor showed distinctly human social and emotional traits, for example citing a lack of personality as evidence of AI. The reasons given by human judges were more evenly distributed across categories.

When judging that a conversation came from a human, all three judge types generally pointed to plausible responses and an informal tone, while also attending to socio-emotional factors and linguistic style.

This suggests that, regardless of whether the judges were humans or AIs, when making decisions, they tended to emphasize similar features in the conversations that helped them distinguish between human- and AI-generated content.

Statistical AI Detection Methods

To investigate the effectiveness of statistical methods for detecting AI-generated content in the Turing test, the study used the two main measures evaluated by Mireshghallah et al.: the average log-likelihood of the transcript and its curvature.

The log-likelihood measures how probable the text is under a language model given the preceding context; curvature is assessed by comparing the probability of the text with that of slightly perturbed versions of it.

Experiments using the OPT-125m model show that the two methods have different effects in distinguishing between AI- and human-generated text.

The log-likelihood method gave an overall accuracy of 62%, while the curvature method was even more accurate, at 69%.

The curvature method was more effective at separating human text from AI-generated text, even for text produced by the most capable models such as GPT-4.
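The sketch below illustrates a simplified curvature-style detector: it perturbs a text by swapping a few random words using a masked language model (a rough stand-in for the span-infilling perturbations in the original method) and compares the original text's average log-likelihood under OPT-125m with that of the perturbed copies. All texts and hyperparameters are placeholders, not values from the paper.

```python
# Simplified curvature-style detector: compare the log-likelihood of the
# original text with that of slightly perturbed versions. LLM-generated text
# tends to sit near a local optimum, so its score drops more under perturbation.
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

scorer_tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
scorer = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
scorer.eval()
filler = pipeline("fill-mask", model="roberta-base")  # generates word-level perturbations

def avg_log_likelihood(text: str) -> float:
    """Mean log-probability per token of `text` under the scoring model."""
    inputs = scorer_tok(text, return_tensors="pt")
    with torch.no_grad():
        out = scorer(**inputs, labels=inputs["input_ids"])
    return -out.loss.item()

def perturb(text: str, n_swaps: int = 3) -> str:
    """Replace a few random words with masked-LM suggestions (simplified perturbation)."""
    words = text.split()
    for _ in range(n_swaps):
        i = random.randrange(len(words))
        masked = " ".join(words[:i] + [filler.tokenizer.mask_token] + words[i + 1:])
        candidates = filler(masked, top_k=5)
        words[i] = random.choice(candidates)["token_str"].strip()
    return " ".join(words)

def curvature(text: str, n_perturbations: int = 10) -> float:
    """Original log-likelihood minus the mean log-likelihood of perturbed copies.
    Larger values are taken as weak evidence of AI-generated text."""
    base = avg_log_likelihood(text)
    perturbed = [avg_log_likelihood(perturb(text)) for _ in range(n_perturbations)]
    return base - sum(perturbed) / len(perturbed)

# Placeholder example text (not from the study).
print(curvature("Hey, sorry, I was grabbing coffee earlier. What did you want to ask me?"))
```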

These findings suggest that while these statistical methods provide a tool for distinguishing between human and AI text, their reliability still needs to be improved when working with state-of-the-art AI models.

Future studies need to explore additional statistical methods to improve detection accuracy in this setting.

Summary

Naive Psychology

Watt's criterion for passing the inverted Turing test is that the machine's performance as a judge should be comparable to that of human judges.

If human judges can distinguish humans from a particular AI, then an AI judge should be able to do the same; if humans cannot, then the AI judge should also fail to distinguish them.

The study found that GPT-4's judgment accuracy was comparable to that of displaced human judges, both around 58%.

With in-context learning (ICL), GPT-4's performance improved to a level nearly matching that of humans.

However, despite similar accuracy rates, GPT-4 and the human judges showed different error patterns, suggesting underlying differences in how they reason and judge.

The findings support Watt's view that AI mental models differ systematically from those of human judges.

Future research could further explore the differences in how humans and AI think by increasing interactivity and systematically modifying features of text transcripts.

These studies help to better understand the potential and limitations of AI in simulating human mental states.

Detection in Practice

Detection in practice means assessing whether conversational content comes from a human or an AI in everyday settings such as social media and forums.

Turing tests in these settings are effectively displaced, since observers cannot directly interact with the interlocutor to verify whether it is an AI.

The study found that in this setting, both GPT-4 and human judges performed poorly at distinguishing AI from humans, especially when faced with well-performing AI witnesses, which often passed as human more successfully than real humans did.

Statistical AI detection methods, such as text curvature, show potential for distinguishing human from AI-generated messages, with ROC analysis indicating that this method can reach about 69% accuracy.

However, false positives in practical applications could harm real users, highlighting the limitations of this approach in real-world scenarios.

The study also found that accumulated experience may improve judges' accuracy: both human judges and GPT-4 showed a trend of improvement over successive judgments.

Future research could explore learning effects by giving participants feedback, or further examine how different detection tools interact. These findings highlight the complexity and challenges of AI detection in real-world settings.

Conclusion

In the inverted Turing test, GPT-3.5 and GPT-4 served as AI judges, and in the displaced Turing test humans served as judges, each deciding whether a participant in a conversation was human or AI.

The results showed that both the AI judges and the displaced human judges were less accurate in this passive reading setting than interactive judges in the original Turing test.

This suggests that, without active interaction, both humans and current large language models have difficulty telling the two apart.

References

https://arxiv.org/pdf/2407.08853

This article comes from the WeChat public account "Xinzhiyuan", author: lumina. It is published by 36Kr with authorization.
