ChatGPT, among the world's most powerful AIs, can pass all manner of exams and produce answers in which truth and falsehood are hard to tell apart.
However, there is one thing it cannot do: solve simple visual logic puzzles.
In a test consisting of a series of brightly colored blocks arranged on a screen, most people could spot the pattern connecting them.
But according to a May report by the researchers, GPT-4 got only about one third of the puzzles right in one pattern category, and as few as 3 percent in another.
Paper address: https://arxiv.org/pdf/2305.07141.pdf
The team behind this research aims to provide a better benchmark for testing the capabilities of AI systems, and to help pin down the problems with large language models (LLMs) such as GPT-4.
Melanie Mitchell, one of the paper's authors, said that people in the artificial-intelligence field are struggling with how to evaluate these systems.
How effective is AI assessment?
Over the past two or three years, LLMs have surpassed earlier AI systems in their ability to handle a wide range of tasks.
They work simply by generating a plausible next word for the input text, based on statistical correlations between words across billions of sentences found online.
For chatbots built on LLMs, there is an additional element: human trainers provide extensive feedback to fine-tune the bot's responses.
What's amazing is that this autocomplete-like algorithm, trained on a massive store of human language, displays a breathtaking breadth of capabilities.
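To make that mechanism concrete, here is a minimal sketch of the autocomplete-style loop described above. It uses the small open GPT-2 model from the Hugging Face transformers library purely as a stand-in for a modern LLM; the prompt and the ten-token generation length are arbitrary choices for illustration.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load a small open model as a stand-in for an LLM.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Encode a prompt, then repeatedly append the most probable next token.
ids = tokenizer.encode("The Turing test was proposed by", return_tensors="pt")
for _ in range(10):
    with torch.no_grad():
        logits = model(ids).logits      # a score for every vocabulary token
    next_id = logits[0, -1].argmax()    # greedily pick the most likely one
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))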
Other AI systems may beat an LLM on a given task, but they must be trained on data tied to that specific problem and cannot generalize from one task to another.
Tomer Ullman, a cognitive scientist at Harvard University, says that, broadly speaking, researchers fall into two camps with diametrically opposed views of what is going on inside LLMs. Some attribute the algorithms' achievements to flashes of reasoning or understanding; others (including Ullman himself and researchers such as Mitchell) are far more cautious.
Researchers on both sides of the discussion say tests like logic puzzles that reveal differences in the abilities of humans and AI systems are a step in the right direction.
Brenden Lake, a cognitive computing scientist at New York University, said such benchmarks help reveal the shortcomings of today's machine learning systems and tease out the elements of human intelligence.
Research into how best to test LLMs, and into what those tests actually mean, is also valuable.
Mitchell said that if LLMs are to be applied to real-world fields such as medicine and law, it is vital to understand the limits of their capabilities.
Is the Turing test dead?
The most famous test of machine intelligence has long been the Turing test.
The Turing Test was proposed by the British mathematician and computing guru Alan Turing in 1950, when computers were still in their infancy.
Turing proposed an assessment he called the "imitation game."
In this scenario, a "human judge" engages in a brief, text-based conversation with a computer and an unseen person.
Can the judge reliably detect which respondent is the computer? Turing held that this question is equivalent to asking "Can machines think?"
Mitchell points out that Turing did not specify many details of the scenario, so there is no exact standard to follow.
Other researchers believe that GPT-4 and other LLMs would now likely pass the Turing test, since they can fool many people, at least in short conversations.
In May, researchers at AI21 Labs reported that more than 1.5 million people had played their Turing Test-based online game.
Players correctly identified the bot only about 60% of the time, hardly better than chance.
Researchers familiar with LLMs, however, could still win this game: by exploiting the known weaknesses of these systems, the bot can easily be unmasked.
The key is to get LLM out of its "comfort zone."
François Chollet, a software engineer at Google, suggested presenting an LLM with scenarios that are variations on ones it frequently saw in its training data. In many cases, the LLM answers by spitting out the words most strongly associated with the original question in the training data, rather than the correct answer to the new scenario.
However, Chollet and others are skeptical of deception-centered testing as a goal of computer science.
Benchmarking is dangerous
Instead of the Turing test, researchers typically evaluate AI systems with benchmarks designed to assess performance on specific abilities, such as language proficiency, common-sense reasoning and mathematical ability.
A growing number of research groups are also turning to academic and professional exams designed for humans.
When GPT-4 was released, OpenAI tested its performance on a series of benchmarks designed for machines, including reading comprehension, mathematics, and coding.
According to the technical report, GPT-4 achieved excellent results in most of these tests.
In addition, GPT-4 took 30 exams designed for humans, including the GRE, exams assessing the clinical knowledge of US doctors, and various subject-specific exams for US high-school students.
One challenge researchers later raised is that, because the models are trained on vast amounts of text, they may have seen similar questions in their training data and may, in effect, be looking up the answers. This problem is known as "contamination".
The researchers also note that an LLM's success on exam questions may be brittle and may not translate into the robust abilities needed in the real world.
There is a deeper issue when it comes to interpreting what these benchmarks mean.
A person who performs well on one test can generally be assumed to do well on other cognitive tests and to have mastered certain abstract concepts.
However, LLMs work very differently than humans. Therefore, it is not always valid to extrapolate to artificial intelligence systems in the same way that we judge humans.
This may be because LLMs learn only from language. Without being embedded in the physical world, they cannot experience language's connection to objects, properties and emotions the way humans do.
It's obvious that they understand words differently than humans do.
On the other hand, LLMs also have abilities that humans lack: for example, they have seen the connections between almost all the words humans have ever written.
Nick Ryder, a researcher at OpenAI, agrees that a model's performance on one test may not generalize the way it would for a person who earned the same score.
"I don't think we should draw any equivalent conclusions from our evaluations of humans and large language models," he said. OpenAI's score "does not represent human ability or reasoning ability. It is intended to illustrate how well the model performs at the task."
AI researchers say a broader and more rigorous review is needed to find out the strengths and weaknesses of LLM. Colorful logic puzzles could be one candidate.
Logic puzzles appear
In 2019, before the LLM boom, Chollet published online a new logic test for AI systems that he had created, called the Abstraction and Reasoning Corpus (ARC).
Solvers view a visual demonstration in which several grids of squares change into other patterns, and must show that they have grasped the underlying rule of the change by indicating how the next grid will transform.
Chollet said ARC captures a hallmark of human intelligence: the ability to abstract from everyday knowledge and apply it to problems never seen before.
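To make the puzzle format concrete, here is a toy, hypothetical ARC-style task. Real ARC tasks (published at https://github.com/fchollet/ARC) store each grid as a 2-D array of integers from 0 to 9, one integer per color; the tiny grids and the "mirror left-to-right" rule below are illustrative assumptions, not an actual task from the corpus.

# Demonstration pairs: the solver must infer the hidden rule from these.
demonstrations = [
    {"input":  [[1, 0, 0],
                [1, 0, 0]],
     "output": [[0, 0, 1],
                [0, 0, 1]]},
    {"input":  [[2, 2, 0],
                [0, 2, 0]],
     "output": [[0, 2, 2],
                [0, 2, 0]]},
]

# Test input: the solver applies the inferred rule to a grid it has not seen.
test_input = [[3, 0, 0],
              [3, 3, 0]]

def mirror(grid):
    # The hidden rule of this toy task: reflect each row left to right.
    return [list(reversed(row)) for row in grid]

print(mirror(test_input))  # [[0, 0, 3], [0, 3, 3]]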
Several research teams have since used ARC to test the capabilities of LLMs, and none of the models has come close to human performance.
Mitchell and her colleagues created a new series of puzzles — called ConceptARC — that are inspired by ARC but differ in two key ways.
First, ConceptARC is easier: Mitchell's team wanted to make sure the benchmark would register even small advances in machine capabilities. Second, the team chose specific concepts to test and then created a series of themed puzzle variations for each concept.
What does poor performance mean?
The researchers gave the ConceptARC tasks to GPT-4 and to 400 people recruited online.
Humans scored an average of 91% across all concept groups (97% in one group); GPT-4 scored 33% in one group and less than 30% in all the others.
The researchers thus showed that AI still cannot come close to human performance. Surprisingly, however, it can solve some problems it was never trained on.
The research team also tested the leading AI systems from a competition Chollet held.
Overall, these systems did better than GPT-4 but worse than humans, scoring 77% in their best category but below 60% in most.
However, Sam Bowman, a computer scientist at New York University, said that GPT-4's failure on ConceptARC does not prove it lacks basic abstract-reasoning abilities.
In fact, ConceptARC puts GPT-4 at something of a disadvantage, in part because it is a visual test.
GPT-4 currently accepts only text as input, so the researchers gave it arrays of numbers representing the images; the human participants, by contrast, saw the images themselves.
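As a rough illustration of that substitution, the sketch below flattens a color grid into the kind of textual number array a text-only model can read. The exact serialization used in the paper may differ; the grid and prompt format here are assumptions for illustration.

# One color grid, with integers standing in for colors (0 = background).
grid = [[0, 0, 4],
        [0, 4, 4],
        [4, 4, 4]]

def grid_to_text(grid):
    # Serialize the 2-D grid as digit rows, one line of text per grid row.
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

# The text-only model sees this string instead of a picture.
prompt = "Input grid:\n" + grid_to_text(grid) + "\nOutput grid:\n"
print(prompt)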
Arguments about reasoning
Bowman points out that, taken together with other experiments, the evidence suggests that LLMs have acquired at least a rudimentary ability to reason about abstract concepts.
Still, LLMs' reasoning ability is generally "uneven" and more limited than human reasoning, although it tends to improve as the models' parameter counts grow.
Many researchers agree that how best to test LLMs for abstract reasoning and other signs of intelligence remains an open question.
References
https://www.nature.com/articles/d41586-023-02361-7
This article comes from the WeChat public account "Xin Zhiyuan" (ID: AI_era), author: Taozi; published by 36Kr with authorization.