ARC-AGI-3 releases its largest human test to date: humans completed every environment, highlighting the gap in AI capabilities

According to ME News, on April 15 (UTC+8), citing monitoring by Beating, the ARC Prize Foundation released the ARC-AGI-3 human performance dataset. With 458 participants, it is the largest human testing study in the ARC-AGI series to date. The dataset contains 342 complete replays of human play across 25 public environments and is fully open source.

ARC-AGI-3 comprises 135 abstract reasoning environments. Testers receive no instructions and must explore, infer the rules, and develop strategies on their own. The tests were held in person at a testing center in San Francisco, with each session lasting 90 minutes. Participants received a base payment of approximately $130 plus a $5 bonus for each environment they completed. Every test was a "first-time completion": each participant saw and attempted an environment only once, so the results measure learning and adaptation on entirely new problems. Humans and AI received exactly the same information.

Key conclusion: every environment in ARC-AGI-3 was completed by humans. Each environment was completed by at least two independent participants, and most were completed by five or more. The ARC Prize Foundation stated, "We haven't achieved AGI yet, and this dataset is proof."

Since the ARC-AGI-3 preview, nearly one million AI evaluations have been submitted on the public environments. Drawing on this data, the foundation also announced two adjustments to the scoring rules. First, the human benchmark for each level changes from "the second-best player" to "the median player," reducing the impact of luck on the score. Second, the maximum score for a single level rises from 100% to 115%, so that a poor performance on one level does less to drag down the overall score. The net effect of the two adjustments is a slight increase of approximately 0.5 percentage points in both human and AI scores. (Source: ME)
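To make the two scoring adjustments concrete, here is a minimal Python sketch. The article does not give the actual ARC-AGI-3 scoring formula, so everything below is an assumption for illustration: it supposes a level is scored as the ratio of a human baseline action count to the agent's action count (fewer actions is better), capped at a maximum, and averaged across levels. The function names and the sample numbers are hypothetical and are not taken from the foundation's code or data.

```python
from statistics import median


def human_baseline(action_counts, rule):
    """Pick a per-level human baseline from participants' action counts.

    ASSUMPTION: fewer actions means a better run, so the "second-best
    player" is the second-smallest count. The real benchmark definition
    is not given in the article.
    """
    ranked = sorted(action_counts)
    if rule == "second_best":
        return ranked[1] if len(ranked) > 1 else ranked[0]
    if rule == "median":
        return median(ranked)
    raise ValueError(f"unknown baseline rule: {rule}")


def level_score(agent_actions, baseline, cap):
    """ASSUMED formula: baseline / agent actions, capped at `cap`."""
    return min(cap, baseline / agent_actions)


def overall_score(agent_actions_per_level, human_counts_per_level, rule, cap):
    """Average the capped per-level scores (averaging is also an assumption)."""
    scores = [
        level_score(agent, human_baseline(humans, rule), cap)
        for agent, humans in zip(agent_actions_per_level, human_counts_per_level)
    ]
    return sum(scores) / len(scores)


# Hypothetical data: two levels, five human runs each, one agent run each.
humans = [[20, 30, 40, 50, 60], [10, 12, 14, 16, 18]]
agent = [40, 7]

old_rules = overall_score(agent, humans, rule="second_best", cap=1.00)
new_rules = overall_score(agent, humans, rule="median", cap=1.15)
print(f"old rules: {old_rules:.3f}, new rules: {new_rules:.3f}")
```

In this toy setup, the median baseline is less sensitive to a single unusually efficient human run than the second-best baseline, and the higher cap lets a strong level partly offset a weak one. The size of the change here comes entirely from the made-up numbers and should not be compared with the roughly 0.5 percentage point shift the foundation reported.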
