METR updates its AI agent capability benchmark; Gemini 3.1 Pro takes the top spot among cutting-edge models for reliability.

According to ME News, on April 16 (UTC+8), the AI safety evaluation organization METR updated its "Time Horizon" benchmark, adding test data for Google's Gemini 3.1 Pro. The benchmark tracks the upper limit of cutting-edge AI agents' ability to complete programming tasks autonomously, and since its launch in February this year it has become an important reference for measuring the growth of AI agent capabilities.

The measurement method has human software engineering experts (averaging about 5 years of experience) and AI agents complete the same set of more than 100 software tasks, with the time humans need serving as the measure of task difficulty. There are two core metrics: the 50% Time Horizon (the most difficult task, by human time, that the AI completes with 50% probability) and the 80% Time Horizon (the most difficult task that the AI completes with 80% probability). One way to estimate such horizons is sketched at the end of this article.

Gemini 3.1 Pro's rankings on the two metrics are reversed. On the 50% time horizon it ranks second, behind only Claude Opus 4.6, which leads by a wide margin:

1. Claude Opus 4.6: approximately 12.0 hours
2. Gemini 3.1 Pro: approximately 6.4 hours
3. GPT-5.2: approximately 5.9 hours
4. GPT-5.4: approximately 5.7 hours

On the more stringent 80% time horizon, however, Gemini 3.1 Pro takes the top spot:

1. Gemini 3.1 Pro: approximately 1.5 hours
2. Claude Opus 4.6: approximately 1.2 hours
3. GPT-5.2: approximately 1.1 hours

In other words, Claude Opus 4.6 can tackle harder tasks but its success rate fluctuates widely, while Gemini 3.1 Pro has a lower ceiling but is more stable within its range (the second sketch at the end of the article makes this concrete). For production scenarios that require predictable results, the latter may be more practical.

Compared with its predecessor, Gemini 3 Pro (50% time horizon of approximately 3.7 hours), Gemini 3.1 Pro improves by about 71%.

Over a longer timeline, METR's data shows that the time horizon of cutting-edge models has grown from a few seconds for GPT-2 in 2019 to over ten hours today, roughly doubling every 4.3 months. METR says it "sees no signs of slowing exponential growth."

One caveat: METR's tasks cover software engineering, machine learning, and cybersecurity, and all of them are well-defined, automatically scored, self-contained tasks. Follow-up METR research found that AI performance declined significantly when the scoring method changed from algorithmic judging to holistic human evaluation. A 12-hour time horizon therefore does not mean the AI can replace half a day of human work. (Source: ME)
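As an illustration of the metric itself: a time horizon of this kind comes from fitting a success-probability curve against task length and reading off where it crosses 50% and 80%. The sketch below shows one way to do that fit; the task durations and pass/fail outcomes are made up for illustration, and the logistic form is an assumption, not necessarily METR's exact pipeline.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical benchmark results: each task's human completion time
# (minutes) and whether the agent succeeded (1) or failed (0).
# These values are illustrative, not METR's data.
task_minutes = np.array([2, 4, 8, 15, 30, 60, 120, 240, 480, 960], dtype=float)
success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0], dtype=float)

def p_success(log_t, log_t50, slope):
    """Logistic success probability as a function of log task length."""
    return 1.0 / (1.0 + np.exp(slope * (log_t - log_t50)))

# Fit success against log(task length).
params, _ = curve_fit(p_success, np.log(task_minutes), success,
                      p0=[np.log(60.0), 1.0])
log_t50, slope = params

t50 = np.exp(log_t50)                       # task length where P(success) = 0.5
t80 = np.exp(log_t50 - np.log(4) / slope)   # P = 0.8  ->  exp term equals 1/4

print(f"50% time horizon: ~{t50:.0f} min, 80% time horizon: ~{t80:.0f} min")
```

Because the 80% horizon sits further down the same curve, it is always shorter than the 50% horizon, which is why both leaderboards above report much smaller numbers for the stricter metric.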
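The reliability point can also be read off the article's own numbers. Under the same assumed logistic model, the ratio between a model's 50% and 80% horizons pins down how steeply its success rate falls with task length; a steeper implied slope means more predictable behavior. This is a back-of-the-envelope reading of the reported figures, not a METR result.

```python
from math import log

def implied_slope(t50_hours: float, t80_hours: float) -> float:
    """Slope of an assumed logistic success curve, inferred from the
    50% and 80% time horizons: t80 = t50 * exp(-ln(4)/slope), hence
    slope = ln(4) / ln(t50 / t80)."""
    return log(4) / log(t50_hours / t80_hours)

# Figures as reported in the article above.
print(f"Claude Opus 4.6 implied slope: {implied_slope(12.0, 1.2):.2f}")  # ~0.60
print(f"Gemini 3.1 Pro implied slope:  {implied_slope(6.4, 1.5):.2f}")   # ~0.96
```

Gemini 3.1 Pro's steeper implied slope is the quantitative version of the article's claim: its success rate stays high until near its ceiling and then drops sharply, while Claude Opus 4.6's success rate degrades more gradually across a wider range of task lengths.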
