OpenAI launches PaperBench, a benchmark to assess AI agents' ability to replicate research

According to Foresight News, OpenAI has launched PaperBench, a benchmark that evaluates AI agents' ability to replicate research. Agents must reproduce 20 top papers from ICML 2024, a task that involves understanding each paper, writing code, and running experiments. Grading uses fine-grained rubrics developed in collaboration with the papers' original authors, covering 8,316 individually gradable requirements, with an LLM serving as the judge. In the results, Claude 3.5 Sonnet (New) combined with open-source scaffolding performed best, achieving an average replication score of 21.0%, which still falls short of the human baseline.
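The grading scheme described above, a hierarchical rubric whose leaf requirements are judged pass/fail by an LLM and whose scores roll up as weighted averages, can be sketched in a few lines. This is a minimal illustration, not OpenAI's implementation: the class and function names are invented for this example, and the substring check stands in for the actual LLM judge call.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    """One node in a hierarchical replication rubric: leaves are
    concrete, gradable requirements; inner nodes aggregate their
    children's scores by weight."""
    description: str
    weight: float = 1.0
    children: list[RubricNode] = field(default_factory=list)

def judge_leaf(node: RubricNode, submission: str) -> bool:
    # Stand-in for the LLM judge: in PaperBench an LLM reads the
    # requirement and the agent's submission (code, logs, results)
    # and decides pass/fail. A trivial substring check keeps this
    # sketch runnable.
    return node.description.lower() in submission.lower()

def score(node: RubricNode, submission: str) -> float:
    """Replication score in [0, 1]: a leaf scores 1.0 if judged
    passed, an inner node scores the weighted mean of its children."""
    if not node.children:
        return 1.0 if judge_leaf(node, submission) else 0.0
    total = sum(c.weight for c in node.children)
    return sum(c.weight * score(c, submission) for c in node.children) / total

# Toy rubric with two requirements of unequal weight.
rubric = RubricNode("Replicate paper X", children=[
    RubricNode("code executes end-to-end", weight=2.0),
    RubricNode("reported metric is reproduced", weight=3.0),
])
print(f"Replication score: {score(rubric, 'code executes end-to-end'):.1%}")
# -> Replication score: 40.0%
```

Aggregating thousands of such leaf judgments is what yields a single average replication score like the 21.0% reported for the best-performing agent.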