And I was right!! IQuest-Coder's SWE-bench evaluation was set up incorrectly: the environment includes the whole git history, including future commits. The model has found this trick and uses it rather often. Thus, its SWE-bench score should be discarded.

Xeophon
@xeophon
01-01
Your timeline will be full of this image. If you believe this is a real model, I have a bridge to sell to you.
For starters, they don’t disclose how they run those evals, which is a huge red flag.
But good luck to the poor soul who’ll get nerdsniped by this. x.com/xianbao_qian/s…

From Twitter