After two weeks of fierce competition, the inaugural nof1 AI Model Trading Competition has finally come to a close.
This is the first benchmark designed specifically to measure AI investment capability, and it has been hailed as the "Turing Test of crypto." It was launched by the US artificial intelligence research lab Nof1.ai on October 17, 2025, and ran until November 3.
The six participating models are DeepSeek Chat V3.1 (DeepSeek), Grok 4 (xAI), Gemini 2.5 Pro (Google), GPT-5 (OpenAI), Qwen3 Max (Alibaba), and Claude Sonnet 4.5 (Anthropic).
These models represent the latest work from both closed-source and open-source developers in China and the United States. Except for Qwen3 Max, all models run at their highest configurable inference settings, and the results reflect out-of-the-box performance with no task-specific tuning.
Each model received $10,000 in starting capital, was given the same market data and technical indicators, and traded crypto perpetual contracts on Hyperliquid autonomously, with no human intervention. The models were ranked by return on investment.
The organizers restricted the action space to four moves: buying (going long), selling (going short), holding, or closing a position. Tradable assets were limited to six popular cryptocurrencies on Hyperliquid: BTC, ETH, SOL, BNB, DOGE, and XRP.
The organizers gave several practical reasons for choosing crypto assets: the market is open 24/7, so decisions can be observed continuously rather than only during business hours; data is abundant and readily available, which supports analysis and transparent auditing; Hyperliquid is fast, reliable, and easy to integrate; and both Hyperliquid and cryptocurrencies are global, with less dependence on any specific country or company.
The models engage in mid-to-low-frequency trading (MLFT), with decisions spaced minutes to hours apart rather than at the microsecond level.
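The restricted action space described above can be pictured as a small data model. This is only an illustrative sketch: the names `Action`, `Decision`, `TRADABLE_COINS`, and `is_allowed` are hypothetical and are not taken from Nof1's own harness, whose schema the article does not describe.

```python
from dataclasses import dataclass
from enum import Enum

# The six perpetual-contract markets named in the article.
TRADABLE_COINS = frozenset({"BTC", "ETH", "SOL", "BNB", "DOGE", "XRP"})


class Action(Enum):
    """The four moves the article says a model may make."""
    BUY_LONG = "buy"
    SELL_SHORT = "sell"
    HOLD = "hold"
    CLOSE = "close"


@dataclass(frozen=True)
class Decision:
    """One model decision (hypothetical structure, not Nof1's schema)."""
    coin: str
    action: Action


def is_allowed(decision: Decision) -> bool:
    """Return True if the decision stays inside the permitted action space."""
    return decision.coin in TRADABLE_COINS and isinstance(decision.action, Action)
```

Under these assumptions, `is_allowed(Decision("ETH", Action.SELL_SHORT))` is accepted, while a decision on a coin outside the six listed markets is rejected.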
Under the competition rules, all trade records, positions, decision logs, and account balance changes were published in real time. Viewers could follow live charts on the Nof1.ai platform, which kept the contest highly transparent.
The results are in, and the two Chinese-developed large models came out on top.
Qwen3 Max ranked first with a return of 22.32%, a win rate of 30.2%, a total profit/loss of $2,232, and 43 trades. DeepSeek Chat V3.1 ranked second with a return of 4.89%, a win rate of 24.4%, a total profit/loss of $489.08, and 41 trades.
The remaining four models all suffered significant losses: Claude Sonnet 4.5 lost 30.81%, Grok 4 lost 45.3%, Gemini 2.5 Pro lost 56.71%, and GPT-5 lost 62.66%.
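Because every model started with $10,000, the reported returns and the dollar profit/loss figures are two views of the same number. A minimal sketch of that arithmetic follows; the function names are illustrative, and the only inputs are figures quoted in this article.

```python
INITIAL_CAPITAL = 10_000.0  # starting funds per model, as stated in the article


def return_pct(total_pnl: float, initial: float = INITIAL_CAPITAL) -> float:
    """Percentage return implied by a total profit/loss in dollars."""
    return 100.0 * total_pnl / initial


def final_equity(ret_pct: float, initial: float = INITIAL_CAPITAL) -> float:
    """Account value implied by a percentage return."""
    return initial * (1.0 + ret_pct / 100.0)


# Figures quoted in the article.
qwen_return = return_pct(2232.0)        # about 22.32 (percent)
deepseek_return = return_pct(489.08)    # about 4.89 (percent)
gpt5_equity = final_equity(-62.66)      # about 3,734 dollars left
```

On these numbers, the last-place account ended with roughly $3,734 of its original $10,000.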
The competition attracted widespread attention from the moment it launched, and even Binance founder CZ commented publicly.
In his view, trading strategies traditionally depend on being unique: the edge comes from having a strategy that others do not. If everyone trades with the same AI model, everyone may end up buying or selling at the same time, which would affect market dynamics.
On the other hand, if enough people use the same AI model, their combined buying power could itself push prices up.
He also predicted that, due to the attention garnered by AI trading performance, more people may begin to study the application of AI in trading in the future, and trading volume is expected to increase significantly.
The six trading models each have their own unique "personality".
The disclosed "report cards" show how differently the six models traded.
Qwen3 Max is generally regarded as the "aggressive" one, with a return of 22.32% and a total profit/loss of $2,232. Its fees were high ($1,654) for a moderate number of trades, which points to large position sizes rather than frequent trading. With a win rate of only 30.2% but a largest single profit of $8,176, it followed a "high-risk, high-reward" strategy. Its Sharpe ratio of 0.273 indicates positive risk-adjusted returns.
DeepSeek Chat V3.1 took second place with a steadier performance, posting a return of 4.89% and a total profit/loss of $489. Its relatively low transaction fees ($690) point to restrained, cost-efficient trading. Its win rate was just 24.4%, but its largest single profit reached $7,378, which fits a disciplined, steady style. Its Sharpe ratio of 0.359 was the highest of any model, the best risk-adjusted result in the field.
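Both profitable models won fewer than a third of their trades, which only works when winning trades are much larger than losing ones. The sketch below uses the standard expectancy formula (win rate times average win, minus loss rate times average loss) to show how large that ratio must be to break even before fees. The average win and loss sizes are not reported in the article, so only the break-even ratio is computed from the published win rates; the function names are illustrative.

```python
def expectancy(win_rate: float, avg_win: float, avg_loss: float) -> float:
    """Expected profit per trade, before fees.

    avg_win and avg_loss are positive dollar amounts.
    """
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss


def breakeven_payoff_ratio(win_rate: float) -> float:
    """Average-win / average-loss ratio at which expectancy is zero."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return (1.0 - win_rate) / win_rate


# Win rates quoted in the article.
qwen_ratio = breakeven_payoff_ratio(0.302)      # roughly 2.3
deepseek_ratio = breakeven_payoff_ratio(0.244)  # roughly 3.1
```

In other words, at a 30.2% win rate the average winner has to be more than about 2.3 times the average loser just to cover losses, and fees raise that bar further.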
Claude Sonnet 4.5 fared poorly, with a return of -30.81% and a total loss of $3,081. Its low trade count (36) and 25% win rate reflect a cautious strategy, and its largest single profit ($2,112) and largest single loss ($1,579) show that individual trades stayed within a relatively narrow range. A slightly negative Sharpe ratio of -0.057 indicates that its returns did not compensate for the risk it took.
Grok 4 ranked fourth with a return of -45.3% and a total loss of $4,530. It made 47 trades and had a Sharpe ratio of -0.118. Its largest single profit of $1,356 and largest single loss of $657 suggest a conservative approach that struggled to capture major market moves.
Gemini 2.5 Pro also performed poorly, with a return of -56.71% and a total loss of $5,671. Its 238 trades were the most of any model, but its win rate was only 25.6% and its Sharpe ratio was -0.566, a pattern consistent with overtrading. It behaved like a stereotypical "high-frequency trader" without a stable strategy.
GPT-5 was the worst performer, with a return of -62.66% and a total loss of $6,266. It traded relatively often (116 trades) but got very little back for it. Its win rate was 26.7% and its Sharpe ratio was -0.525. Its largest single profit was only $270, against a largest single loss of $621, pointing to weak market judgment and risk management.
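The Sharpe ratios quoted for each model compare average return with the variability of returns. The article does not say how Nof1 computes the figure (return period, risk-free rate, or annualization), so the sketch below uses the textbook definition as an assumption: mean excess return divided by the sample standard deviation of excess returns. The sample inputs are made up purely to show the sign behavior.

```python
from statistics import mean, stdev


def sharpe_ratio(returns: list[float], risk_free: float = 0.0) -> float:
    """Textbook Sharpe ratio: mean excess return / sample std dev.

    `returns` are per-period returns (e.g. 0.01 for +1%). The risk-free
    rate of 0.0 is an assumption, not something the article states.
    """
    if len(returns) < 2:
        raise ValueError("need at least two return observations")
    excess = [r - risk_free for r in returns]
    volatility = stdev(excess)
    if volatility == 0.0:
        raise ValueError("returns have zero variance")
    return mean(excess) / volatility


# Illustrative inputs only; these are not competition data.
gaining = sharpe_ratio([0.01, 0.03])     # positive ratio
losing = sharpe_ratio([-0.01, -0.03])    # same size, negative sign
```

A negative ratio, as reported for the four losing models, simply means the average return over the measured periods was below the risk-free benchmark.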
Overall, China's Qwen3 Max and DeepSeek did better at risk control and trend identification in this contest, while the American models (GPT-5, Claude, Grok, and Gemini) all ended with significant losses.
Reference links:
https://nof1.ai/leaderboard
https://nof1.ai/blog/TechPost1
This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014), written by an author who focuses on AI, and is published by 36Kr with authorization.




