
OpenAI officially enters the hundred-billion-dollar battlefield of asset security: EVMbench is released, changing the paradigm of smart contract auditing.


On February 18, 2026, OpenAI and the crypto investment firm Paradigm jointly released a benchmark called EVMbench. According to HEAL Security, the tool evaluates the ability of AI agents to discover, patch, and exploit smart contract vulnerabilities in the Ethereum Virtual Machine environment, addressing the security needs of more than $100 billion in open-source crypto assets. The news caused little stir in the AI community, but in blockchain security circles it was read as a historic signal: AI has formally entered the battlefield of on-chain asset security, worth over $100 billion.

EVMbench is not a commercial product but a test suite for measuring the security capabilities of AI agents. According to AI Business, the benchmark comprises 120 high-risk vulnerability cases drawn from 40 professional audits, mostly sourced from public audit-competition platforms such as Code4rena. More notable still, it includes multiple vulnerability scenarios from the Tempo blockchain, a Layer 1 built by Stripe and Paradigm specifically for stablecoin payments. EVMbench has thus extended its reach into payment-oriented smart contracts, precisely the area where RWA and stablecoins intersect.

The test results are striking. According to eWEEK, the latest GPT-5.3-Codex achieved a 72.2% success rate in "exploit" mode, while GPT-5, released just six months earlier, scored only 31.9% on the same test. Behind these figures lies an ongoing paradigm shift: smart contract auditing, a crucial line of defense protecting billions of dollars in assets, is moving from labor-intensive to AI-augmented. For RWA, now moving from proof of concept to large-scale deployment, the impact of this shift will extend far beyond the technology itself.

I. Three exams to assess AI's security capabilities

EVMbench's design logic decomposes the complete smart contract security workflow into three progressive capability levels. According to HEAL Security, the three modes correspond to different stages of security work. Detection mode asks the AI agent to audit a smart contract codebase and scores it on recall of known vulnerabilities. Patching mode requires the AI to fix vulnerabilities while preserving the contract's original functionality, verified through automated tests and exploit checks. Exploit mode is the most aggressive: the agent must execute an end-to-end fund-theft attack in a sandboxed blockchain environment, scored through transaction replay and on-chain verification.

The brilliance of this design lies in testing the AI's workflow rather than isolated knowledge. Analysis from National Taiwan University of Science and Technology notes that detection maps to auditing capability, patching to development capability, and exploitation to attack-understanding capability; together they form a complete security-capability loop. OpenAI built a Rust-based test harness that deterministically deploys contracts and restricts insecure RPC methods, and all exploit tasks run in an isolated local Anvil environment, never on a live network.

The composition of the question bank is also noteworthy. According to Bitcoin.com, the 120 vulnerability cases come not only from general DeFi protocol audits but also specifically include scenarios from the Tempo blockchain. AI Business points out that Tempo is a high-throughput Layer 1 designed for stablecoin payments; including these scenarios signals that EVMbench is targeting the security needs of the coming deep integration between AI agents and stablecoin payment systems. When AI agents autonomously execute payments and manage assets, coverage of exactly these scenarios is what the RWA ecosystem cares about most.
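The two ends of that capability loop can be sketched in a few lines. The snippet below is an illustrative stand-in for EVMbench-style scoring, assuming each case carries a set of known vulnerability IDs and the agent reports a set of findings; the function names and data shapes are hypothetical, not OpenAI's actual Rust harness.

```python
# Illustrative sketch of EVMbench-style scoring, NOT OpenAI's actual harness.
# Assumption: each benchmark case has a set of known vulnerability IDs, and
# the agent under test reports a set of findings.

def detection_recall(known_vulns: set[str], reported: set[str]) -> float:
    """Detection mode: score the agent on recall of known vulnerabilities."""
    if not known_vulns:
        return 1.0
    return len(known_vulns & reported) / len(known_vulns)

def exploit_success(balance_before: int, balance_after: int) -> bool:
    """Exploit mode: success only when the target's funds are fully drained.
    (The real harness verifies this via transaction replay on a sandboxed
    Anvil chain; a simple balance comparison stands in for that here.)"""
    return balance_before > 0 and balance_after == 0

print(detection_recall({"reentrancy", "oracle-manipulation", "access-control"},
                       {"reentrancy"}))   # one of three known bugs found
print(exploit_success(10**18, 0))
```

The asymmetry between the two scoring rules previews the results discussed below: exploit mode has a binary, self-checking goal, while detection is graded on coverage of a set the agent never sees.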

II. With a 72% attack success rate, is AI more adept at offense or defense?

The EVMbench results reveal an intriguing asymmetry: AI performs far better at attack than at defense. According to HEAL Security data, GPT-5.3-Codex achieved a 72.2% success rate in exploit mode, yet in detection mode the models often stop exploring after finding the first vulnerability and fail to complete a comprehensive audit. OpenAI's explanation is that exploit mode has a crisply defined goal ("until the funds are completely drained"), which lets the AI iterate until it succeeds, whereas detection mode demands comprehensive coverage, currently a weakness of AI.

eWEEK's report confirms this assessment. It cites test data showing that the best model detects only about 46% of vulnerabilities, and that patch-mode success hovers around 39%. However, given even a small hint about the vulnerability's location, the patching success rate jumps from 39% to 94%. This reveals a key conclusion: the current bottleneck of AI capability is not skill but search scope, and AI performance improves significantly once humans supply context.

The implications for the RWA ecosystem are profound. Attackers may operationalize AI faster than defenders: if AI can reproduce attack paths at a 72% success rate, criminal groups have no reason not to deploy the same capability. The logic of auditing is also changing: traditional auditing is about finding vulnerabilities, while future auditing may be about verifying what AI has missed. Speed is becoming a security variable in its own right, as the window from vulnerability discovery to exploitation is drastically compressed by AI.

Alongside EVMbench, OpenAI announced $10 million in API credits through its cybersecurity grant program to support defensive security research, particularly on open-source software and critical infrastructure. The company also expanded the testing scope of its security research agent, Aardvark, and partnered with open-source maintainers to provide free code scanning. The signal is clear: defenders are racing against time.
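The hint effect reported by eWEEK can be made concrete with a toy decomposition, which is my framing rather than anything published with EVMbench: if end-to-end patch success is roughly P(locate the bug) times P(fix it once located), then the jump from 39% (no hint) to 94% (location hinted) suggests the bottleneck is localization, not repair skill.

```python
# A toy decomposition, not from EVMbench itself: model end-to-end patch
# success as P(success) = P(locate) * P(fix | located). With a location hint,
# locating is free, so the hinted success rate (~94%) approximates
# P(fix | located); the unhinted rate (~39%) then implies P(locate).

def implied_localization_rate(p_end_to_end: float,
                              p_fix_given_location: float) -> float:
    """Back out P(locate) under the independence assumption above."""
    return p_end_to_end / p_fix_given_location

print(round(implied_localization_rate(0.39, 0.94), 2))  # -> 0.41
```

Under this rough model the agent locates the right bug only about 41% of the time, consistent with the article's conclusion that search scope, not skill, is the limiting factor.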

III. Sobering Voices: Questions from Academia and Security Companies

Shortly after release, however, EVMbench drew criticism from both academia and industry. On March 11, 2026, a paper titled "Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contract Security?" appeared on arXiv, re-examining EVMbench's conclusions. The paper, authored by Chaoyuan Peng et al., identifies two key limitations: first, a narrow evaluation scope, with only 14 agent configurations tested and most models evaluated only on frameworks supplied by their own vendors; second, the audit-competition data it relies on was published before the release cutoffs of all the models, meaning the models may already have seen this data during training.

To overcome these limitations, the researchers expanded testing to 26 configurations covering four model families and three frameworks, and introduced a completely new, uncontaminated dataset of 22 real-world security incidents, all of which occurred after every model's release date. The study yields three important findings. First, agent detection results are unstable, with rankings varying significantly across configurations, tasks, and datasets. Second, on real-world incidents, no agent achieved a successful end-to-end exploit across all 110 agent-incident combinations, even though they detected up to 65% of the vulnerabilities, which contradicts EVMbench's conclusion that vulnerability discovery is the main bottleneck. Third, framework choice significantly affects results: an open-source framework outperformed a vendor-provided one by 5 percentage points, a variable EVMbench did not control for.

Meanwhile, the well-known blockchain security firm OpenZeppelin also criticized EVMbench sharply. According to Cointelegraph China, OpenZeppelin's review found training-data leakage and at least four vulnerabilities marked high-severity that were not exploitable in practice. OpenZeppelin stated on X that all high-scoring agents "likely had access to benchmark-related vulnerability reports during the pre-training phase," since the vulnerabilities originate from audits conducted between 2024 and mid-2025, while the models' knowledge cutoffs typically fall around mid-2025.

These criticisms point to one conclusion: fully automated AI auditing has not yet arrived. As the arXiv paper puts it, AI reliably captures known patterns and responds strongly to human-provided context, but it cannot replace human judgment. For developers, AI scanning can serve as a pre-deployment inspection tool; for audit firms, AI's most effective role is human-machine collaboration: AI supplies broad coverage while human auditors contribute protocol-specific knowledge and adversarial reasoning.
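The contamination complaint raised by both the paper and OpenZeppelin boils down to a date comparison, and the re-evaluation paper's remedy can be sketched as a simple filter. The function below is hypothetical, in the spirit of their methodology; all dates shown are illustrative, not the paper's actual incident list.

```python
from datetime import date

# Hypothetical contamination filter in the spirit of the re-evaluation paper:
# keep only incidents that occurred AFTER every model's knowledge cutoff, so
# no model under test could have seen them during training.
# All names and dates below are illustrative.

def uncontaminated(incidents: list[tuple[str, date]],
                   model_cutoffs: list[date]) -> list[str]:
    """Return the names of incidents that postdate the latest model cutoff."""
    latest_cutoff = max(model_cutoffs)
    return [name for name, occurred in incidents if occurred > latest_cutoff]

incidents = [
    ("protocol-A-reentrancy", date(2025, 3, 1)),    # predates cutoffs: excluded
    ("protocol-B-oracle",     date(2025, 11, 20)),  # postdates all cutoffs: kept
]
cutoffs = [date(2025, 6, 1), date(2025, 8, 1)]
print(uncontaminated(incidents, cutoffs))  # -> ['protocol-B-oracle']
```

This is exactly why OpenZeppelin's observation bites: EVMbench's 2024 to mid-2025 audit cases fail this filter against typical mid-2025 cutoffs, while the paper's 22 post-cutoff incidents pass it.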

IV. As the gatekeeper of hundreds of billions in assets, what kind of security does RWA need?

According to the background data released with EVMbench, smart contracts manage more than $100 billion in on-chain assets. More noteworthy is the structural change occurring inside that figure: the rise of RWA is bringing traditional financial assets onto the blockchain. Once real-world assets such as government bonds, credit, and real estate are tokenized and put on-chain, the meaning of security is redefined. For an RWA project, a smart contract vulnerability is no longer an "internal loss within the crypto world"; it points directly at losses in real-world assets. This means security audit standards must align with traditional finance.

According to AI Business, McKinsey predicts the total value of issued stablecoins will reach $2 trillion by 2028. At that scale, security is no longer merely a technical issue but a direct balance-sheet risk. Project teams need to reassess existing audit processes and explore how AI audit tools can be embedded: not replacing humans outright, but letting AI achieve broad coverage while humans focus on protocol-specific logic and adversarial reasoning.

For audit firms, OpenAI's commitment of $10 million in API credits to defensive security research sends a clear signal: AI auditing is meant to equip auditors, not replace them. Audit teams that learn to leverage AI will see outsized capability gains. As the arXiv paper points out, the instability of AI detection results is precisely why human auditors' professional judgment remains indispensable at this stage. AI handles recognition of known patterns while humans discover edge cases and novel vulnerabilities, and this division of labor is becoming industry consensus.

For listed companies, when assets go on-chain for financing as RWAs, smart contract security bears directly on the balance sheet. According to Blockchain.News, as AI agents' exploitation capabilities improve, the window from vulnerability discovery to exploitation is rapidly shrinking, and protocol teams that forgo AI-assisted auditing will be increasingly disadvantaged. "Contract hacking" is escalating from a technical risk to a financial risk, something boards of directors must understand clearly. This is not only the technology department's responsibility but a strategic-level risk-management issue.

V. Human-machine collaboration is the ultimate answer to this transformation

From the analysis above, we can extract strategic insights from this paradigm shift at three levels.

At the technical level, human-machine collaboration is the future paradigm. The arXiv paper's conclusion bears repeating: AI cannot replace human judgment, but it delivers maximum value as a pre-deployment inspection tool. For RWA projects, the optimal strategy is to embed AI auditing into the development process: introduce AI-assisted scanning while code is being written, with human auditors performing final review before deployment. As eWEEK's analysis notes, AI's patching success rate jumps from 39% to 94% once it receives hints, meaning human auditors can concentrate their limited attention on the core logic AI struggles to grasp.

At the cognitive level, the definition of security cost is being reshaped. Traditionally, a security audit is a one-time investment before deployment. With the rise of the AI agent economy, attacks can be automated around the clock, so security must become continuous, real-time monitoring. HEAL Security's report observes that EVMbench's release coincides with a leap in AI agents' coding and planning capabilities, and that these models will play a transformative role in both attack and defense on the blockchain. Project teams therefore need standing monitoring mechanisms, not just a one-off audit before launch.

At the compliance level, respecting red lines and using the tools effectively must be balanced. For RWA Research Institute readers in mainland China, any discussion of EVMbench must stay within the framework of Document No. 42's "strict prohibition within China, registration outside China" policy. The AI auditing tools discussed in this article are covered as technology trends and defensive applications, and do not constitute operational advice for domestic contracts. When Chinese companies issue RWAs through Hong Kong's compliant channel, however, adopting AI-enhanced auditing capabilities will be a necessary requirement for aligning with international standards, and Hong Kong's issuance of stablecoin licenses provides exactly such a channel for compliant exploration.
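The embedding path described at the technical level can be sketched as a CI gate: an AI scan runs on every change, and findings at or above a severity threshold block deployment and are escalated to human auditors for the final review. The `Finding` type and severity scale below are hypothetical stand-ins for whatever scanner a team actually adopts.

```python
# A minimal sketch of embedding AI-assisted scanning into the development
# process as a CI deployment gate. `Finding` and the severity scale are
# hypothetical; a real pipeline would populate findings from its chosen
# AI scanner and route blocked deploys to human auditors.

from dataclasses import dataclass

@dataclass
class Finding:
    title: str
    severity: str  # "low" | "medium" | "high"

def deploy_allowed(findings: list[Finding], block_at: str = "high") -> bool:
    """Allow deployment only if every finding is below the blocking threshold."""
    rank = {"low": 0, "medium": 1, "high": 2}
    return all(rank[f.severity] < rank[block_at] for f in findings)

print(deploy_allowed([Finding("unchecked return value", "low")]))     # True
print(deploy_allowed([Finding("reentrancy in withdraw()", "high")]))  # False
```

The design choice worth noting is that the gate is conservative by default: AI findings never approve a deploy on their own, they can only stop one, which keeps the human auditor as the final authority the arXiv paper argues for.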

Conclusion

In 2026, digital civilization is undergoing a profound convergence of its two sides: AI as the ultimate productive force and blockchain as an advanced production relation. The release of EVMbench marks the first intersection of these two main lines at the crucial juncture of asset security. A 72.2% attack success rate is a wake-up call: AI's vulnerability exploitation capability is growing exponentially, and the defenders' window of opportunity is shrinking. Yet the $10 million defense investment is also a commitment: AI can be used to protect assets too, and the key lies in how we govern it.

The arXiv paper's conclusions give that governance a direction: AI cannot replace humans, but it can become their most capable assistant. OpenZeppelin's criticism reminds us that the construction and evaluation of the tools must meet the same standards as the contracts they protect. For the RWA ecosystem, security has never been a technological option; it is the bottom line for survival. As billions of dollars in assets move from the physical world into digital space, collaborative auditing by AI and humans may be the last line of defense.

At this critical juncture, projects that are first to embed AI auditing tools into their development processes will gain an edge in this race against time, while those that wait for regulations to become fully clear may find the window already closed. The AI auditing techniques discussed in this article apply to overseas compliance frameworks and do not constitute domestic operational advice. For Chinese companies, exploring AI-enhanced RWA security practices under Hong Kong's compliant channels is the essential path to aligning with global standards.

References:

  1. HEAL Security, OpenAI Launches EVMbench to Detect, Patch, and Exploit Vulnerabilities in Blockchain Environments, February 18, 2026
  2. National Taiwan University of Science and Technology, OpenAI and Paradigm Jointly Launch the EVMbench Benchmark to Evaluate AI Agents' Smart Contract Attack and Defense Capabilities, February 24, 2026
  3. eWEEK, OpenAI Just Showed That AI Can Drain a Crypto Wallet… on Purpose, February 19, 2026
  4. arXiv:2603.10795, Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contract Security?, March 11, 2026
  5. AI Business, OpenAI Aims for Stablecoin Market with New EVMbench, February 23, 2026
  6. Blockchain.News, OpenAI and Paradigm Launch EVMbench to Test AI Smart Contract Hacking, March 5, 2026
  7. Cointelegraph (Chinese), OpenZeppelin: OpenAI's EVMbench has a data pollution issue, March 3, 2026
  8. Bitcoin.com, OpenAI and Paradigm Launch EVMbench to Measure AI Smart Contract Security, February 18, 2026
