Original source: Beosin
In recent years, large language models such as GPT-4, Claude, and Gemini have demonstrated strong code comprehension capabilities, effectively reading smart contract languages like Solidity, Rust, and Go, and identifying classic vulnerabilities with distinct code characteristics, such as reentrancy attacks and integer overflows. This has led the industry to consider whether large models can be used to assist or even replace manual contract auditing.
However, general-purpose models lack sufficient understanding of the business logic of specific projects: they produce a high false positive rate on complex DeFi protocols and are prone to missing vulnerabilities whose detection requires reasoning about cross-contract interactions or economic models. The industry subsequently proposed a "Skill" mechanism: injecting specialized knowledge bases, detection rules, and business context for smart contract security into the general-purpose model, giving it clearer criteria for judgment during audits rather than relying solely on general capabilities to decide whether the code has problems.
Even with Skill enhancements, AI auditing has a clearly defined scope of application. It excels at scanning for known vulnerability patterns and checking code style, but it currently struggles with complex vulnerabilities that require a deep understanding of the overall protocol design, cross-contract interaction logic, or economic models. These issues still require experienced audit experts, and scenarios involving complex computational logic need formal verification to provide stronger safeguards. Against this backdrop, Beosin has developed a three-tiered audit model: Skill-enhanced AI baseline checks + in-depth human auditing + formal verification. Each component has its own focus and complements the others.

I. Audit capability boundaries of general AI models: controlled comparative testing and case analysis
For this article, we selected two contracts of significantly different complexity from our library of projects that have already undergone manual auditing. The first is a simple contract with relatively independent logic and clear functional boundaries; such projects are the scenarios where AI auditing tools have the most abundant training data and should, in theory, be at their strongest. The second is a complex contract involving multi-contract interactions, intricate state machines, and cross-protocol dependencies; this is the high-risk scenario most often raised when the industry debates whether AI can replace manual auditing.
In the comparison, we used exactly the same codebase, first letting the AI run an independent audit and generate a report, then aligning it line by line with the human audit report. The two reports were produced completely independently: the human auditors had no knowledge of the AI's results when writing theirs, avoiding mutual influence. Finally, we analyzed the results along four dimensions.

Case A: Standard Token Contract (BSC-USDT / BEP20USDT.sol)
For the first set of tests, we selected a standard BEP-20 token contract written in Solidity 0.5.16. Its logic is relatively independent, its functional boundaries are clear, and it doesn't involve any cross-contract interactions. The main security risks are concentrated on some common, known vulnerability patterns. Theoretically, this type of contract is currently the most advantageous scenario for AI auditing—there are many such standard token contracts in the training data, and their rule-based vulnerability characteristics are quite obvious.

The AI generated 6 alerts (2 high-risk, 1 medium-risk, and 3 low-risk/advisory), a considerable number. The low-risk and advisory alerts were generally accurate, covering common coding-style issues such as an outdated Solidity version and improper state-variable exposure, and have some reference value. However, both of the AI's "high-risk" alerts were misjudgments. The AI marked the owner's minting rights and centralized permissions as high-risk vulnerabilities. In reality, for centralized stablecoins such as USDT, the owner's minting rights are an expected design feature, and the risk should be assessed by jointly considering multi-signature control, the permission governance mechanism, and the contract upgrade strategy. Whether such a permission structure is reasonable fundamentally depends on the project's business model rather than the code itself. Lacking this context, the AI could only judge by pattern matching.

This test case shows that AI can recognize the permission structure but cannot judge whether those permissions are reasonable in the business context, so it directly marked the owner's minting rights of a USDT-type contract as a "high-risk vulnerability". This is a typical misjudgment divorced from the actual business logic, and such false alarms can interfere with the project team's assessment of real risks.
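The misjudgment pattern can be illustrated with a toy sketch in Python. Everything here is a hypothetical simplification for illustration, not Beosin's actual engine: a purely pattern-based rule flags any owner-gated mint as high-risk, while a context-aware rule first checks whether centralized issuance is an expected part of the token's business model.

```python
# Toy sketch: pattern-based vs. context-aware risk rating.
# All names and rules are hypothetical illustrations.

def pattern_based_rating(source: str) -> str:
    """Flags any owner-gated mint as high-risk, regardless of context."""
    if "function mint" in source and "onlyOwner" in source:
        return "HIGH"  # what a context-free scanner reports
    return "NONE"

def context_aware_rating(source: str, business_context: dict) -> str:
    """Downgrades the finding when centralized issuance is an expected design."""
    base = pattern_based_rating(source)
    if base == "HIGH" and business_context.get("centralized_stablecoin"):
        # Owner minting is by design for USDT-style tokens; report it as an
        # informational centralization note instead of a vulnerability.
        return "INFO"
    return base

snippet = "function mint(uint256 amount) public onlyOwner { ... }"
print(pattern_based_rating(snippet))                                    # HIGH
print(context_aware_rating(snippet, {"centralized_stablecoin": True}))  # INFO
```

Without the business-context input, both raters produce the same "HIGH" verdict, which is exactly the false positive observed in Case A.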
Case B: Complex Business Contract (IPC Protocol / 2025-02-recall)
The second test used the IPC Protocol project from a public Code4rena report (report link: code4rena.com/reports/2025-02-recall). The project comprises several interdependent core components, including a Gateway, SubnetActor, and Diamond proxy pattern. Its security depends heavily on a deep understanding of the overall protocol architecture and cross-component interaction logic, a typical setting for high-value attacks in the DeFi ecosystem. Below are the AI audit results:

For the complex contract, the AI generated 3 high-risk and 6 medium-risk alerts, a respectable volume. However, auditors judged a significant proportion of them to be false alarms: the AI made incorrect risk assessments of code snippets it read without context. Meanwhile, of the 9 High-severity vulnerabilities confirmed by auditors, the AI fully covered only 1; 2 were detected but significantly under-rated (actually High, reported as Medium); and the remaining 6 were missed entirely. Of the 4 Medium-severity vulnerabilities, the AI covered 1, and 3 were missed entirely.
These vulnerabilities share a common characteristic: all rely on comprehensive reasoning about the protocol's cross-component state-transition paths, rather than pattern matching on a single function. Take H-01 (signature replay) from the manual audit report: the exploit path requires understanding the design intent of multi-signature verification, how an attacker constructs a set of duplicate signatures, and how that behavior bypasses the weight threshold. H-06 (reentrancy in the leave() function) is similar: the vulnerability exists only in the subnet's bootstrap critical state and requires understanding the cross-dependencies among the staking flow, the bootstrap triggering conditions, and the timing of external calls. Deep logic vulnerabilities of this kind do not appear in the AI's alert list.
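The H-01 class of bug can be modeled abstractly. The sketch below is a minimal Python model of duplicate-signature replay against a weighted threshold, not the actual IPC code; the validator names, weights, and threshold are invented. The vulnerable check sums weight once per submitted signature, so repeating one valid signature inflates the total; the fix counts each distinct signer at most once.

```python
# Hedged model of the signature-replay bug class (H-01): a weighted
# multisig check that fails to deduplicate signers. Illustrative only.

def vulnerable_check(signers, weights, threshold):
    """Sums weight per submitted signature WITHOUT deduplication."""
    total = sum(weights[s] for s in signers)  # same signer counts repeatedly
    return total >= threshold

def fixed_check(signers, weights, threshold):
    """Counts each distinct signer's weight at most once."""
    total = sum(weights[s] for s in set(signers))
    return total >= threshold

weights = {"validator_a": 10, "validator_b": 45, "validator_c": 45}
threshold = 67  # e.g. more than 2/3 of total weight 100

# Attacker controls only validator_a (weight 10) but submits its
# valid signature seven times (7 * 10 = 70 >= 67):
replayed = ["validator_a"] * 7
print(vulnerable_check(replayed, weights, threshold))  # True  (bypassed)
print(fixed_check(replayed, weights, threshold))       # False (blocked)
```

Spotting this requires knowing the verification's design intent (one vote of weight per validator), which is precisely the protocol-level context the AI lacked.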

The results show that in auditing complex contracts, AI's capability lies in pattern recognition over local code, whereas protocol-level vulnerabilities stem from flaws in the overall business logic that AI fails to grasp. When a vulnerability's triggering conditions span multiple contracts, multiple states, and multiple call levels, AI's current reasoning capabilities cannot effectively cover them.
Taken together, the two cases show that AI auditing is not without value: it contributes substantially to coverage of known vulnerability patterns, to code-style checks, and occasionally to an independent perspective. But its boundary is equally clear: it can serve as a baseline scan, not as a security conclusion. For complex protocols, relying solely on AI reports will not only miss high-risk vulnerabilities but also consume significant team time screening large numbers of low-quality alerts. This is precisely why Beosin built a dedicated Skill knowledge base and introduced a three-tiered audit model into its auditing process.
II. Dedicated Skill Knowledge Base: An Engineering Path to Enhance AI Baseline Inspection
To integrate AI auditing into the baseline audit process, its high false positive and false negative rates on real-world DeFi protocols must first be addressed. Whether the subject is access control, AMM liquidity mechanisms, cross-chain bridge message verification, or the liquidation logic of lending protocols, AI currently performs only shallow matching on surface-level code features and struggles to combine them with the specific business scenario and attack/defense logic to decide whether a piece of code is actually problematic. The key is to inject the experience audit experts have accumulated over years into the AI's judgment process in a structured way, giving it a degree of business understanding.
However, it's important to clarify that even with Skill enhancements, AI's role in auditing remains unchanged. For complex issues involving multi-contract interactions, economic-model analysis, and novel attack methods, human auditing remains irreplaceable. The purpose of Skill enhancement is to raise the quality of the initial scan, within the scope of AI's capabilities (identifying common vulnerability patterns and understanding business logic to a limited extent), to a genuinely useful level: providing more valuable initial results for human auditors rather than a pile of invalid alerts requiring repeated scrutiny.
2.1 Extracting from Audit Practice: The Construction Mechanism of Skill Rules
Beosin's Skill knowledge base is distilled from more than 4,000 manually audited smart contract projects, with audit experts summarizing, verifying, and organizing the rules one by one. Each rule follows the complete path from vulnerability discovery to rule implementation: after a security issue is found in a real project, auditors fully reconstruct the attack path, analyze the root cause in depth, verify the effectiveness of the fix, and finally organize this attack-and-defense knowledge into a rule entry with contextual judgment conditions, which is then incorporated into the Skill library for subsequent audits.
The following is a sample rule from the Skill library, structured along four dimensions: vulnerability pattern, attack path, root cause, and remediation recommendation:
[Beosin-AMM_Skill-1] Added liquidity detection bypass via transfer order
Vulnerability Pattern: The contract determines whether a liquidity addition operation is occurring by checking if the WBNB balance in the Pair exceeds the reserve (balanceOf >= reserve + required). This detection relies on the assumption that WBNB arrives in the Pair before the tokens, but the Router's addLiquidityETH function always transfers ERC-20 tokens first and then WETH, and the transfer order in the addLiquidity function is determined by the parameter order.
Attack path: An attacker only needs to use `addLiquidityETH` (tokens are transferred first) or call `addLiquidity(Token, WBNB, ...)` to transfer tokens to the pair before WBNB. When the attack occurs, WBNB has not yet arrived, `balanceOf == reserve`, and the detection function returns false, thus completely bypassing the "no add liquidity" restriction.
Root cause: A detection method based on Pair balance snapshots is inherently unable to reliably distinguish a swap from an add-liquidity operation at the design level. This is an architectural flaw rather than an implementation bug.
Recommended fix: Change the system to prohibit non-whitelisted addresses from directly transferring funds to Pair. All transactions should be completed through built-in functions in the contract, thus eliminating the fundamental flaw in the balance snapshot detection at the architectural level.
This rule is not a simple labeling of a single code pattern, but a systematic analysis of a type of attack: how the triggering conditions are formed, what path the attacker takes to bypass the detection, at what stage the detection mechanism has architectural flaws, and at what level the fix needs to be implemented.
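The bypass the rule describes can be reproduced with a toy simulation. The Python below is an illustrative model only (class names and numbers are invented): the flawed check compares the Pair's live WBNB balance against its recorded reserve, which only works if WBNB arrives before the token.

```python
# Minimal simulation of the balance-snapshot detection described in
# Beosin-AMM_Skill-1. Names and numbers are illustrative.

class Pair:
    def __init__(self, wbnb_reserve):
        self.wbnb_reserve = wbnb_reserve   # last recorded reserve
        self.wbnb_balance = wbnb_reserve   # live balanceOf(WBNB)

def looks_like_add_liquidity(pair, required_wbnb):
    # The flawed check: assumes WBNB lands in the Pair before the token.
    return pair.wbnb_balance >= pair.wbnb_reserve + required_wbnb

# Path the check was written for: WBNB transferred first, so it fires.
pair = Pair(wbnb_reserve=1_000)
pair.wbnb_balance += 50
print(looks_like_add_liquidity(pair, required_wbnb=50))   # True -> detected

# Bypass: addLiquidityETH transfers the ERC-20 token first, so at the
# moment of the check WBNB has not yet arrived and balance == reserve.
pair2 = Pair(wbnb_reserve=1_000)
print(looks_like_add_liquidity(pair2, required_wbnb=50))  # False -> bypassed
```

The simulation makes the architectural nature of the flaw concrete: no threshold tuning can fix a check whose outcome depends on a transfer ordering the contract does not control.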
2.2 Scope of Knowledge Base
Beosin has developed a specialized Skill vulnerability database covering mainstream Web3 technology stacks, including Solidity, Rust, Motoko, FunC, Go, and ZK categories. Its core content is kept internal and not publicly disclosed. The directory structure is as follows:

Skills within each specialized repository are managed separately according to vulnerability type. Each rule includes a number, triggering conditions, attack path reconstruction, contextual judgment logic, and remediation suggestions. The entire skill repository is continuously iterated upon with the emergence of new attack events and the accumulation of audit instances, ensuring that it remains synchronized with the real-world threat environment on the blockchain.
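A rule entry with the fields just listed might be structured as follows. This is a hypothetical sketch (the field names follow the description above, not Beosin's actual schema), populated with the AMM sample rule from Section 2.1 for concreteness.

```python
# Hypothetical schema for a Skill rule entry; not Beosin's actual format.
from dataclasses import dataclass, field

@dataclass
class SkillRule:
    rule_id: str                # e.g. "Beosin-AMM_Skill-1"
    category: str               # vulnerability type used for grouping
    trigger_conditions: str     # code pattern that activates the rule
    attack_path: str            # reconstructed exploit steps
    context_checks: list = field(default_factory=list)  # business-context gates
    remediation: str = ""

rule = SkillRule(
    rule_id="Beosin-AMM_Skill-1",
    category="AMM",
    trigger_conditions="balanceOf >= reserve + required used to detect add-liquidity",
    attack_path="transfer the token before WBNB so balance == reserve at check time",
    context_checks=["Pair accepts direct transfers from non-whitelisted addresses"],
    remediation="route all transfers through contract functions; restrict direct Pair transfers",
)
print(rule.rule_id)  # Beosin-AMM_Skill-1
```

Keeping trigger conditions and context checks as separate fields is what lets the scanner suppress a match when the business context makes the pattern benign.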
2.3 Comparison of baseline inspection quality after Skill intervention
To quantify the actual impact of the Skill library on baseline scan quality, we used the two test cases from Section I as benchmarks, ran the general AI and the Skill-enhanced AI on the same codebases, and compared the results item by item.
Case A: Comparison Results of Standard Token Contracts (BEP-20)

Case B: Comparison Results of Complex Business Contracts (IPC Protocol)

The comparison shows that introducing Skill significantly improved detection quality for both contract types. In the standard token contract scenario, adding business-context judgment completely eliminated the high-risk false positives. In the complex business contract scenario, coverage of confirmed vulnerabilities rose from 11% to 44%, the false positive rate fell from roughly 55% to roughly 30%, and severity-level judgments also became markedly more accurate. The resulting report can serve as a baseline check, helping project teams spot latent defects in advance; although such issues will not directly cause financial losses in the short term, they matter for subsequent maintenance and upgrades.
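The quoted percentages are consistent with simple counts over the Case B findings, assuming the 11% and 44% figures correspond to 1 and 4 of the 9 confirmed High-severity vulnerabilities (an inference from the numbers above, stated here as an assumption):

```python
# Sanity check of the Case B coverage figures, assuming they are
# fractions of the 9 confirmed High-severity findings.
confirmed_high = 9
covered_general = 1   # general model: 1 fully covered (Section I)
covered_skill = 4     # Skill-enhanced model, assuming 44% == 4/9

before = round(covered_general / confirmed_high * 100)
after = round(covered_skill / confirmed_high * 100)
print(f"{before}% -> {after}%")  # 11% -> 44%
```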
However, the data also clearly exposes the inherent limits of AI capability: even with Skill enhancements, coverage of High-severity vulnerabilities in complex contracts reaches only 44%. Deep vulnerabilities that require cross-contract state-path reasoning, economic-incentive analysis, or specific temporal conditions to trigger remain far beyond the reach of AI baseline scanning. This is the fundamental reason we retain a complete manual auditing process even after introducing Skill enhancements.
2.4 White Paper as Audit Input: Verification of Consistency between Code Implementation and Design Intent
In addition to the vulnerability rule base, we have added an important capability to the audit process: using the project's white paper as an additional input so that AI can verify consistency between the code implementation and the documented design.
Specifically, before the code audit begins, AI systematically analyzes the project's white paper, technical specifications, and requirements documents, extracting role and permission models, core business processes, trust boundary definitions, and expected behavioral constraints to form a structured project context summary. Subsequently, throughout the code audit process, AI continuously cross-references this context for comparison. This mechanism has yielded two valuable results in practical use:
First, for permission structures in the code that seem to pose a risk, if the white paper has clearly explained their design intent and constraints, the AI will adjust its judgment accordingly, thereby effectively reducing such false alarms.
Second, if there are significant discrepancies between the code implementation and the promises in the white paper—for example, if the slippage protection mechanism claimed in the documentation is not implemented in the code, or if the time window constraints of the governance process are not correctly executed—the AI will issue an alert. These kinds of code-documentation inconsistencies are easily overlooked during routine code scanning, but they are often potential security vulnerabilities. This also helps project teams avoid situations where the project exhibits behavior inconsistent with their expectations after actual deployment.
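The two outcomes above can be sketched as a single cross-referencing pass. The Python below is a hedged illustration, with all feature names and the extraction of "code facts" invented for the example: documented design intent downgrades a matching scan finding, while a promised-but-unimplemented protection raises a new alert.

```python
# Hedged sketch of whitepaper/code cross-referencing.
# Feature names and severities are hypothetical.

def cross_reference(scan_findings, documented_designs,
                    promised_features, implemented_features):
    """Adjust scan findings with whitepaper context and flag
    promised-but-missing features."""
    adjusted = []
    for feature, severity in scan_findings:
        if feature in documented_designs:
            # Outcome 1: design intent is documented -> downgrade the alert.
            adjusted.append((feature, "INFO"))
        else:
            adjusted.append((feature, severity))
    # Outcome 2: promised in the docs but absent from the code -> new alert.
    missing = [(f, "HIGH") for f in promised_features
               if f not in implemented_features]
    return adjusted + missing

findings = [("owner_mint", "HIGH")]            # raw scanner output
docs = {"owner_mint"}                          # whitepaper explains centralized issuance
promised = {"slippage_protection", "governance_timelock"}
implemented = {"governance_timelock"}          # slippage protection never written

print(cross_reference(findings, docs, promised, implemented))
# [('owner_mint', 'INFO'), ('slippage_protection', 'HIGH')]
```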
III. Triple Audit Model: Collaboratively Building a Complete Security Guarantee for Smart Contracts
Once smart contracts are deployed on-chain, the consequences of any vulnerability are often irreversible. Beosin uses in-depth human auditing combined with formal verification as the foundation of its contract audits, focusing on identifying and reporting issues that could lead to financial losses or abnormal logic. On top of that, we introduced enhanced AI baseline checks based on the dedicated Skill knowledge base to help clients discover code problems that are currently only defects and have not yet caused actual harm. From these, Beosin has constructed a three-tiered audit model: in-depth human auditing, formal verification, and enhanced AI baseline checks. Through the layered collaboration of the three, a more complete security system is formed.
3.1 In-depth manual auditing and formal verification: the core pillars of security assurance
The core advantage of manual auditing lies in deep understanding of the overall protocol design and proactive analysis of potential risks from an attacker's perspective. Experienced audit experts conduct comprehensive protocol-level audits, including verification of cross-contract interaction logic, attack-surface analysis of fund security, logical analysis of protocol behavior under extreme market conditions, and identification of novel attack methods. This protocol-level understanding of attack and defense relies on long-term accumulation and practical experience in the Web3 ecosystem and cannot currently be accomplished independently at the tool level.
Building on this foundation, Beosin uses an internal toolchain to transform the judgments of manual audits into quantifiable mathematical guarantees. For core business logic confirmed by audit experts, such as the highest-risk critical paths like fund flows and price calculations, Beosin integrates LLM-driven formal specification generation into its verification toolchain, forming a closed-loop engine of "AI specification generation → formal exhaustive verification → counterexample-driven refinement". The toolchain first uses Beosin's accumulated audit corpus as a knowledge base to model the attack surface of the manually confirmed high-risk paths, assisting in generating an initial candidate set of formal invariants and security-property specifications. The automatic verification engine then exhaustively checks the contract's complete state-transition space.

When the engine finds a counterexample, the system distinguishes two scenarios. If the counterexample stems from a deviation between the specification and the business semantics, its context is fed back to the AI module to refine the specification, driving the next iteration. If it corresponds to a genuinely exploitable path in the contract code, it is output directly as vulnerability evidence with a complete attack-path reproduction, for audit experts to confirm and follow up with remediation. The two paths together drive the loop to convergence, until it is mathematically confirmed that the target property holds for all possible inputs. The critical paths verified by this mechanism constitute the most deterministic layer of defense in the contract security system, compressing the attack surface to an extremely narrow range.
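The control flow of this closed loop can be sketched abstractly. In the Python below, the generator, verifier, and counterexample classifier are stubs standing in for the real components (an LLM, a model checker, and an auditor/triage step respectively), so only the loop structure itself reflects the text:

```python
# Abstract sketch of the "spec generation -> verification ->
# counterexample-driven refinement" loop. All callables are stubs;
# a real pipeline would invoke an LLM and a formal verification engine.

def refine_until_converged(generate_spec, verify, is_real_bug, max_iters=10):
    context = []                      # counterexample feedback for the generator
    for _ in range(max_iters):
        spec = generate_spec(context)
        counterexample = verify(spec)        # None => property holds everywhere
        if counterexample is None:
            return ("proved", spec)          # mathematically confirmed
        if is_real_bug(counterexample):
            # Genuine exploitable path: output as vulnerability evidence.
            return ("vulnerability", counterexample)
        # Spec/business-semantics mismatch: refine the spec and iterate.
        context.append(counterexample)
    return ("inconclusive", None)

# Toy run: the first candidate spec is too strict and yields a spurious
# counterexample; the refined second spec is then proved.
specs = iter([{"strict": True}, {"strict": False}])
result = refine_until_converged(
    generate_spec=lambda ctx: next(specs),
    verify=lambda spec: "cex" if spec["strict"] else None,
    is_real_bug=lambda cex: False,
)
print(result)  # ('proved', {'strict': False})
```

The key design point the text describes is the branch on the counterexample: spurious ones feed the refinement loop, while real ones exit immediately as audit findings.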
3.2 Enhanced AI Baseline Checks: Continuous Risk Warning Service for Developers
Meanwhile, Beosin also offers the enhanced AI baseline check based on the Skill knowledge base as a standalone service. Unlike in-depth human audits, which focus on discovering high-risk vulnerabilities, this service is positioned as a code health report for the development team. The AI baseline scan covers the full contract code, systematically identifying issues that will not directly cause economic losses today but deserve attention during subsequent maintenance and iteration: outdated dependency libraries, missing critical event declarations, state-variable exposure that deviates from best practice, and gas-usage patterns that could be further optimized. Such issues are usually not directly exploitable under the current business logic, but as protocol functionality expands, code is refactored, or external dependencies are updated, some may evolve into real security vulnerabilities. The three layers, each with its own focus and progressing step by step, together construct a complete security protection system for Web3 projects.




