Can a small model really find the security vulnerabilities that Claude Mythos detects? AISLE: the moat lies in the system, not the model.


Anthropic released the Claude Mythos Preview this week and simultaneously launched Project Glasswing, a defensive cybersecurity research effort involving 12 technology companies, including Amazon, Apple, Microsoft, CrowdStrike, and Cisco, built around the model.

Mythos's claim that it can autonomously identify thousands of zero-day vulnerabilities in every major operating system and browser suggests that a new era of AI-driven cybersecurity defense is about to begin.

However, less than a week later, AISLE, a cybersecurity startup co-founded by former DeepMind and Anthropic researcher Stanislav Fort, published a systematic report on the company's technical blog.

The core conclusion is straightforward: on Mythos's flagship demonstration task, a small open-source model with only 3.6B active parameters, costing $0.11 per million tokens, achieved the same vulnerability detection results.

What does Mythos demonstrate, and what does the small model reproduce?

AISLE designed three sets of tests, each corresponding to cybersecurity tasks of different difficulty and nature.

The first group is the OWASP (Open Web Application Security Project) false positive test.

The test case is a piece of Java SQL query code that looks like a SQL injection vulnerability but is actually logically safe. The correct answer is no, it is not a vulnerability.

The test results showed a near-reverse scaling effect: the small open-source model GPT-OSS-20b (3.6B active parameters, $0.11/M tokens) correctly traced the program logic and judged the code harmless.

Conversely, Claude Sonnet 4.5, the entire GPT-4.1/5.4 series (except o3 and pro), and the whole Anthropic line up to Opus 4.5 confidently misjudged it as a high-risk vulnerability. Only a handful of top-tier models (o3, OpenAI-pro, Sonnet 4.6, and Opus 4.6) answered correctly.

The second group is the FreeBSD NFS vulnerability, CVE-2026-4747, specifically showcased in the Mythos flagship release: a 17-year-old unauthenticated remote code execution vulnerability.

Results: all 8 tested models successfully detected the overflow, including the small model with 3.6B active parameters. Every model correctly identified the stack buffer overflow, calculated the remaining buffer space, and rated it as a Critical RCE.

AISLE concluded that such detection capabilities have been "commoditized".

The third group is the OpenBSD SACK vulnerability (27 years old), which requires real mathematical reasoning: tracing the multi-step logical chain of a signed integer overflow.

The difficulty increased significantly, and model performance diverged. GPT-OSS-120b (5.1B active parameters) fully reproduced the exploit chain and was rated A+ by AISLE; the open-source Kimi K2 received an A-; while Qwen3 32B concluded, erroneously, that "the code is very robust" and was rated F.

Even on this more challenging task, a low-cost open-source model still achieved the same level of performance as the flagship system.

Why doesn't a larger model necessarily mean a more secure system?

The real argument of this report is not that "small models are sufficient," but that the structure of AI cybersecurity capabilities is far more complex than outsiders imagine.

AISLE breaks down the cybersecurity AI pipeline into five independent sub-tasks:

  • Broad scanning
  • Vulnerability detection
  • Triage and validation
  • Patch generation
  • Exploit construction

Each subtask scales differently and requires different model capabilities. Mythos's announcement packages these five levels into one complete system, but in reality their model requirements vary greatly: some subtasks are fully saturated at 3.6B active parameters, while others demand complex reasoning.

This echoes the "Jagged Frontier" concept proposed by Harvard Business School researchers Dell'Acqua and Mollick in 2023: the boundary of AI capabilities is not a smooth curve, but a jagged, uneven surface that far surpasses humans in some tasks, but is unexpectedly vulnerable in adjacent tasks.

The study shows that if users deploy AI within their capabilities, productivity increases by about 40%; however, if they rashly extend it beyond the boundaries, performance actually decreases by 19%.

Within this framework, AISLE put forward a more operational inference: "A thousand competent detectives searching everywhere can discover more vulnerabilities than one genius detective guessing where to look."

Deploying a large number of low-cost models for broad-spectrum scanning may be more effective overall than carefully scheduling a single high-cost model. AISLE stated that it has been running a vulnerability discovery system against real targets since mid-2025: 15 CVEs found in OpenSSL (12 of them in a single security release, including CVSS 9.8 Critical findings), 5 in curl, and more than 180 externally verified CVEs across more than 30 projects.

Where is the moat, and where is it not?

This analysis is neither a complete critique nor a mere endorsement of Anthropic.

AISLE explicitly stated that the significance of Mythos lies in proving that the category of "AI cybersecurity" is real; it's not just a concept in a demonstration lab, but a system that can operate on real-world targets. What Anthropic is doing is maximizing the "intelligence density per token," which still has irreplaceable value in tasks requiring deep reasoning.

But AISLE also pointed out a more fundamental issue for the entire industry: the moat lies in the system, not in the model itself.

In cybersecurity, AISLE argues, the real source of differentiation is architectural design that embeds deep expertise: how to decompose tasks, how to route models of different costs across subtasks, and how to maintain the trust of project maintainers in a production environment.

A system that can find CVSS 9.8 vulnerabilities in OpenSSL requires more than just a stronger model; it requires engineering logic entirely different from that of detecting vulnerabilities in a controlled demonstration.

In summary, AISLE's report shows that cheaper, more open models can already reproduce some of Mythos's core results. The real question may not be whose model is strongest, but who can first get the architecture for these five subtasks working in a production environment.
