After feinting with Mythos, Anthropic unexpectedly dropped Claude Opus 4.7.
Plenty of you must have stayed up all night playing with it like crazy!
I jumped out of bed, started browsing around while trying out Opus 4.7, and came away with some bad news and some good news.
Let's start with the bad news: Opus 4.7 feels a bit like an old friend.
It always wants to "catch me safely".
Many users have also reported that although it's supposed to be an upgrade, Opus 4.7 seems to be getting more and more like GPT the more you use it.
If that's true, it's not a good thing (helplessly-closes-eyes.jpg).
There's only one piece of bad news, but there's a ton of good news.
It beats its predecessor in many areas, including agentic coding, agentic terminal coding, scaled tool use, and visual reasoning, though it regresses in a few isolated capabilities, such as agentic search.
Anthropic also stated, rather smugly:
Opus 4.7 is currently our most powerful publicly available model. However, it's not our most powerful model!
It seems the most powerful one is still Mythos, which Anthropic is keeping up its sleeve.
Judging from the benchmark table above, Mythos outperforms Opus 4.7 by roughly 10% to 15% across those tests.
Without a doubt, Mythos Preview is Anthropic's strongest card right now, with maximum capabilities, but it also costs five times as much as Opus 4.7.
By comparison, Opus 4.7 is more like the strongest mass-production model: fully vetted safety systems, affordable pricing, and open access across all platforms.
But... even the wisest person can make a mistake.
Despite its powerful performance, Opus 4.7 still managed a stumble on launch day.
Claude Opus 4.7 Surprise Attack: Four Core Upgrades
Overall, this most powerful publicly available model, Opus 4.7, performs exceptionally well in all four areas.
Advanced Software Engineering: A Trustworthy Partner
Opus 4.7’s most significant advancements are in the field of advanced software engineering.
Let's look at this set of data:
The SWE-bench Verified test score reached 78.2%;
SWE-bench Multimodal achieved 72.7%;
Terminal-Bench 2.0 scored 68.8%;
The number of production tasks resolved in Rakuten-SWE-Bench is three times that of Opus 4.6;
Scores on a 93-task GitHub coding benchmark also improved by 13%.
Cursor CEO Michael Truell offered a key assessment:
On CursorBench, Opus 4.7 jumped from 58% to 70%, a significant leap.
This improvement is reflected in three key characteristics.
First, strict instruction following.
Opus 4.7 no longer "flexibly interprets" vague user phrasing the way earlier models did; it executes instructions literally.
Previously, if you wrote a suggestion like "Try to optimize this code if possible," the model might selectively ignore it.
Now, if you say "optimize this code," it will be done.
This change requires users to readjust their prompting strategies: soft modifiers such as "if possible / ideally / try to" now carry real weight, and hard constraints need to be stated explicitly.
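As a rough illustration, here is a minimal sketch using the standard Anthropic Python SDK; the model ID and the prompt wording are placeholders of my own, not taken from the announcement:

```python
# Sketch: restating a "soft" prompt as explicit constraints for a model
# that follows instructions literally. The model ID is a placeholder.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Before: older models might treat this as optional and skip it;
# Opus 4.7 is described as weighing such modifiers as real instructions.
soft_prompt = "Refactor this function. Try to add type hints if possible."

# After: hard requirements stated as requirements, optional ones labeled.
hard_prompt = (
    "Refactor this function.\n"
    "Must: add type hints to every parameter and return value.\n"
    "Optional (skip if it hurts readability): shorten local variable names."
)

response = client.messages.create(
    model="claude-opus-4-7",  # hypothetical model ID, for illustration only
    max_tokens=1024,
    messages=[{"role": "user", "content": hard_prompt}],
)
print(response.content[0].text)
```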
Second, self-verification before output.
Opus 4.7 devises ways to verify its own outputs before reporting results, much like a senior engineer runs tests before committing code.
Third, skill with complex multi-file changes, fuzzy debugging, and cross-service code review.
Sarah Sachs, AI Lead at Notion, shared some data:
For complex, multi-step workflows, Opus 4.7 offers a 14% performance improvement over Opus 4.6, with lower token consumption and only one-third of the tooling errors. It was the first model to pass our implicit requirements test.
Visual capabilities: 3x resolution, more detail visible
Opus 4.7 also shows significant improvement in visual capabilities.
Official data shows the longest image side now supports up to 2576 pixels (≈3.75 megapixels), more than 3x Opus 4.6; its visual-acuity score on XBOW reaches 98.5% (vs. just 54.5% for Opus 4.6).
That covers almost all real-world scenarios: it can directly read complete Figma design files and 1080p terminal screenshots (including small gray text), accurately parse complex architecture diagrams and financial charts, and clearly make out high-density UI elements in computer-use settings, with visual processing that is close to flawless.
In other words, tasks that previously needed specialized models, such as chemical-structure analysis, reading complex technical charts, and pixel-precise localization of UI elements, can now be handled by Opus 4.7 alone.
Figma's stock price plummeted on the news; utterly disastrous over there.
Instruction following and reasoning: more controllable, more reliable
Opus 4.7 has also made significant progress in instruction following.
It no longer tries to guess the user's true intentions, but instead strictly follows the literal meaning.
The core advantage of this upgrade is strict literal execution: if the user says "don't use TypeScript," the model resolutely won't; if the user says "output JSON," the output carries no extra prefix.
This may take some adjustment for long-time users (old prompts can produce unexpected results and need recalibrating), but it is a boon wherever precise control matters; a sketch of the JSON case follows.
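To see why that matters in practice, here is a minimal sketch, assuming the response really is bare JSON with no preamble or code fence (the model ID is again a placeholder):

```python
# Sketch: consuming a strictly-JSON reply with no prefix-stripping,
# assuming Opus 4.7 honors "output JSON only" literally.
import json

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",  # hypothetical model ID
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": (
            "Output JSON only, no code fences, no commentary. "
            "Classify the sentiment of: 'The build finally passed!' "
            'Schema: {"sentiment": "positive|negative|neutral", '
            '"confidence": a number between 0 and 1}'
        ),
    }],
)

# No "Here is the JSON:" preamble to strip; parse the raw text directly.
result = json.loads(response.content[0].text)
print(result["sentiment"], result["confidence"])
```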
On reasoning, it excels in long-context scenarios up to 1 million tokens, scoring 58.6% on the BFS task (vs. 41.2% for Opus 4.6), a marked improvement in logical coherence during complex reasoning.
Agent Enhancement: A Version Built for Agents
If the previous Claude was designed for dialogue, Opus 4.7 is designed for agents.
This is reflected in several aspects.
Overall, Opus 4.7's core agent capabilities have been comprehensively improved.
Several well-known AI companies have shared data from real-world use. Notion's multi-step workflow success rate increased by 14%, with the tool-call error rate cut to a third. In the Vending-Bench 2 long-horizon business simulation, the final balance reached $10,937 (vs. $8,018 for Opus 4.6), showing more robust long-term decision-making. And in Genspark's deployment, three production-grade behaviors came into full play: infinite-loop prevention, consistency, and error recovery.
It also gains file-system memory: it reliably retains key information across sessions and cuts repeated context entry on new tasks by 40%.
Cognition CEO Scott Wu's description is even more vivid:
Opus 4.7 takes long-cycle autonomy to a new level in Devin. It can work continuously for hours, tackling challenges instead of giving up, unlocking a class of in-depth investigative tasks that we couldn't reliably run before.
At the same time, Opus 4.7 also provides developers with a fantastic suite of Agent-related features.
First, a new xhigh effort level has been added, sitting between high and max and serving as the new default.
This gives developers finer-grained control: they can trade reasoning depth against latency and intelligence against token cost, which suits most coding and agent tasks (a rough API sketch follows this list).
Second, a new adaptive thinking mode replaces fixed-budget long thinking: the model decides its own thinking depth, answering simple queries quickly and concentrating effort on complex steps.
Third, task budgets (public beta) let developers cap token consumption and allocate resources sensibly on long-running tasks.
Fourth, Claude Code adds an /ultrareview command, which spins up a dedicated review session that flags subtle bugs and design issues.
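The announcement doesn't spell out the API surface for these knobs, so here is a hypothetical sketch: the effort and task_budget_tokens field names are my guesses based on the descriptions above, routed through the SDK's extra_body escape hatch for not-yet-modeled parameters rather than presented as confirmed fields.

```python
# Hypothetical sketch of tuning the new agent knobs. "effort" and
# "task_budget_tokens" are ASSUMED names based on the article, not
# confirmed API fields; extra_body is the Anthropic SDK's documented
# escape hatch for passing parameters it doesn't yet model.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",  # hypothetical model ID
    max_tokens=4096,
    messages=[{"role": "user", "content": "Audit this repo for flaky tests."}],
    extra_body={
        "effort": "xhigh",              # assumed name for the new default level
        "task_budget_tokens": 200_000,  # assumed name for the beta task budget
    },
    # No fixed thinking budget is set: adaptive thinking, as described,
    # lets the model choose its own thinking depth per step.
)
print(response.content[0].text)
```

The /ultrareview command, by contrast, lives in the Claude Code CLI rather than in the API.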
Deliberately building a reliable model: proactive protection
Anthropic has officially stated that Opus 4.7's cyber capabilities are inferior to Mythos Preview's.
That, however, was intentional.
Behind this "self-imposed limitation" lies Anthropic's consistent commitment to AI safety.
Since its founding in 2021, the company has spent four years carefully building its reputation, attempting to cultivate an image of being "more focused on safe and responsible AI deployment than competitors like OpenAI."
After Mythos Preview sparked heated discussions in the industry about the security risks of powerful AI models, Opus 4.7 was designed as a buffer.
Specifically, Anthropic experimented with selectively reducing Opus 4.7's cyber capabilities during training, so that the model behaves more cautiously on cybersecurity-related tasks.
At the same time, it shipped safeguards that automatically identify and block requests indicating prohibited or high-risk cybersecurity purposes.
For professionals with legitimate cybersecurity needs, Anthropic has launched the Cyber Verification Program.
Security professionals who wish to use Opus 4.7 for legitimate purposes such as vulnerability research, penetration testing, and red team exercises can apply through official channels.
The official announcement also notes, toward the end, that developers migrating from Opus 4.6 to 4.7 should pay special attention to a few things.
First, there's an update to the tokenizer.
Opus 4.7 uses a new tokenizer that improves text-processing efficiency, but the same input may map to roughly 1.0 to 1.35 times as many tokens.
This means that the same prompts may consume more tokens, so a margin needs to be set aside in the cost budget.
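A quick back-of-the-envelope check (plain arithmetic; the usage figure is invented for illustration):

```python
# Margin planning for the tokenizer change: the same input may map to
# roughly 1.0-1.35x as many tokens as under Opus 4.6.
old_input_tokens = 120_000      # example: measured prompt size on Opus 4.6
worst_case_multiplier = 1.35

budget = old_input_tokens * worst_case_multiplier
print(f"Budget for up to {budget:,.0f} input tokens")  # up to 162,000
```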
Second, higher effort levels generate more output tokens.
Opus 4.7 significantly increases the depth of thought at the high and xhigh levels, especially in the later stages of multi-turn dialogues in Agent scenarios.
This "more thoughtful, more reliable" behavior pattern improves output quality, but it also means that token consumption will increase with the length of the session.
Same price as Opus 4.6, plus a few things you should know
Opus 4.7 is now available on all platforms.
In addition to Claude's official channels, the new model is available on all Claude Pro/Max/Team/Enterprise plans and via the official API, as well as on three major cloud platforms: Microsoft Foundry, Google Cloud Vertex AI, and Amazon Bedrock.
Its pricing is consistent with Opus 4.6: $5 per million tokens for input and $25 per million tokens for output.
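A worked example at those rates (the token counts are invented for illustration):

```python
# Cost at the published rates: $5 per million input tokens,
# $25 per million output tokens.
input_price = 5 / 1_000_000     # dollars per input token
output_price = 25 / 1_000_000   # dollars per output token

input_tokens, output_tokens = 80_000, 20_000  # an illustrative agent task
cost = input_tokens * input_price + output_tokens * output_price
print(f"${cost:.2f}")  # $0.90 = $0.40 input + $0.50 output
```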
Although, as noted above, Opus 4.7 may require refactoring prompts and rethinking token-usage strategy, Anthropic's internal testing offers a positive signal.
In an internal agent coding evaluation, token usage efficiency improved compared to Opus 4.6 across all effort levels.
In other words, although the number of tokens per call may increase, the total tokens needed to finish a task are often lower, because the model makes fewer mistakes along the way.
It's like hiring a senior engineer with a higher hourly rate, but he completes the task faster, does less rework, and ultimately the total cost may be lower.
In addition, Opus 4.7 is more deliberate in later turns, especially in agent scenarios.
This means more reliable output, but it also means more token consumption.
Developers can balance performance and cost by adjusting the effort parameter, setting task budgets, or optimizing prompts.
Anthropic recommends starting with the high or xhigh effort level when testing Opus 4.7's coding and agent use cases, and gradually adjusting as needed.
Anyway~
In general, the actual cost of use will vary depending on the way it is used, but in most cases, the efficiency gains from improved capabilities will offset the increase in token consumption.
This could be a worthwhile deal for teams that rely on Claude for complex development work.
This article comes from the WeChat public account "Quantum Bit" (author: Heng Yu) and is published by 36Kr with authorization.