ChatGPT and Claude just shipped major updates simultaneously; those who can't play boss to AI will be left behind.


Just now, a head-on collision worthy of "Mars crashing into Earth" played out in Silicon Valley's AI circle.

As if by prior arrangement, OpenAI and Anthropic simultaneously released their major updates: Claude Opus 4.6 and GPT-5.3-Codex.

If before last night we were discussing how to write good prompts to assist with our work, then after today we may need to learn how to manage AI employees as bosses.

AI creates AI, and incidentally takes over your computer.

Just yesterday, Sam Altman announced on X that Codex had hit the milestone of one million active users. A mere day later, OpenAI followed up with another bombshell:

GPT-5.3-Codex.

The technical documentation contains a very significant statement: "This is the first model that played a key role in our own creation process."

In layman's terms, this means that AI has learned to write its own code, find bugs on its own, and even begun training the next generation of AI. This self-evolutionary capability is directly reflected in a series of benchmark scores.

Remember the OSWorld-Verified benchmark that simulates human computer operation? The previous model only achieved an accuracy of 38.2%, barely passing. But this time, GPT-5.3-Codex jumped to 64.7%.

It's worth noting that the average human level is only 72%. This means that AI is just a hair's breadth away from being as adept as you are at using a mouse, switching screens, and operating software.

In Terminal-Bench 2.0 (command line operation benchmark), it achieved a high score of 77.3%, far surpassing GPT-5.2 (62.2%).

In SWE-Bench Pro, a benchmark that spans four programming languages, resists data contamination, and draws on hardcore real-world engineering tasks, GPT-5.3-Codex also posted state-of-the-art results while using fewer tokens than any previous model.

OpenAI even demonstrated its ability to build independently:

Within days it built, from scratch, a v2 racing game with multiple maps, and even a deep-sea diving game complete with an oxygen-management system.

What impressed me most was GPT-5.3-Codex's understanding of ambiguous intent.

When asked to build a landing page, it automatically converted the annual plan into a discounted monthly price and even thoughtfully added a user-review carousel, all without being told to.

OpenAI's ambitions are written all over its face: Microsoft once said AI would be humanity's copilot, but now AI wants to be the driver, gripping the steering wheel and even fixing the car itself.

Oh, and there's one more interesting detail.

Previously, it was widely rumored that OpenAI had reservations about NVIDIA's AI chips, but this time the official blog specifically emphasized that the design, training and deployment of GPT-5.3-Codex were all completed on the NVIDIA GB200 NVL72 system.

This high-EQ "thank you, NVIDIA" gave Jensen Huang plenty of face.

Saying goodbye to "goldfish memory," Claude stages a dramatic comeback.

Around the same time as the release of GPT-5.3-Codex, Anthropic also presented its own Chinese New Year gift package.

The bad news: the much-anticipated "medium" Claude Sonnet didn't get an update. The good news: Anthropic went straight to the "super-sized" version, Claude Opus 4.6.

Compared with OpenAI's action-first aggression, the Claude Opus 4.6 released today leans into deep thinking and reliability.

Many enterprise users share a pain point called Context Rot: a model claims to support a 200K-token context, but once you pour in lots of data, it remembers the beginning and loses track of the end.

This time, the data presented by Claude Opus 4.6 is simply a "game-changer".

In the MRCR v2 (Long Text Needle in a Haystack) test, Claude Opus 4.6 achieved a recall rate of 76%.

In contrast, the previous generation Sonnet 4.5 had a dismal 18.5%. In a sense, this represents a qualitative leap from being virtually unusable to highly reliable.

This is because Claude Opus 4.6 introduced a truly usable 1M context window for the first time.

What does this mean? It means you can throw hundreds of pages of financial reports or hundreds of thousands of words of code directly at it, and it can not only read them all, but also accurately tell you that there is a problem with the number in the footnote on page 342.

Furthermore, it now supports up to 128K output tokens. What does that mean? You can have it write a long research report or a sizeable codebase in one go, without the output being cut short by length limits.

Besides having a good memory, Opus 4.6 has also achieved a crushing victory in terms of intelligence:

In GDPval-AA (an assessment for high-economic-value tasks such as finance and law), Opus 4.6's Elo score is a full 144 points higher than the industry's second-best (OpenAI's GPT-5.2), and a whopping 190 points higher than its predecessor.

In the complex multidisciplinary reasoning test Humanity's Last Exam, it outperforms all cutting-edge models.

It also performed best in BrowseComp, a test of the ability to find "hard-to-find information" on the internet.

Through this data, Anthropic seems to be sending a signal: if you need to write code, go to OpenAI next door; if you need to handle complex business decisions, legal documents, or financial analysis, Claude is the obvious choice.

What really caught the eye of everyday office workers, though, was the productivity tooling.

On one hand, Anthropic has now integrated Claude directly into Excel and PowerPoint. It can generate PPTs directly from Excel data, preserving not only the layout style but also aligning fonts and templates. In the Claude Cowork collaboration environment, it can even perform autonomous multitasking.

On the other hand, Anthropic took the opportunity to launch an experimental Agent Teams feature in Claude Code, allowing ordinary developers to experience the feeling of "commanding thousands of troops":

Role division: you can designate one Claude session as the Team Lead, which skips the grunt work itself and focuses on breaking tasks down, assigning work orders, and merging code; the other sessions are Teammates, each picking up tasks to execute.

Independent operation: each teammate has its own context window (so no worrying about token blow-up), and they can even message each other directly (inter-agent messaging) to hash out technical details, reporting only the results back to the team lead.

Parallel racing: what is this good for? Imagine hunting a stubborn bug: you can spin up five agents to test five different hypotheses in parallel, racing to rule them out; or during code review, have one teammate play "security expert" checking for vulnerabilities while another plays "architect" checking performance, without interfering with each other.
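The "parallel race" pattern above can be sketched with ordinary Python threads standing in for agent sessions. Everything here is illustrative: check_hypothesis is a hypothetical placeholder, not the real Claude Code Agent Teams API.

```python
# Sketch of the parallel-race debugging pattern: one worker per
# hypothesis, first confirmed hypothesis wins. Threads stand in for
# independent agent sessions with their own context windows.
from concurrent.futures import ThreadPoolExecutor, as_completed

HYPOTHESES = [
    "race condition in the cache layer",
    "off-by-one in pagination",
    "stale config after deploy",
    "timezone mismatch in timestamps",
    "unhandled null in the parser",
]

def check_hypothesis(hypothesis: str) -> tuple[str, bool]:
    # In a real setup this would spawn an agent session to investigate;
    # here the verdict is faked deterministically for the sketch.
    return hypothesis, "pagination" in hypothesis

def race_debug(hypotheses):
    """Fan out one worker per hypothesis; report the confirmed one."""
    with ThreadPoolExecutor(max_workers=len(hypotheses)) as pool:
        futures = [pool.submit(check_hypothesis, h) for h in hypotheses]
        for fut in as_completed(futures):
            hypothesis, confirmed = fut.result()
            if confirmed:
                return hypothesis
    return None

print(race_debug(HYPOTHESES))  # off-by-one in pagination
```

The same fan-out/first-winner shape applies whether the workers are threads, processes, or remote agent sessions; only check_hypothesis changes.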

To probe the limits of Opus 4.6, Anthropic researcher Nicholas Carlini ran a wild experiment with Agent Teams.

Instead of writing code himself, he put up $20,000 in API credits and let 16 Claude Opus 4.6 agents form a "fully automated software development team".

In just two weeks, this group of AIs autonomously ran more than 2,000 programming sessions and wrote, from scratch, a 100,000-line C compiler implemented in Rust.

This AI-written compiler also successfully compiled the Linux 6.9 kernel (covering x86, ARM, and RISC-V architectures) and even ran the Doom game.

While it's not perfect (for example, the generated code isn't as efficient as GCC), this case demonstrates that we're no longer programming with AI, but rather watching an AI team autonomously collaborate, debug, and advance the project.

In addition, it has learned Adaptive Thinking, deciding how long to think based on a task's difficulty. A new "effort intensity" control adds four switchable levels, from Low to Max.

On pricing, Anthropic has been quite generous this time, holding the base price at $5 per million input tokens and $25 per million output tokens. It seems determined to go head-to-head with OpenAI in the enterprise market.
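A quick back-of-the-envelope check of what those base rates mean in practice, assuming the standard input/output split of the "$5/$25 per million tokens" figure:

```python
# Cost estimate at the stated Opus 4.6 base rates:
# $5 per million input tokens, $25 per million output tokens.
INPUT_PER_MTOK = 5.00
OUTPUT_PER_MTOK = 25.00

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the base rates."""
    return (input_tokens / 1_000_000 * INPUT_PER_MTOK
            + output_tokens / 1_000_000 * OUTPUT_PER_MTOK)

# A full 1M-token prompt answered with a 128K-token report:
print(round(estimate_cost(1_000_000, 128_000), 2))  # 8.2
```

So even maxing out both the 1M context window and the 128K output ceiling costs under ten dollars per request at the base rate.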

One is a radical genius; the other, a dependable workhorse.

Renowned AI reviewer Dan Shipper conducted a blind test (Vibe Check) immediately, and his evaluation was remarkably accurate:

Claude Opus 4.6 is characterized by "High Ceiling, High Variance".

It's like a brilliant but occasionally eccentric genius. In testing, it directly solved a feature problem that had stumped the iOS team for two months; it achieved a high score of 9.25/10 in the LFG Benchmark.

But it can also be "overconfident" at times, spouting nonsense with a straight face. If you need a breakthrough inspiration, choose it.

GPT-5.3-Codex is characterized by "High Reliability, Low Variance".

It's like a seasoned, reliable engineer who never lets you down. Reasoning speed is improved by 25%, it makes almost no basic mistakes, and its stability is reassuring.

While it lags slightly behind in creative tasks (LFG score 7.5/10), it is the most efficient workhorse in daily coding and maintenance tasks.

Of course, more important than which model you choose is this: when ChatGPT can fix bugs and even drive your terminal autonomously, and Claude can digest massive piles of documents in one pass and pinpoint the details, the value of prompt engineering is fading while the skill of agent management is taking its place.

We no longer need to break down instructions into minute details like teaching a primary school student. Instead, we need to learn how to define goals, review results, and decide when and how to assign which task to which AI employee, acting as a manager.

This is the new workplace of 2026: your team is staffed by a gang of silicon-based geniuses, and you are the only carbon-based boss.

This article is from the WeChat official account "APPSO" , authored by Discover Tomorrow's Products, and published with authorization from 36Kr.
