GPT-5.2 ran for 7 consecutive days, writing 3 million lines of code and building a Chrome-style browser.

36kr · 01-15

[Introduction] How long can a large model continuously write code? An hour? A day? Or, like most AI programming tools, does the dialogue end once a task is completed? Cursor CEO Michael Truell decided to conduct an extreme stress test!

Michael Truell ran GPT-5.2 in Cursor continuously for a full week.

Not for an hour, not for a day, but for 168 hours straight, without sleep or rest, writing code continuously.

The result?

3 million lines of code. Thousands of files.

AI built a completely new browser from scratch.

Moreover, it's a browser like Chrome.

HTML parsing, CSS layout, text rendering, and a self-developed JavaScript virtual machine—all written by AI itself.

Michael Truell casually tweeted: It basically works! Simple web pages render quickly and correctly.

How long can a model run?

Traditional AI programming tools, such as GitHub Copilot and other early IDE assistants, all follow a question-and-answer model.

The dialogue length is limited, the context is limited, and the task complexity is limited.

Later, so-called Agentic programming emerged—tools such as Claude Code, Cursor Agent, and Windsurf enabled AI to autonomously perform multi-step tasks, read files, run commands, and fix errors.

This is already a significant improvement, but in most cases, tasks are still measured in minutes, at most a few hours.

AI completes a function, humans review it, and then the process moves on to the next task.

But no one has ever tried running a model continuously for a week.

Until GPT-5.2.

The Cursor team kept GPT-5.2 running continuously for a full week, not intermittently.

During this week, it:

  • Wrote over 3 million lines of code
  • Created thousands of files
  • Processed trillions of tokens
  • Built a complete browser rendering engine from scratch

How long can a model actually run?

The answer is: theoretically, it can run indefinitely.

As long as the infrastructure is stable and the task is clear enough, AI can work continuously—without sleep, without food or drink, 24/7, all year round.

Like the Australian shepherd's AI doing its "cyber black-market work" (more on that later).

However, in reality, the "endurance" of different models varies greatly.

The context window is the first hurdle.

Early versions of GPT-3.5 had only a 4K-token context window, so anything said too long ago simply fell out of memory.

Claude 3 introduced 200K context, GPT-4 Turbo followed with 128K, and Gemini 1.5 Pro even claimed to support 1 million tokens.

However, the context length is only a theoretical value—the real test is whether the model can maintain consistency, focus, and execution in long tasks.

In an official blog post, the Cursor team described the key differences it observed across models in these experiments:

  • GPT-5.2 can work autonomously for extended periods, follow instructions precisely, and stay focused without drifting off course;
  • Claude Opus 4.5 tends to wrap up as early as possible, taking shortcuts and frequently handing control back to the user;
  • GPT-5.1-Codex, although trained specifically for coding, plans less well than GPT-5.2 and is prone to stalling.

To put it more bluntly: Opus is like an impatient intern who, after working for a while, wants to ask, "Is this okay? I'll submit it now."

GPT-5.2 is like a seasoned senior engineer; once the task is clearly explained, it puts its head down and gets to work.

This is why Cursor officially claims that GPT-5.2 is a cutting-edge model for handling long-running tasks.

Not just browsers.

Cursor also revealed other experiments currently running: a Java LSP server, a Windows 7 emulator, and an Excel clone.

The numbers are staggering: the AI wrote roughly 550,000, 1.2 million, and 1.6 million lines of code for them, respectively. (Amusingly, the Excel clone needed more code than the Windows one.)

Multi-agent system collaboration

A model writes 3 million lines of code in a week, and that's non-stop writing without human intervention!

This is clearly not a model working solo. How did it do it?

The Cursor team revealed its secret weapon: a multi-agent system.

Initially, they tried to have all agents collaborate equally, synchronizing state by sharing files. The results showed that:

Agents would hold locks for too long or simply forget to release them; twenty agents collapsed to the throughput of two or three.

This is very similar to common problems in human teams: too many meetings, high communication costs, and unclear boundaries of responsibility.

The most effective solution is a layered architecture:

  • Planners: continuously explore the codebase, create tasks, and make high-level decisions.
  • Workers: focus on completing one specific task at a time, ignoring the big picture; after committing, they move on to the next task.
  • Reviewers: judge whether each iteration is satisfactory and decide whether to proceed to the next stage.
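The three roles above can be sketched in a few dozen lines. The following is a minimal illustration in Rust (the browser project's language), not Cursor's actual infrastructure; the names and the trivial "review" check are invented for the example:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Minimal planner/worker/reviewer sketch: a planner fills a task queue,
// several workers drain it one task at a time, a reviewer filters results.
fn run_pipeline(tasks: Vec<String>, n_workers: usize) -> Vec<String> {
    let (task_tx, task_rx) = mpsc::channel::<String>();
    let (done_tx, done_rx) = mpsc::channel::<String>();

    // Planner: break the high-level goal into concrete tasks.
    for t in tasks {
        task_tx.send(t).unwrap();
    }
    drop(task_tx); // close the queue so workers know when to stop

    let task_rx = Arc::new(Mutex::new(task_rx));
    let mut handles = Vec::new();
    for _ in 0..n_workers {
        let rx = Arc::clone(&task_rx);
        let tx = done_tx.clone();
        // Worker: take one task at a time; no view of the big picture.
        handles.push(thread::spawn(move || loop {
            let task = rx.lock().unwrap().recv();
            match task {
                Ok(t) => tx.send(format!("{t}: done")).unwrap(),
                Err(_) => break, // queue closed and drained
            }
        }));
    }
    drop(done_tx); // workers hold the remaining senders

    for h in handles {
        h.join().unwrap();
    }
    // Reviewer: accept only results that pass the check.
    done_rx.iter().filter(|r| r.ends_with("done")).collect()
}
```

The key design choice mirrors the article's lesson: workers never coordinate with each other directly; all ordering flows through the planner's queue and the reviewer's gate, so there are no shared locks to hold too long or forget to release.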

This is almost the organizational structure of a human software company: product managers/architects are responsible for planning, programmers are responsible for execution, and QA is responsible for review.

But the difference is that here, hundreds or thousands of agents work simultaneously.

The Cursor team has enabled hundreds of agents to work collaboratively on the same codebase for weeks with virtually no code conflicts.

This means that AI has learned the collaborative skills that human teams take years to develop.

Browsers have a much deeper "moat" than you think.

If you hear comments like "it's just software that displays web pages," any engineer who has worked on a browser engine would probably just smile wryly.

In the hierarchy of computer science, hand-writing a browser engine is second in difficulty only to hand-writing an operating system.

To give you an idea of what 3 million lines of code means, we need to take a look at Google's Chromium (the open-source parent of Chrome).

As one of the pinnacles of human software engineering, Chromium's codebase long ago passed 35 million lines.

It is not just software; it is essentially an "operating system disguised as an application."

What exactly did GPT-5.2 have to overcome?

First, there's the "chaos theory" of CSS.

Web page layout is never a simple matter of stacking blocks.

The CSS standard is full of historical quirks, cascading rules, and complex inheritance logic.

A former Firefox engineer once put it this way: implementing a fully correct CSS engine is like simulating a universe whose laws of physics can change at any moment. Changing one property on a parent element can instantly collapse the layout of thousands of child elements.

Secondly, there is the "virtual machine within a virtual machine".

This time, the AI not only wrote the interface, but also a JavaScript virtual machine.

Modern web pages run JavaScript code that requires memory management, garbage collection (GC), and a security sandbox.

Handled badly, a web page can consume all your memory, or even let attackers escape the browser's sandbox and take over your computer.
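The memory-management half of that list can be made concrete with a toy example. Below is a minimal mark-and-sweep collector over an index-based arena, written by me for illustration; real JS engines use far more sophisticated collectors, but the core idea is the same: mark everything reachable from the roots, then reclaim the rest.

```rust
// Toy mark-and-sweep GC over an arena of objects, each of which may
// reference one other object by index. Illustrative only.
struct Heap {
    objects: Vec<Option<usize>>, // edge to another object, if any
    marked: Vec<bool>,
    live: Vec<bool>,
}

impl Heap {
    fn new() -> Self {
        Heap { objects: Vec::new(), marked: Vec::new(), live: Vec::new() }
    }

    fn alloc(&mut self, edge: Option<usize>) -> usize {
        self.objects.push(edge);
        self.marked.push(false);
        self.live.push(true);
        self.objects.len() - 1
    }

    // Mark phase: walk everything reachable from one root.
    fn mark(&mut self, idx: usize) {
        if self.marked[idx] {
            return;
        }
        self.marked[idx] = true;
        if let Some(next) = self.objects[idx] {
            self.mark(next);
        }
    }

    // Sweep phase: reclaim every unmarked object; returns how many.
    fn collect(&mut self, roots: &[usize]) -> usize {
        for &r in roots {
            self.mark(r);
        }
        let mut freed = 0;
        for i in 0..self.objects.len() {
            if self.live[i] && !self.marked[i] {
                self.live[i] = false; // unreachable: reclaim
                freed += 1;
            }
            self.marked[i] = false; // reset for the next cycle
        }
        freed
    }
}
```

Get any step of this wrong, and you either leak memory (objects never reclaimed) or free live objects, which is exactly the use-after-free class of bug attackers exploit.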

And the hardest part of all: it chose Rust.

The Rust language is known for its "uncompromising safety," and its compiler is like an extremely neurotic examiner.

When writing business logic, human engineers often spend half their time "arguing" with the compiler over borrow checks and lifetimes.

The AI not only had to understand the domain; it had to keep millions of lines of code beyond reproach from that "examiner."
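To make that friction concrete, here is a toy example of my own (not code from the browser project) showing a classic borrow-check fight: the commented-out naive version holds a reference into a `Vec` while appending to it, which the compiler rejects because a push may reallocate the buffer.

```rust
// Appending to a Vec may reallocate it, so holding a reference into the
// Vec across a push is forbidden by the borrow checker.
fn bump_first(nodes: &mut Vec<i32>) -> Option<i32> {
    // Rejected by the compiler: `first` borrows `nodes` immutably,
    // so the push below would need a conflicting mutable borrow.
    // let first = nodes.first()?;
    // nodes.push(first + 1); // error[E0502]: cannot borrow `*nodes` as mutable

    // Accepted: copy the value out, which ends the borrow immediately.
    let first = nodes.first().copied()?;
    nodes.push(first + 1);
    Some(first + 1)
}
```

Multiply this negotiation by millions of lines, and it is clear why staying on the compiler's good side for a whole week is itself an achievement.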

Pulling off all of these hard problems within seven days, and making them work together, is no longer just "writing fast"; it means the machine has begun to exercise top-tier architectural control.

When AI can "endure loneliness"

But the real bombshell of this news isn't the browser itself; it's the word "uninterrupted."

This is a watershed moment in the evolution of AI.

Prior to this, the AI programming tools we were familiar with (such as the early Copilot) were like this: you write a function header, and it completes five lines of code; you issue a command, and it generates a script.

Their memories are fragmented, and their attention span is short.

When a task becomes slightly more complex, such as "refactoring this module," they often focus on one aspect while neglecting the others, changing one part only to break another, ultimately requiring someone to clean up the mess.

But this time is different. This is a victory for the long-duration task.

These 3 million lines of code are spread across thousands of files.

When the AI writes its 3 millionth line of code, it must still "remember" the architectural rules set in the first line of code.

When the rendering engine and the JavaScript virtual machine clash, it must be able to trace back tens of thousands of lines of code to find the source of the bug.

During those 168 hours, GPT-5.2 must have produced its share of bugs.

But instead of stopping to report errors and wait for human input, it reads the error logs, debugs, refactors, and carries on.

This autonomous closed loop of "write-run-repair" was once the moat that we human engineers were most proud of.

Now, the moat has been filled in.

We are witnessing a qualitative leap in AI, from "chat companion" to "digital labor."

Previously, we instructed AI to perform "tasks," such as "write a Snake game."

Now we direct AI to do "projects," such as "creating a browser."

Spiral of Silence

Although this AI-powered browser is still a long way from being as mature as Chrome, it has proven the viability of the approach.

When computing power can be transformed into extremely complex engineering implementation capabilities, the marginal cost of software development will approach zero.

The most striking thing about this experiment wasn't the rendered webpage on the screen, but the progress bar that had been running silently in the background for seven days.

It works tirelessly and calmly, building the foundation of the digital world at a rate of thousands of characters per second.

Perhaps we should re-examine the definition of "creation".

Only when a tool begins to solve problems alone in the dead of night do we realize that it is no longer just a tool, but a companion.

From Australian man's "cyber black market work" to AI long-duration tasks

The Australian shepherd who drove Silicon Valley crazy with just 5 lines of code did only one thing: keep the AI from stopping until it reached its goal.

The exact commands written in Prompt.md are not the point.

It is the same with the extreme stress test Cursor's CEO conducted: set the goal (a Chrome clone, a Windows clone, an Excel clone), and as long as the goal isn't reached, the AI keeps running.

Which brings us back to the opening question:

How long can an AI operate on its own?

The physical answer is infinity. As long as you have enough computing power, stable infrastructure, and a clear task definition, AI can run indefinitely.

But more importantly, it has changed the economics of software development.

The main costs of traditional software development are manpower and time .

Developing a complex project with a team of 10 people can take anywhere from 6 months to several years. The monthly human resource costs could range from hundreds of thousands to millions of dollars.

Now, AI can complete in a week what used to take months.

The cost may only be some token fees; Emad Mostaque (former CEO of Stability AI) speculates that the Cursor browser project may have consumed approximately 3 billion tokens.

He also posed another question: how many tokens would it take to rewrite a Windows-class operating system, and at what cost?

Tokens keep getting cheaper, just as water and electricity once did; eventually, token-based computing power will be dirt cheap too.

As a result, the economics of software will be completely overturned. For example, the practice of paying for software based on licensing may disappear.

In 2026, software development is undergoing a genetic mutation.

In the past, code was a product of humans typing it out line by line.

In the future, code may simply be the automatic unfolding of human intentions: you describe what you want, and AI can turn it into reality.

How long can a model run?

It can keep running as long as you need it.

References:

https://x.com/mntruell/status/2011562190286045552

https://x.com/leerob/status/2011565729838166269

https://cursor.com/cn/blog/scaling-agents

This article is from the WeChat official account "New Intelligence" , edited by Ding Hui Allen, and published with authorization from 36Kr.
