Agents need both a "fuel gauge" and "brakes": a research paper exposes the murky economics of AI agents.

Imagine this scenario:

You ask an AI agent to fix a bug. It opens the project, reads 20 files, makes some changes, runs the tests, fails, makes more changes, runs them again, still fails... After a dozen rounds of this, it still hasn't fixed the bug.

You shut down your computer and breathe a sigh of relief. Then the API bill arrives.

The numbers might make you gasp: when an AI agent autonomously fixes bugs against the official APIs of overseas providers, a single failed task can burn through millions of tokens and cost anywhere from tens to over a hundred US dollars.

In April 2026, a research paper jointly published by Stanford, MIT, the University of Michigan, and others systematically opened the "black box" of AI agents' consumption in coding tasks for the first time—where exactly the money was spent, whether it was worthwhile, and whether it could be predicted in advance. The answers were shocking.

Finding 1: Agentic coding burns tokens roughly 1,000 times faster than ordinary AI dialogue.

You might think that having an AI write code for you and having an AI chat with you about code should cost about the same, right?

The paper presents a comparison showing:

The token consumption for agentic coding tasks is about 1,000 times that of ordinary code question answering and code reasoning tasks.

The difference is a full three orders of magnitude.

Why is this the case? The paper points to one fact: the money is not spent on "writing code" but on "reading code".

The "reading" here doesn't mean a human reading code. It means that throughout the workflow, the agent keeps feeding the model the whole project context, its history of operations, error messages, and file contents. Every additional round of dialogue makes this context longer, and the model bills by token count: the more you feed it, the more you pay.

An analogy: it's like hiring a repairman who makes you read him the blueprints for the entire building before he even picks up a wrench. Reading the blueprints costs far more than tightening the screw.

The paper sums the phenomenon up in one sentence: what drives an agent's cost is the runaway growth of input tokens, not output tokens.
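
To see why, here is a minimal sketch of the billing math; the per-turn sizes and prices below are illustrative assumptions, not figures from the paper.

```python
# Minimal sketch (illustrative numbers, not from the paper): why input tokens,
# not output tokens, dominate the bill of a multi-turn agent.
SYSTEM_PROMPT = 5_000       # tokens of instructions and tool schemas (assumed)
PER_TURN_CONTEXT = 8_000    # tokens added per turn: file contents, diffs, test logs (assumed)
PER_TURN_OUTPUT = 700       # tokens the model writes per turn (assumed)
PRICE_IN, PRICE_OUT = 3e-6, 15e-6  # dollars per token, hypothetical pricing

def agent_run_cost(turns: int) -> tuple[int, int, float]:
    """Return (input_tokens, output_tokens, dollars) for one agent run.

    Every turn re-sends the whole accumulated history, so input tokens grow
    roughly quadratically with the number of turns while output stays linear.
    """
    input_tokens = output_tokens = 0
    context = SYSTEM_PROMPT
    for _ in range(turns):
        input_tokens += context                          # the full history is billed again
        output_tokens += PER_TURN_OUTPUT
        context += PER_TURN_CONTEXT + PER_TURN_OUTPUT    # the history keeps growing
    cost = input_tokens * PRICE_IN + output_tokens * PRICE_OUT
    return input_tokens, output_tokens, cost

for turns in (3, 10, 30):
    i, o, usd = agent_run_cost(turns)
    print(f"{turns:>2} turns: {i:>9,} in / {o:>6,} out tokens, about ${usd:.2f}")
```

Even with these modest assumptions, a thirty-turn run is dominated almost entirely by input tokens.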

Finding 2: For the same bug, two runs can differ in cost by up to 100%, and the more expensive it gets, the less stable it is.

What's even more troublesome is the randomness.

Researchers ran the same agent on the same task four times and found that:

  • Across different tasks, the most expensive task burns roughly 7 million more tokens than the cheapest one (Figure 2a).
  • Across multiple runs of the same model on the same task, the most expensive run costs roughly twice as much as the cheapest (Figure 2b).
  • Across different models on the same task, the highest and lowest token consumption can differ by as much as 30 times.

The last figure is particularly noteworthy: it means that the cost difference between choosing the right model and choosing the wrong model is not "a little more expensive", but "an order of magnitude more expensive".

What's even more disheartening is that spending a lot doesn't necessarily mean doing a good job.

The paper discovered an inverted U-shaped curve:

How accuracy trends with cost level:

  • Low cost: low accuracy (likely under-investment).
  • Medium cost: accuracy is typically at its highest.
  • High cost: accuracy declines instead of rising, entering a "saturation zone".

Why does this happen? The paper provides the answer by analyzing the specific operations of the agent—

In high-cost runs, the agent spends a large share of its effort on repeated operations.

Research found that in high-cost runs, about 50% of file-viewing and file-editing operations are repeats: the agent reads the same file again and again and edits the same lines of code again and again, like someone pacing circles in a room, getting dizzier the longer they circle and circling more the dizzier they get.

The money wasn't spent on solving the problem, but on getting "lost".
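
As a concrete illustration, here is a minimal sketch of how such a repetition ratio could be measured from an agent's action log; the trace format and tool names are assumptions made for this example, not the paper's instrumentation.

```python
# Hypothetical trace format: each action is (tool, target), e.g.
# ("view_file", "src/parser.py") or ("edit_file", "src/parser.py:L120").
from collections import Counter

def repetition_ratio(actions: list[tuple[str, str]]) -> float:
    """Fraction of file view/edit actions that repeat an earlier identical action."""
    relevant = [a for a in actions if a[0] in ("view_file", "edit_file")]
    if not relevant:
        return 0.0
    counts = Counter(relevant)
    repeats = sum(c - 1 for c in counts.values())   # everything after the first occurrence
    return repeats / len(relevant)

trace = [
    ("view_file", "src/parser.py"), ("edit_file", "src/parser.py:L120"),
    ("run_tests", "all"), ("view_file", "src/parser.py"),       # re-reading the same file
    ("edit_file", "src/parser.py:L120"), ("run_tests", "all"),  # re-editing the same line
]
print(f"repetition ratio: {repetition_ratio(trace):.0%}")  # -> 50%
```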

Finding 3: "Token efficiency" varies drastically across models: GPT-5 is the most efficient, while some models burn 1.5 million more tokens.

The paper tested eight frontier large models as agents on the industry-standard SWE-bench Verified (500 real GitHub Issues). In dollar terms, a less token-efficient model costs only a few tens of dollars more per task than an efficient one, but in enterprise settings that run hundreds of tasks a day, the difference adds up to real money.
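
A quick back-of-the-envelope calculation shows how fast that gap compounds; the figures here are purely illustrative assumptions, not numbers from the paper.

```python
# Back-of-the-envelope scaling of a per-task cost gap (all figures assumed).
extra_cost_per_task = 30      # dollars more per task for the less efficient model
tasks_per_day = 300           # enterprise workload
days_per_month = 30
print(f"extra spend per month: ${extra_cost_per_task * tasks_per_day * days_per_month:,}")
# -> extra spend per month: $270,000
```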

An even more interesting finding is that token efficiency is an "inherent characteristic" of the model, rather than a result of the task.

Researchers compared tasks that all models solved successfully (230 tasks) with tasks that all models failed on (100 tasks), and found that the relative rankings of the models remained almost unchanged.

This shows that some models are naturally "talkative," which has little to do with the difficulty of the task.

Another thought-provoking finding: models lack "stop-loss awareness".

When faced with a hard task that no model can solve, an ideal agent should give up as early as possible rather than keep burning money. In reality, models generally consume more tokens on failed tasks. They don't give up; they keep exploring, retrying, and rereading the context, like a car with no low-fuel warning light that keeps driving until it runs dry.

Finding 4: What humans find difficult, agents don't necessarily find expensive—a complete misalignment in difficulty perception.

You might think: At least I can estimate the cost based on the difficulty of the task, right?

The paper commissioned human experts to rate the difficulty of 500 tasks and then compared the results with the agent's actual token consumption.

Result: There is only a weak correlation between the two.

In layman's terms: tasks that humans find incredibly difficult can be easily handled by agents with minimal cost; tasks that humans consider a piece of cake can leave agents burning through cash and questioning their very existence.

This is because the difficulty of what humans and AI "see" is fundamentally different:

  • Humans look at: logical complexity, algorithmic difficulty, and how hard the business logic is to understand.
  • The agent looks at: how large the project is, how many files must be read, how long the exploration path is, and whether the same files get modified over and over.

A bug that a human expert thinks can be fixed by "changing one line" might require an agent to understand the entire codebase structure to pinpoint that line—and just "reading" it would consume a large number of tokens. On the other hand, an algorithm problem that a human might find "logically convoluted" might be solved by an agent that happens to know the standard solution, allowing it to be fixed in no time.

This leads to an awkward reality: it is almost impossible for developers to intuitively estimate the operating cost of an agent.

Finding 5: The model itself can't accurately estimate how much it will cost, either.

Since humans can't predict accurately, why not let AI make the predictions itself?

Researchers designed a clever experiment: before actually starting to fix the bug, the agent "inspects" the codebase and estimates how many tokens it will need to consume—but does not actually perform the fix.

What was the result?

Every model failed.

The best result came from Claude Sonnet-4.5, whose predictions of output tokens correlated with actual usage at only 0.39 (out of 1.0). Most models landed between 0.05 and 0.34, and Gemini-3-Pro was the lowest at just 0.04, which is essentially guessing.

Even more absurdly, every model systematically underestimated its own token consumption. In the scatter plot of Figure 11, almost all data points fall below the "perfect prediction" line: the models thought they "wouldn't spend that much," when in fact they spent far more. And this underestimation bias is even more pronounced when no examples are provided.
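
For readers unsure what a "prediction correlation of 0.39" means in practice, here is a minimal sketch that assumes the metric is a Pearson correlation between predicted and actual token counts (the paper may use a different statistic); the data points are made up for illustration.

```python
import numpy as np

# Hypothetical data: an agent's predicted vs. actual token consumption on five tasks.
predicted = np.array([120_000,  90_000, 200_000, 150_000,  80_000])
actual    = np.array([450_000, 300_000, 900_000, 250_000, 600_000])

r = np.corrcoef(predicted, actual)[0, 1]   # Pearson correlation, ranges from -1 to 1
bias = (predicted - actual).mean()         # negative means systematic underestimation
print(f"correlation: {r:.2f}, mean under-estimate: {bias:,.0f} tokens")
```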

Even more ironically, the prediction itself costs money.

For Claude Sonnet-3.7 and Sonnet-4, the prediction itself cost more than twice as much as the task. In other words, asking them to estimate the bill is more expensive than just doing the work.

The paper's conclusion is straightforward:

Currently, cutting-edge models cannot accurately predict their own token usage. Clicking "Run Agent" is like opening a blind box: you only find out how much you spent when the bill arrives.

Behind this murky accounting lies a much larger industry problem.

Having read this far, you might ask: What do these findings mean for businesses?

1. Agents are breaking the "monthly subscription" pricing model.

The paper points out that subscription models like ChatGPT Plus are feasible because the token consumption of ordinary conversations is relatively controllable and predictable. However, agent tasks completely shatter this assumption—a single task can burn through a massive amount of tokens because the agent gets stuck in a loop.

This means pure subscription pricing may not be sustainable for agent scenarios, and pay-as-you-go will remain the most realistic option for quite a while. The problem with pay-as-you-go, though, is that usage itself is unpredictable.

2. Token efficiency should be the "third criterion" for model selection.

Traditionally, companies select models along two dimensions: capability (can it do the job) and speed (how fast can it do it). This paper adds a third, equally important dimension: token efficiency (how much it costs to get the job done).

A model that is slightly less capable but three times more efficient may be more economically valuable than the "most powerful but most expensive" model in a large-scale scenario.

3. The agent needs both a "fuel gauge" and "brakes".

The paper points to a noteworthy future direction: budget-aware tool-use policies. Simply put, give the agent a "fuel gauge" and "brakes": when token consumption approaches the budget, force the agent to stop fruitless exploration instead of burning through whatever resources remain.

Currently, almost all mainstream agent frameworks lack this mechanism.
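
The article doesn't describe a concrete design, so here is a minimal sketch of what such a mechanism could look like between agent steps; the class, thresholds, and trace below are assumptions made for illustration, not the paper's proposal or any existing framework's API.

```python
# Minimal sketch of a "fuel gauge + brakes" helper an agent loop could consult
# between steps. All names, thresholds, and the trace are illustrative assumptions.
from collections import Counter

class TokenBudget:
    def __init__(self, max_tokens: int, warn_ratio: float = 0.8, max_repeats: int = 3):
        self.max_tokens, self.warn_ratio, self.max_repeats = max_tokens, warn_ratio, max_repeats
        self.spent = 0
        self.actions = Counter()

    def record(self, input_tokens: int, output_tokens: int, action: tuple) -> str:
        """Update the gauge after one agent step and say what to do next."""
        self.spent += input_tokens + output_tokens
        self.actions[action] += 1
        if self.spent >= self.max_tokens:
            return "stop: budget exhausted"                 # hard brake
        if self.actions[action] >= self.max_repeats:
            return f"stop: looping on {action}"             # brake on repeated actions
        if self.spent >= self.warn_ratio * self.max_tokens:
            return "warn: 80% of budget used, wrap up"      # fuel-gauge warning
        return "continue"

# Illustrative trace: the agent keeps re-reading the same file.
budget = TokenBudget(max_tokens=500_000)
for step in range(6):
    decision = budget.record(80_000, 1_000, ("view_file", "src/parser.py"))
    print(f"step {step}: spent={budget.spent:,} -> {decision}")
    if decision.startswith("stop"):
        break
```

A real framework would need subtler policies (for example, letting the model argue for extra budget on a promising lead), but even a crude gauge like this would head off the worst runaway runs.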

The "cash-burning problem" of agents is not a bug, but an inevitable growing pain for the industry.

This paper doesn't expose the flaws of any particular model; it exposes a structural challenge of the entire agent paradigm: as AI moves from "question and answer" to "autonomous planning, multi-step execution, and repeated debugging," unpredictable token consumption becomes almost inevitable.

The good news is that this is the first time someone has systematically laid this messy account out on the table. With this data, developers can choose models, set budgets, and design stop-loss mechanisms more deliberately, and model vendors have a new direction to optimize in: not just making models stronger, but making them more economical.

After all, before AI agents truly enter the production environments of various industries, ensuring every penny is spent transparently is more important than writing every line of code beautifully. (This article was first published on the TMTPost app, author: Silicon Valley Tech news, editor: Zhao Hongyu)

Note: This article is based on the preprint paper *How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks* (Bai, Huang, Wang, Sun, Mihalcea, Brynjolfsson, Pentland, Pei) published on arXiv on April 24, 2026. The authors are from institutions including the University of Virginia, Stanford University, MIT, and the University of Michigan. This research has not yet been peer-reviewed.
