Over the past few years, more efficient models and infrastructure have been driving down the cost of AI tokens, with providers racing to cut prices.
However, with the recent surge in popularity of powerful agentic applications like OpenClaw, API (Application Programming Interface) bills have defied the trend and skyrocketed. Beyond the massive context accumulation that agent operation itself entails, a hidden money-devouring beast lurks behind it all: the increasingly long, even runaway, chain of thought (CoT).
Since OpenAI's o1 model revolutionized test-time compute, the idea that thinking longer yields better performance has become an article of faith on the road to general artificial intelligence. Today, when we call flagship reasoning models, their behind-the-scenes thinking time has indeed grown dramatically, often churning out thousands of words of internal monologue. OpenAI revealed in its January 2025 earnings call that the average token cost per request for the o1 series is 2.7 times that of GPT-4o, and on some programming tasks the multiple can reach five times or higher.
And this trend shows no signs of stopping. The newly released GPT 5.4 Pro, for example, took 5 minutes and 18 seconds, at a cost of $80, to answer a simple "Hi."
Are such long chains of thought actually useful? When are they useful? How can we make the model think less but think better? These questions have plagued researchers since o1's debut. Interpretations and partial solutions have emerged, but none has fully solved the problem of selecting effective thinking tokens. To date, the mainstream industry approach remains routing: letting the model decide for itself whether thinking is necessary.
In February 2026, a paper from Google titled "Think Deep, Not Just Long" offered a more fundamental solution.
Simply put, to judge whether a model's thinking is useful, you have to look at how deep that thinking runs.
01 More is not necessarily better
The advent of chain-of-thought actually predates ChatGPT. In 2022, two papers published almost simultaneously by Google researchers established CoT as the paradigm for reasoning. The first, "Chain-of-Thought Prompting," showed that adding reasoning chains to few-shot examples lets large models achieve significant gains on arithmetic, commonsense, and symbolic reasoning tasks; under certain settings, accuracy jumped from near zero to over 60%. The second, "Zero-shot CoT," proposed the now-famous "Let's think step by step." Simply appending it to a prompt activates the model's multi-step reasoning.
Both findings quickly became industry consensus, and almost all applications requiring complex reasoning began enabling CoT by default. Researchers naturally assumed that if CoT was effective, longer CoT should be even more effective.
From 2023 through the first half of 2024, much of the work revolved around getting models to generate longer, more elaborate reasoning chains. Some methods induced more detailed decompositions through prompt engineering; others rewarded longer chains through reinforcement learning; still others distilled the long reasoning chains generated by large models into smaller ones during training. This pursuit of length peaked with the release of o1, which set off the test-time compute revolution, whose core idea was to generate longer internal thinking at inference time.
The problem surfaces
But starting in 2024, researchers from different institutions began to question the validity of these ideas.
A team at Stanford University, for example, noticed while analyzing the reasoning behavior of o1 and Claude that on simple arithmetic problems these models often generate hundreds or even thousands of tokens of reasoning text, most of it repeated verification, self-doubt, and attempts at alternative solutions, while a human could solve the same problems in two or three mental steps.
When they manually shortened these lengthy traces, answer accuracy did not drop; sometimes it even rose slightly. This suggests the model may not actually need that much thinking; it is simply driven by post-training rewards to keep generating.
In May 2025, a paper titled "When More is Less" characterized the phenomenon more precisely. Through controlled experiments, the authors constructed reasoning chains of varying lengths and plotted length-accuracy curves on tasks of varying difficulty. They found that the relationship between chain length and final accuracy is actually an inverted-U curve.
Adding steps helps up to the peak of the curve, but beyond it accuracy declines monotonically. Moreover, the optimal length shifts with task difficulty and model capability: for harder problems it moves right, while for more capable models it moves left, suggesting that stronger models are better at knowing when to stop.
The authors call this phenomenon "simplicity bias." Once the model has grasped the essence of a problem, generating more tokens only piles up noise and interference. Past a critical point, the model sinks into the quagmire of "overthinking." In this inverse-scaling regime, the extra tokens you pay real money for not only fail to buy intelligence but actively reduce accuracy.
Anatomy of CoT
So where exactly do all those extra tokens, in chains often tens of thousands of words long, actually go?
Long reasoning chains form through three main patterns, and each of them runs into overthinking.
The first is linear expansion. The model advances step by step, producing new intermediate results at each step, like working through a draft. This is the classic CoT form. The overthinking problem here is mainly that the model often does not know when to stop: it keeps verifying after finding the answer, or re-solves the same problem by three different methods.
The second is the reflection loop. After producing an initial answer, the model triggers a self-questioning mechanism and keeps generating self-correcting text. This is genuinely valuable on hard problems, but reflection on easy problems becomes overthinking.
The third is multi-path sampling. To improve robustness, the system generates a dozen or even dozens of different reasoning paths and finally picks the most consistent answer by voting. This is indeed effective on particularly hard problems, but at the cost of multiplying compute. Moreover, a considerable share of the candidate paths are unreliable, and failing to prune them is itself overthinking.
Analyzing the right half of the inverted-U curve, the authors of "When More is Less" found that over 90% of samples with decreased accuracy contained large amounts of repeated validation and empty reflection. In other words, the essence of overthinking is compulsive repetition: even when the model already knows the answer, its training drives it to keep generating variations and confirmations, and this redundancy is what drags accuracy down.
Only by understanding these three mechanisms and their failure modes can we design targeted control strategies.
Attempts to control length
By mid-2025, academia and industry had reached consensus that overthinking exists. The question shifted from "does overthinking exist" to "how to identify and control it accurately."
The most direct approach is a hard cap. Methods like Token-Budget-Aware LLM Reasoning explicitly tell the model in the prompt, "you have only this many tokens to use," forcing it to be concise. But this blunt approach has a fatal flaw: it starves genuinely hard problems of the thinking they need.
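The idea can be sketched in a few lines. This is a minimal illustration, not the paper's actual template; the wording of the budget instruction and the `budgeted_prompt` helper are assumptions made for illustration.

```python
# Minimal sketch of token-budget-aware prompting, in the spirit of
# "Token-Budget-Aware LLM Reasoning". The instruction wording and the
# helper name are illustrative assumptions, not the paper's template.

def budgeted_prompt(question: str, budget: int) -> str:
    """Wrap a question with an explicit reasoning-token budget."""
    return (
        f"{question}\n"
        f"Let's think step by step, but use at most {budget} tokens "
        f"of reasoning before giving the final answer."
    )

prompt = budgeted_prompt("What is 17 * 24?", budget=50)
print(prompt)
```

The flaw the text describes is visible here: the budget is fixed before the model has seen how hard the question really is.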
A better option is to let the system decide dynamically when to stop. The method proposed in "REFRAIN: Reasoning Efficiency via Fine-grained Reflection and Adaptive Inference" monitors redundancy signals in real time during inference. When the model starts repeating verification, cycling through reflection, or looping in self-doubt, the system cuts it off. Without modifying the model itself, this stopping strategy reduces token consumption by 20% to 55% while maintaining or even improving accuracy.
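A toy version of redundancy-triggered stopping can be sketched as follows. The n-gram repetition signal, the window size, and the threshold are illustrative assumptions, not REFRAIN's actual detector.

```python
# Toy sketch of redundancy-triggered early stopping, loosely inspired by
# REFRAIN's idea of halting when reasoning starts looping. The n-gram
# repetition signal and thresholds are illustrative assumptions.

def repetition_rate(tokens, n=3):
    """Fraction of n-grams in the stream that have appeared before."""
    seen, repeats, total = set(), 0, 0
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        total += 1
        if gram in seen:
            repeats += 1
        seen.add(gram)
    return repeats / total if total else 0.0

def generate_with_stop(stream, threshold=0.3, n=3):
    """Consume tokens until the redundancy signal crosses the threshold."""
    out = []
    for tok in stream:
        out.append(tok)
        if len(out) > 20 and repetition_rate(out, n) > threshold:
            break  # reasoning has started looping; stop decoding
    return out

# A looping "reasoning" stream: the same check re-verified forever.
loop = ["x", "=", "4", "check", ":", "x", "=", "4"] * 50
truncated = generate_with_stop(loop)
print(len(truncated), "of", len(loop), "tokens kept")
```

A real system would watch the decoder's token stream rather than a list, but the shape of the intervention is the same: a monitor outside the model decides when thinking has stopped being thinking.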
Another approach is routing. Frameworks like DynaThink and DAST run a quick assessment of each problem: for easy questions like "what is 2+3?" they output the answer directly; for complex problems they launch a full reasoning chain and multi-path sampling. But GPT-5's much-criticized router, which frequently misjudged difficulty, shows this method is not perfect either.
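A router of this kind can be sketched in miniature. The keyword-and-length heuristic and the `route` helper are illustrative assumptions; real systems use a learned classifier or let the model grade difficulty itself, and the misjudgments described above show how brittle any such signal can be.

```python
# Toy difficulty router in the spirit of DynaThink/DAST: a cheap check
# decides whether to answer directly or invoke full reasoning. The
# heuristic (length + keyword cues) is an illustrative assumption.

HARD_CUES = ("prove", "integral", "optimize", "derive")

def route(question: str) -> str:
    """Return 'direct' for easy questions, 'think' for hard ones."""
    hard = (len(question.split()) > 15
            or any(cue in question.lower() for cue in HARD_CUES))
    return "think" if hard else "direct"

print(route("What is 2 + 3?"))                     # routed to direct answering
print(route("Prove that sqrt(2) is irrational."))  # routed to full reasoning
```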
For compute-hungry setups that rely on diverse sampling and voting, researchers developed an early-stopping mechanism. Early-Stopping Self-Consistency (ESC) monitors the sampling process continuously; once the sampled answers reach a stable consensus, there is no need to waste compute generating more. On math benchmarks like GSM8K, this can cut the number of samples by 80%.
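The ESC idea, sampling in small windows and stopping at consensus, can be sketched like this. The window size and the stub sampler are illustrative assumptions standing in for real model calls.

```python
# Sketch of Early-Stopping Self-Consistency (ESC): draw answers in small
# windows and stop as soon as one window is unanimous, instead of always
# spending the full sample budget. Window size and sampler are stubs.

import random
from collections import Counter

def esc(sample_fn, window=4, max_samples=40):
    """Sample answers window by window; stop when a window agrees."""
    answers = []
    while len(answers) < max_samples:
        batch = [sample_fn() for _ in range(window)]
        answers.extend(batch)
        if len(set(batch)) == 1:  # unanimous window: consensus reached
            break
    return Counter(answers).most_common(1)[0][0], len(answers)

# Stub sampler: an "easy" question where the model almost always says 42.
random.seed(0)
answer, used = esc(lambda: 42 if random.random() < 0.9 else 41)
print(answer, used)  # consensus on 42 after the first window of 4 samples
```

On an easy question the loop exits after the first window, which is exactly the 80%-style saving the text describes; a hard question with scattered answers would keep sampling toward the full budget.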
A more radical approach modifies the model itself at the source. Some researchers pin their hopes on post-training: the paper "Let's Verify Step by Step" sought to guide models with a Process Reward Model (PRM) that scores each reasoning step, so that once trained, the model would follow the best solution path and avoid unnecessary detours. Alternatively, one can fine-tune the model on carefully curated reasoning traces that are concise but correct, so that its output converges toward that style. But designing and tuning a PRM remains very hard to control.
Yet all these methods face a common dilemma: none has a truly reliable signal for deciding when the model is still thinking about something valuable and when it has begun piling up useless text.
Current solutions mostly rely on surface features such as recurring patterns, confidence shifts, consistency convergence, and historical statistics. These are all indirect indicators, like judging the game from the sidelines.
So, what is the essential indicator that distinguishes effective thinking from ineffective redundancy?
02 Seeking Useful Thinking
Google's paper argues that the most direct way to find evidence of effective thinking is to insert probes deep inside the Transformer and observe whether the model is actually thinking as it generates each token.
When a large model generates a token, the signal passes through dozens or even hundreds of network layers. The researchers found that how hard the model has to work internally varies significantly depending on the token being generated.
For simple function words, stock phrases, or concepts the model already knows cold, such as "and," "is," or the "=" in a formula, the predicted distribution is already locked in at a very shallow layer of the Transformer. The massive compute of the remaining dozens of layers is a mere formality for such tokens, making no substantive revision.
But for the key tokens that genuinely require reasoning, such as the numbers in an equation, logical connectives, or the answer itself, the model's prediction keeps being revised deep into the network before converging.
The researchers used a divergence measure to quantify the gap between intermediate-layer distributions and the final output, and proposed the "Deep Thinking Rate" (DTR). It is defined as: in a given text, what fraction of tokens keep oscillating until the deepest layers of the network?
If most tokens need deep computation to settle, DTR is high; if they are all easy tokens settled by shallow computation, DTR is low.
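Under that definition, a toy DTR computation might look like the following, using a logit-lens-style per-layer distribution for each token. The KL threshold, the "deep layer" cutoff, and the synthetic distributions are all illustrative assumptions; the paper's exact divergence and definition may differ.

```python
# Toy sketch of the Deep Thinking Rate (DTR). Each token has a probability
# distribution at every layer; a token "converges" at the first layer whose
# distribution is KL-close to the final layer's, and DTR is the fraction of
# tokens that converge only in the deep half. Threshold and cutoff are
# illustrative assumptions.

import numpy as np

def kl(p, q):
    """KL divergence between two dense probability vectors."""
    return float(np.sum(p * np.log(p / q)))

def convergence_layer(layer_dists, eps=0.05):
    """Index of the first layer eps-close (in KL) to the final layer."""
    final = layer_dists[-1]
    for i, p in enumerate(layer_dists):
        if kl(final, p) < eps:
            return i
    return len(layer_dists) - 1

def dtr(tokens_layer_dists, deep_frac=0.5):
    """Fraction of tokens that only converge in the deepest layers."""
    n_layers = len(tokens_layer_dists[0])
    deep_start = int(n_layers * deep_frac)
    deep = sum(convergence_layer(d) >= deep_start for d in tokens_layer_dists)
    return deep / len(tokens_layer_dists)

# Two synthetic tokens over 8 layers and a 3-word vocabulary: an "easy"
# token locked in at layer 0, and a "hard" token that keeps shifting.
easy = [np.array([0.90, 0.05, 0.05])] * 8
hard = ([np.array([1/3, 1/3, 1/3])] * 6
        + [np.array([0.2, 0.2, 0.6]), np.array([0.05, 0.05, 0.90])])
print(dtr([easy, hard]))  # → 0.5: one of the two tokens needed deep compute
```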
This indicator can also explain many of the fundamental questions raised earlier.
Why is length negatively correlated with accuracy? Because bloated reasoning chains are stuffed with shallow filler like "let me reconsider..." and "wait, maybe..." that lengthens the sequence without producing substantive thought.
Why can short chains stay accurate? Because they are highly condensed: nearly every token demands deep computation, and DTR approaches its ceiling.
The paper gives a telling example. On the same geometry problem, an incorrect sample used 27,724 tokens with a DTR of only 13.9%, while a correct sample used just 3,725 tokens with a DTR of 19.0%. The former is mostly filler; the latter is dense with signal.
To show they had found the right yardstick, the authors tested multiple reasoning-model families, including GPT-OSS, DeepSeek-R1, and Qwen3, on the 2024 and 2025 AIME math competitions, the 2025 HMMT, and the graduate-level GPQA science benchmark. Across the board, DTR was positively correlated with accuracy.
So DTR does give us a benchmark for thinking quality grounded in the model's internal dynamics rather than surface word counts.
Building on DTR, the paper also proposes Think@n, an optimization aimed at the most expensive mode, multi-path sampling. The traditional approach generates dozens of complete reasoning chains before voting; Think@n has each thread emit only a short initial prefix of about 50 tokens, immediately computes its DTR, and terminates the threads with very low DTR that are clearly reciting boilerplate, reserving compute for the high-potential candidates that show strong deep computation from the outset. Experiments show this matches or beats the traditional approach with half the tokens.
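The pruning step can be sketched as follows, with stub scores standing in for real DTR values computed from layer activations. The keep ratio and the scores themselves are hypothetical.

```python
# Toy sketch of Think@n-style pruning: score each candidate thread's short
# prefix with a DTR-like signal and keep only the top fraction. The scoring
# stub and keep ratio are illustrative assumptions.

def think_at_n(prefixes, dtr_of, keep=0.5):
    """Rank candidate prefixes by DTR and drop the shallow ones."""
    scored = sorted(prefixes, key=dtr_of, reverse=True)
    k = max(1, int(len(scored) * keep))
    return scored[:k]  # only these threads keep generating

# Stub DTR scores for four candidate prefixes (hypothetical values).
scores = {"A": 0.19, "B": 0.05, "C": 0.14, "D": 0.03}
survivors = think_at_n(list(scores), scores.get)
print(survivors)  # → ['A', 'C']: the two high-DTR threads survive
```

The saving comes from the pruned threads: half the candidates never generate past their 50-token prefix.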
Still, the paper leaves one major gap: it acts only as a referee, pruning generated candidates at test time.
But the path forward is clear: DTR can be turned into a reward signal for the reinforcement learning (RL) stage. If, during alignment, we stop rewarding the model merely for reaching the correct answer and instead make high DTR density part of the reward function, we can change the model's behavior at the root, forcing it to learn to compress massive computation into extremely refined, high-quality output.
This is the essential shift from thinking longer to thinking deeper. Intelligence is no longer measured by token count but by computational density.
03 What is the most efficient way of thinking?
DTR does give us a good lens for observing whether a model is doing meaningful thinking, largely solving the problem of identifying overthinking.
But it does not answer why deeper thinking is more effective.
A recent paper by Carnegie Mellon and NYU, "From Entropy to Apparent Complexity: Reinventing Information Theory for Computationally Constrained Agents," provides us with a clue in information theory.
Traditional information theory focuses on random information, or entropy. Shannon tells us that the amount of information in a text depends on its unpredictability. The higher the entropy, the more information is contained within it.
Yet this completely fails to explain self-play systems like AlphaGo. The only input is the rules of the game, which have very low entropy, but through computation (the reasoning process) the model masters an enormous variety of strategies.
The paper argues that the key is that all intelligent agents have finite compute. A bounded system cannot extract unlimited learning from raw entropy; for such systems, the value of data lies not in its randomness (entropy) but in the learnable structural complexity it contains.
For observers with limited compute, whether human players or AI models, brute-force enumeration of the entire game tree is impractical, so they must extract higher-order abstract patterns, patterns whose complexity far exceeds the game rules themselves.
This is why CoT is useful.
They defined this structural complexity as epiplexity.
A randomly generated string of API keys may have high entropy, but its epiplexity is near zero, because a model learns nothing transferable from it. Conversely, a piece of algorithmic code may have low entropy but high epiplexity, because understanding it forces the model to build complex internal representations.
This explains why high-DTR reasoning is more efficient: it generates more epiplexity.
When the model reasons deeply, it is not simply retrieving memories or applying surface rules; it is building new cognitive structures in real time.
Traditional theory would say this is impossible, since deterministic transformations cannot add information. But epiplexity tells us these structures do not appear out of thin air; they are created by the computational process itself.
The paper thus recasts the reasoning process as a generator of structured information.
The traditional view holds that reasoning is a search of the solution space. The epiplexity perspective says good reasoning is not mere search but a dynamic re-representation of the solution space, just as mathematicians do not brute-force enumerate theorems but invent new mathematical objects and proof techniques that simplify hard problems.
What these steps share is that they all add structure to the problem space. Truly valuable reasoning tokens are those that force the model to build new internal structures, discover new patterns, and extract more abstract rules. Their hallmark is that generating them requires the full computational depth of the network (high DTR), because shallow pattern matching no longer suffices.
This also reshapes our understanding of intelligence: what matters is not how much information is processed but how much structure is created. AlphaZero created Go strategies through self-play, human scientists create physical theories through experiment, and language models create structured representations of problems through deep reasoning. All are, at bottom, the same thing: computationally bounded agents trying to extract compressible patterns from the world.
If we place this evolution from CoT to overthinking to deep thinking in a larger historical context, it is a microcosm of AI systems shifting from capability-driven to resource-rational. The early deep learning revolution answered "can we": can we recognize images, generate text, win at Go? The test-time compute revolution pushed toward "can we do harder things": prove mathematical theorems, write bug-free code, plan complex projects?
But now that these capabilities have matured, the marginal question has become "how do we do it most economically": how to reach the same quality with the least compute, how to allocate resources dynamically by task difficulty, and how to avoid burning compute in useless directions.
The emergence of the overthinking problem is an inevitable product of this transitional period.
From this perspective, DTR and epiplexity are not just measurement tools, but a new design philosophy. They tell us that the value of thinking lies not in how much text is generated, but in how much structured computation is invoked behind the text, and to what extent these computations can be transferred to new tasks.
This is the real leap from Think Long to Think Deep, and an effective answer to the compute bottleneck in a world where tokens only grow more costly.
This article is from the WeChat official account "Tencent Technology" (author: Bo Yang) and is published with authorization by 36Kr.



