[Introduction] Traditional intelligent agent systems struggle to balance stability and learning capability. Researchers from Stanford and other institutions have proposed the AgentFlow framework, which continuously optimizes its planning policy during inference through modular collaboration and real-time reinforcement learning, enabling small models to outperform GPT-4o on multiple tasks and opening a new avenue for AI development.
The development of AI agents is currently facing a dilemma:
On the one hand, training a "multi-functional" large model to simultaneously handle reasoning, planning, and tool invocation has the advantage of integration, but training is often unstable on long-chain reasoning and scalability is limited.
On the other hand, while prompt-based agent systems are flexible, they lack the ability to learn and self-optimize, and cannot continuously evolve from interactions.
How can we break through this bottleneck?
A research team from Stanford University, Texas A&M University, UC San Diego, and Lambda has provided a new answer: enabling intelligent agent systems to perform online reinforcement learning within the inference "flow," thereby achieving continuous self-improvement and capability evolution.
The AgentFlow framework they propose adopts a modular architecture in which four specialized agents work together, paired with a purpose-built Flow-GRPO algorithm, so the system can continuously optimize its decision-making policy in real interactive environments.
Experimental results show that AgentFlow, with only 7B parameters, outperforms GPT-4o (approximately 200B parameters) and Llama-3.1-405B across multiple tasks, including search, mathematics, and science.
The team leader shared their work on Twitter, which garnered significant attention.
This work has reached number two on the Hugging Face Papers daily chart and is the most popular Hugging Face project of the week.
The Credit Assignment Problem in Long-Chain Reasoning
The core challenge in training intelligent agent systems is the multi-turn credit assignment problem: how to accurately determine the contribution of each decision to the final result in a long-term, reward-sparse environment?
Traditional single-model approaches integrate all functionalities into a single LLM, using special tags (such as <tool_call>) to unify the output of thoughts, tool calls, and responses.
This approach is effective for short-chain tasks, but it is prone to problems in complex scenarios: excessively long inference chains lead to unstable training, errors in tool selection are difficult to trace, and the strategy cannot be dynamically adjusted based on environmental feedback.
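To make the single-model convention above concrete, here is a toy illustration: one LLM generation interleaves free-form thought with special tags such as <tool_call>, and the runtime extracts and executes the tagged segments. The exact tag format and field names vary between frameworks, so the snippet below is only an assumption for illustration.

```python
# Toy sketch of the single-model "unified output" convention: thought, tool
# calls, and responses all live in one generation, delimited by special tags.
# The tag/JSON format here is an assumption, not any specific framework's API.

import json
import re

single_model_output = """
I should look up the population figure first.
<tool_call>{"name": "web_search", "arguments": {"query": "population of Reykjavik 2024"}}</tool_call>
The search result suggests roughly 140,000 residents, so I can answer now.
"""

# Pull every tagged tool call out of the free-form generation.
calls = [
    json.loads(m)
    for m in re.findall(r"<tool_call>(.*?)</tool_call>", single_model_output, re.DOTALL)
]
print(calls)  # [{'name': 'web_search', 'arguments': {'query': 'population of Reykjavik 2024'}}]
```

When the chain grows to dozens of such interleaved steps, a single end-of-episode reward gives the model little signal about which individual tool call helped or hurt, which is exactly the credit assignment problem described above.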
While existing intelligent agent systems (such as LangGraph, OWL, Pydantic, and AutoGen) have achieved modularity, most of them rely on fixed prompt engineering and lack mechanisms for learning from experience.
AgentFlow enables real-time interaction across multiple modules, learning within the "flow".
AgentFlow's design philosophy is to decompose complex reasoning tasks into specialized agent modules, while allowing the core decision-making module to continuously learn through interaction.
Four-module collaborative architecture
The system consists of four specialized intelligent agents with memory capabilities:
- Planner: analyzes task requirements, formulates execution strategies, and selects the most suitable tools. This is the system's core decision-making module and the only part that requires training.
- Executor: actually calls the chosen tool's API and integrates the results the tool returns.
- Verifier: evaluates, against the system's accumulated memory, whether intermediate results meet the task objectives and constraints.
- Generator: integrates all information and verification feedback to produce a final answer or a suggestion for the next action.
The key innovation lies in the fact that the planner is not static, but is optimized in real time during the inference flow through on-policy reinforcement learning.
After each round of interaction, the system updates the planner's decision-making strategy based on the success or failure of the final result, and integrates the optimized results into the system's memory, forming a closed-loop adaptive learning process.
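A minimal sketch of this four-module loop is shown below. The module interfaces, names, and stubbed behaviors are assumptions for illustration, not the authors' actual implementation; in the real system each module is backed by an LLM and only the planner's policy is updated by reinforcement learning.

```python
# Minimal sketch of a four-module AgentFlow-style loop (illustrative only).
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Shared memory that accumulates the interaction history."""
    records: list = field(default_factory=list)

    def add(self, entry: dict):
        self.records.append(entry)

def planner(task: str, memory: Memory) -> dict:
    # The only trainable module: picks a sub-goal and a tool.
    # A fixed plan stands in for an LLM call here.
    return {"subgoal": f"answer: {task}", "tool": "web_search"}

def executor(plan: dict) -> str:
    # Calls the chosen tool's API and returns its raw result (stubbed).
    return f"[result of {plan['tool']} for '{plan['subgoal']}']"

def verifier(result: str, task: str, memory: Memory) -> bool:
    # Checks the intermediate result against the task objective (stubbed).
    return "result" in result

def generator(task: str, memory: Memory) -> str:
    # Integrates everything into a final answer or a next-step suggestion.
    return f"final answer for '{task}' based on {len(memory.records)} steps"

def run_agentflow(task: str, max_steps: int = 3) -> str:
    memory = Memory()
    for _ in range(max_steps):
        plan = planner(task, memory)
        result = executor(plan)
        ok = verifier(result, task, memory)
        memory.add({"plan": plan, "result": result, "verified": ok})
        if ok:
            break
    return generator(task, memory)

print(run_agentflow("Who wrote 'The Selfish Gene'?"))
```

In the trained system, the success or failure of the final answer produced at the end of this loop is what feeds back into the planner's policy update.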
The Flow-GRPO algorithm solves the credit assignment problem
The team proposed the Flow-GRPO (Flow-based Group Relative Policy Optimization) algorithm, specifically designed for multi-round inference scenarios. The core idea is to broadcast the final reward signal (success/failure) of the trajectory to each action, transforming the complex multi-round reinforcement learning problem into a series of single-round policy updates.
The specific steps are as follows:
1. Collect the complete reasoning trajectory (from the initial task to the final result);
2. Calculate the outcome reward based on the final result;
3. Assign this reward to each planned action in the trajectory;
4. Calculate the advantage of each action using the relative advantage function and update the policy gradient.
This method effectively alleviates the reward sparsity problem while maintaining training stability.
Online learning enables the system to: quickly correct erroneous tool calls, explore better subtask decomposition methods, and dynamically adjust inference depth based on environmental feedback.
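The numbered steps above can be condensed into a short sketch. The snippet below illustrates only the reward-broadcasting and group-relative-advantage idea; the function name, loss form, and normalization details are simplified assumptions, not the paper's code.

```python
# Illustrative sketch of the Flow-GRPO idea: a trajectory's final outcome
# reward is broadcast to every planner action in it, and advantages are
# computed relative to a group of trajectories sampled for the same task.
import numpy as np

def flow_grpo_advantages(group_rewards: list[float], steps_per_traj: list[int]):
    """Turn one scalar outcome reward per trajectory into per-action advantages.

    group_rewards:  final reward (e.g. 1.0 success / 0.0 failure) for each of
                    the G trajectories sampled for the same task.
    steps_per_traj: number of planner actions in each trajectory.
    """
    rewards = np.asarray(group_rewards, dtype=np.float64)
    # Group-relative baseline: compare each trajectory to the group mean.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Broadcast each trajectory-level advantage to every action in it,
    # reducing multi-turn RL to a series of single-turn policy updates.
    return [np.full(n, a) for a, n in zip(advantages, steps_per_traj)]

# Example: 4 trajectories for one task, two of which succeed.
per_action_adv = flow_grpo_advantages([1.0, 0.0, 1.0, 0.0], [3, 5, 2, 4])
for i, adv in enumerate(per_action_adv):
    print(f"trajectory {i}: per-action advantage {adv[0]:+.2f} over {len(adv)} actions")
# Each per-action advantage would then weight a clipped, PPO-style policy
# gradient update of the planner, as in standard GRPO.
```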
Experimental Results: The Small Model's Comeback
The research team conducted systematic evaluations on 10 cross-domain benchmarks, covering four major categories: knowledge retrieval, agent tasks, mathematical reasoning, and scientific reasoning.
Performance Comparison
Using Qwen-2.5-7B-Instruct as the base model, AgentFlow significantly outperforms the baselines in all four categories:
- Knowledge retrieval: +14.9% over the baseline
- Agentic reasoning: +14.0%
- Mathematical reasoning: +14.5%
- Scientific reasoning: +4.1%
Even more surprising are the results of cross-scale comparisons:
- AgentFlow 7B outperforms GPT-4o (approximately 200B parameters) by 8.2% on search tasks.
- It outperforms Llama-3.1-405B by 15.8% on agentic tasks.
- Even the 3B version of AgentFlow outperforms the 405B baseline on multiple tasks.
Key findings from ablation experiments
1. Online learning vs. offline learning
Comparative experiments show that training the planner with traditional supervised fine-tuning (SFT) instead decreases average performance by 19%. This demonstrates that online learning in a real interactive environment is a necessary condition for efficient reasoning.
2. Autonomous exploration of new strategies
The trained planner chooses tool combinations suited to each task; it also spontaneously explores new tool-usage patterns, such as chaining Wikipedia search with web search to mine deeper information, patterns that rarely appear in the untrained inference flow.
3. Dynamic reasoning depth
On reasoning-intensive tasks such as multi-hop search, the trained AgentFlow exhibits "intelligent laziness": it keeps the number of reasoning steps low for simple tasks and increases reasoning depth only for complex ones.
As the maximum number of steps is increased, performance steadily improves, but the average number of steps does not increase proportionally.
4. The Value of Module Collaboration
While the inference flow itself can improve performance, untrained systems are prone to looping on errors or getting stuck.
After reinforcement learning training, the system shows significant improvements in tool-invocation accuracy, subtask-planning quality, and overall performance. The authors provide an example that vividly illustrates this finding.
In this example, before Flow-GRPO training, the system repeatedly outputs the same sub-goals and tool calls whenever it hits a Python variable-definition error like the one shown, wasting both time and inference budget.
After the Flow-GRPO online update, the planner automatically adjusts, using the previous errors to guide subsequent steps with more precise sub-goals and task descriptions, and after this adaptation it succeeds in a single step.
This example also demonstrates the immense potential of reinforcement learning in real-world reasoning within intelligent agent systems.
Technological significance and future prospects
The value of AgentFlow lies in:
1. It provides a new training paradigm, demonstrating that agent systems can acquire learning capabilities similar to those of large models through online reinforcement learning, while being more efficient on specific tasks.
2. It verifies the feasibility of "small but mighty": with sound system design, small models can outperform large general-purpose models on complex reasoning tasks through modular collaboration and continuous learning.
3. It offers a path toward scalable AI: the modular architecture lets the system flexibly add new tools and adjust module functions.
AgentFlow shows, at the very least, that the development of agentic AI does not have to rely entirely on scaling up model size; innovative system architecture combined with efficient training methods may be a more worthwhile direction to explore.
References:
https://arxiv.org/abs/2510.05592
This article is from the WeChat official account "New Intelligence" , edited by LRST, and published with authorization from 36Kr.