This Anthropic article is worth reading.
It details the engineering challenges of building multiple agents that explore complex topics more efficiently, including agent coordination, evaluation, and reliability.
The main points are summarized below.

Anthropic
@AnthropicAI
06-14
New on the Anthropic Engineering blog: how we built Claude’s research capabilities using multiple agents working in parallel.
We share what worked, what didn't, and the engineering challenges along the way.
https://anthropic.com/engineering/built-multi-agent-research-system…

Multi-agent systems improve performance through:
Parallel operation and information compression: Sub-agents run in parallel in their own context windows, exploring different aspects of a problem simultaneously and then distilling the most important information back to the lead research agent (see the sketch after this list).
Separation of concerns: Each sub-agent has its own tools, prompts, and exploration trajectory, which reduces path dependence and enables thorough, independent investigation.
Scaling performance: Once intelligence reaches a certain threshold, multi-agent systems become an important way to scale performance, just as human society has scaled through collective intelligence and coordination.
Strong breadth-first query capabilities: Internal evaluations show that multi-agent research systems excel at breadth-first queries that pursue multiple independent directions simultaneously. For example, when asked to identify all the board members of the companies in the Information Technology S&P 500, the multi-agent system found the correct answer by decomposing the task into subtasks for its sub-agents, while the single-agent system failed to find the answer with its slow, sequential search.
Effective token usage: Multi-agent systems can spend enough tokens to solve the problem. Analysis shows that token usage alone explains 80% of the performance variance in the BrowseComp evaluation, with the number of tool calls and the choice of model as two additional explanatory factors. The multi-agent architecture scales token usage effectively by distributing work across agents with independent context windows, increasing capacity for parallel reasoning.
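A minimal sketch of the parallel-operation idea, using the Anthropic Python SDK with asyncio: each sub-agent gets its own conversation (and so its own context window), runs concurrently, and returns only a compressed summary. The model id, prompts, and word limit are illustrative assumptions, not Anthropic's production code.

```python
# Sketch: parallel sub-agents with independent context windows, each
# compressing its findings before reporting back to the lead agent.
import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-20250514"  # assumed model id

async def run_subagent(subtask: str) -> str:
    """One sub-agent: a fresh conversation, so its tokens never crowd the lead's context."""
    msg = await client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Research the following and reply with a summary "
                       "(under 500 words) of only the most important findings:\n"
                       + subtask,
        }],
    )
    return msg.content[0].text

async def fan_out(subtasks: list[str]) -> list[str]:
    # Sub-agents explore different aspects of the problem simultaneously.
    return await asyncio.gather(*(run_subagent(t) for t in subtasks))

# findings = asyncio.run(fan_out(["aspect A", "aspect B", "aspect C"]))
```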
Agent systems also have drawbacks:
They burn through tokens quickly. In Anthropic's data, agents typically use about 4 times more tokens than chat interactions, and multi-agent systems use about 15 times more tokens than chat.
Multi-agent systems are therefore economically viable only for tasks valuable enough to pay for the increased performance.
In addition, domains that require all agents to share the same context, or that involve many dependencies between agents, are currently poor fits for multi-agent systems; most coding tasks fall into this category.
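To make the economics concrete, here is a back-of-the-envelope calculation using the multipliers above; the baseline tokens-per-chat figure and the price are illustrative assumptions, not Anthropic's numbers.

```python
# Illustrative cost arithmetic from the ~4x / ~15x multipliers above.
chat_tokens = 2_000    # assumed tokens in a typical chat interaction
price_per_mtok = 3.00  # assumed blended price in $ per million tokens

for label, multiplier in [("chat", 1), ("single agent", 4), ("multi-agent", 15)]:
    tokens = chat_tokens * multiplier
    print(f"{label:>12}: ~{tokens:,} tokens, about ${tokens * price_per_mtok / 1e6:.4f}")
# A multi-agent run must deliver roughly 15x the value of a chat turn
# before its extra token spend pays for itself.
```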
Architecture Overview: Anthropic's research system uses a multi-agent architecture built on the orchestrator-worker pattern.
One lead agent coordinates the entire process while delegating tasks to specialized sub-agents that operate in parallel.
The workflow is as follows (a schematic sketch follows the steps):
1. When a user submits a query, the lead agent (LeadResearcher) analyzes it, develops a strategy, and spawns sub-agents to explore different aspects simultaneously.
2. LeadResearcher first thinks through its approach and saves the plan to Memory to persist context, since the context window will be truncated once it exceeds 200,000 tokens.
3. It then creates specialized sub-agents (Subagents) and assigns each a specific task.
4. Each sub-agent independently performs web searches, uses interleaved thinking to evaluate tool results, and returns its findings to LeadResearcher.
5. LeadResearcher synthesizes these results and decides whether more research is needed; if so, it can create additional sub-agents or adjust its strategy.
6. Once enough information has been gathered, the system exits the research loop and hands all findings to a CitationAgent, which processes the documents and the research report to identify where citations belong, ensuring every claim is attributed to its source.
7. The final research results, with citations, are returned to the user.
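A schematic, runnable sketch of the loop in steps 1-7. Every helper below is a trivial stand-in for a model call or tool; names such as lead_delegate and citation_agent are hypothetical, not Anthropic's actual components.

```python
# Orchestrator-worker loop, compressed to its control flow.
memory: dict[str, str] = {}  # step 2: external memory that survives context truncation

def lead_plan(query: str) -> str:
    return f"plan for: {query}"          # steps 1-2: analyze the query, develop a strategy

def lead_delegate(plan: str, findings: list[str]) -> list[str]:
    # Step 3: decompose the plan into subtasks (stub: three aspects, once).
    return [] if findings else [f"{plan} / aspect {i}" for i in range(3)]

def run_subagent(subtask: str) -> str:
    return f"findings on {subtask}"      # step 4: web search + interleaved thinking

def research_done(findings: list[str]) -> bool:
    return len(findings) >= 3            # step 5: decide whether more research is needed

def citation_agent(report: str) -> str:
    return report + "\n[citations attached]"  # step 6: attribute claims to sources

def research(query: str) -> str:
    plan = lead_plan(query)
    memory["plan"] = plan                # step 2: persist the plan
    findings: list[str] = []
    while True:
        for task in lead_delegate(plan, findings):
            findings.append(run_subagent(task))
        if research_done(findings):      # step 6: exit the research loop
            break
    return citation_agent("report: " + "; ".join(findings))  # steps 6-7

print(research("example query"))
```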
Unlike traditional retrieval-augmented generation (RAG) approaches, Anthropic's architecture uses multi-step search that dynamically finds relevant information, adapts to new discoveries, and analyzes results to produce high-quality answers.
Prompt Engineering and Evaluation: Multi-agent systems differ from single-agent systems in key ways, including the rapid growth of coordination complexity.
Prompt engineering is Anthropic’s primary means of improving agent behavior.
The prompting principles they learned include:
1. Think like your agent: To understand what a prompt does, simulate the agent and watch it work step by step to uncover failure modes.
2. Teach the orchestrator how to delegate: The lead agent must decompose queries into subtasks and describe them to the sub-agents. Each sub-agent needs a clear objective, an output format, guidance on tools and sources, and clear task boundaries to avoid duplicated work or gaps (see the task-schema sketch after this list).
3. Calibrate effort to query complexity: Embed scaling rules in prompts to help the lead agent allocate resources efficiently and avoid over-investing in simple queries. A simple query may need only 1 agent and 3-10 tool calls, while complex research may need more than 10 sub-agents.
4. Tool design and selection are critical: The agent-tool interface matters as much as the human-computer interface. Give each tool a distinct purpose and a clear description, and give the agent explicit heuristics (e.g., prefer specialized tools over general-purpose ones).
5. Let agents improve themselves: Claude 4 models make excellent prompt engineers. Given a prompt and its failure modes, they can diagnose the cause and suggest improvements. Anthropic even built a tool-testing agent that exercises a flawed tool and rewrites its description to avoid future failures.
6. Start wide, then narrow: Search strategy should mirror expert human research: survey the landscape first, then drill into the details. Prompt agents to begin with short, broad queries, evaluate what is available, and progressively narrow their focus.
7. Guide the thinking process: Extended thinking mode acts as a controllable scratchpad, letting Claude output extra tokens to plan, assess tool suitability, determine query complexity and the number of sub-agents, and define each sub-agent's role.
8. Parallel tool calling improves speed and performance: Having the lead agent spin up sub-agents in parallel, and having sub-agents call multiple tools in parallel, cut research time for complex queries by up to 90%.
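One way the delegation and scaling advice in points 2-3 could be encoded, as a minimal sketch: the SubagentTask fields mirror "clear objective, output format, tool guidance, task boundaries", and the thresholds in plan_effort are the illustrative numbers from point 3. All names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SubagentTask:
    """What the lead agent hands each sub-agent (point 2)."""
    objective: str       # a clear goal, not a vague instruction
    output_format: str   # e.g. "bullet list with source URLs"
    tool_guidance: str   # which tools and sources to prefer
    boundaries: str      # what NOT to research, to avoid duplicated work
    max_tool_calls: int = 10

def plan_effort(complexity: str) -> tuple[int, int]:
    """Scaling rule from point 3: (number of sub-agents, tool calls each)."""
    return {
        "simple": (1, 5),       # fact-finding: 1 agent, 3-10 tool calls
        "comparison": (3, 10),  # split a comparison across a few agents
        "complex": (10, 15),    # open-ended research: 10+ sub-agents
    }[complexity]
```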
Effective Evaluation of Agents: Evaluating multi-agent systems poses unique challenges because agents can take completely different yet valid paths to the same goal, even from the same starting point.
Evaluations therefore need the flexibility to judge both whether the agent reached the right outcome and whether its process was sound.
Key evaluation methods include:
Start small-sample evaluations immediately: Early in development, even a few test cases are revealing, because effect sizes tend to be large.
Use LLMs as judges: Research outputs are free-form text with no single correct answer, which makes LLMs well suited as graders. Anthropic uses an LLM judge to score outputs against criteria such as factual accuracy, citation accuracy, completeness, source quality, and tool efficiency (a minimal judge sketch follows this list).
Human evaluation uncovers what automation misses: Human testers find edge cases that automated evals skip, such as hallucinated answers on unusual queries, system glitches, or subtle source-selection biases.
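A minimal LLM-as-judge sketch against the criteria above, using the Anthropic Messages API; the rubric wording, model id, and JSON output convention are assumptions, not Anthropic's actual judge.

```python
import json
from anthropic import Anthropic

client = Anthropic()
RUBRIC = ["factual accuracy", "citation accuracy", "completeness",
          "source quality", "tool efficiency"]

def judge(query: str, report: str) -> dict:
    """Score a free-form research report against a fixed rubric."""
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                f"Grade this research report against the query.\n"
                f"Query: {query}\nReport: {report}\n"
                f"Score each of {RUBRIC} from 0 to 1 and reply with JSON only: "
                '{"scores": {...}, "pass": true}'
            ),
        }],
    )
    return json.loads(msg.content[0].text)  # assumes the model returns bare JSON
```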
Multi-agent systems exhibit emergent behavior that is never explicitly programmed.
Understanding interaction patterns is critical, and the best prompts are not rigid instructions but collaborative frameworks that define the division of labor, problem-solving approaches, and effort budgets.
Production Reliability and Engineering Challenges: Moving agent systems from prototype to reliable production presents significant engineering challenges because errors compound in agentic systems. Key challenges include:
Agents are stateful and errors compound: Agents run for long periods and maintain state across many tool calls, so minor system failures can have catastrophic effects. Anthropic built systems that resume from the point where an error occurs and lean on the model's intelligence to handle issues gracefully, for example by telling the agent when a tool fails and letting it adapt (see the sketch after this list).
Debugging requires a new approach: Agents make dynamic decisions and behave non-deterministically across runs, even with the same prompt, which makes debugging harder. Adding full production tracing let Anthropic diagnose why agents failed and fix issues systematically.
Deployments require careful coordination: Agent systems are highly stateful webs of prompts, tools, and execution logic that run almost continuously. Anthropic uses rainbow deployments, gradually shifting traffic from the old version to the new one while keeping both running, to avoid disrupting agents mid-flight.
Bottlenecks from synchronous execution: Anthropic's lead agent currently executes sub-agents synchronously, waiting for each batch of sub-agents to finish before continuing. This simplifies coordination but bottlenecks the flow of information between agents: the lead agent cannot steer sub-agents mid-task, and a single slow sub-agent can block the whole system. Asynchronous execution would unlock more parallelism, but at the cost of harder result coordination, state consistency, and error propagation.
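A sketch of the "tell the agent when a tool fails" pattern: instead of letting an exception kill a long-running, stateful agent, the error is returned as a tool result so the model can retry, switch tools, or adapt its strategy. The function shape is hypothetical.

```python
from typing import Any, Callable

def execute_tool(name: str, args: dict, tools: dict[str, Callable[..., Any]]) -> dict:
    """Run a tool and always return a result the model can read."""
    try:
        return {"tool": name, "ok": True, "result": tools[name](**args)}
    except Exception as exc:  # a minor failure must not kill a stateful agent
        return {
            "tool": name,
            "ok": False,
            # Surface the failure to the model so it can adapt its strategy.
            "error": f"{type(exc).__name__}: {exc}",
        }
```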