Scaling law hitting a wall? GPT-4o as a world model lets agents plan ahead, with a Chinese first author from OSU


The scaling law has hit a wall, and scaling up inference-time compute for language agents is genuinely hard! Is the solution to use an LLM as the world model? A research team from OSU found that using GPT-4o as a world model to support planning in complex environments holds great promise.

Can the Scaling Law be revived?

How can inference-time compute be scaled up through lookahead planning by language agents?

The answer: use an LLM as the world model.

That is, using GPT-4o to predict the outcomes of actions on websites delivers strong performance while also improving safety and efficiency.

Recently, researchers from institutions including The Ohio State University proposed WebDreamer, a brand-new framework that uses an LLM as a world model to predict the outcomes of interactions on websites.

Paper link: https://arxiv.org/abs/2411.06559

A few days ago, at the Microsoft Ignite conference, Satya Nadella said that AI development has not hit a ceiling, and that we are witnessing the rise of a scaling law for inference-time compute.

Yes, this new research is a step in that direction.

01 The key difference between language agents and mathematical reasoning is interaction

The first author, Yu Gu, says this problem has bothered him ever since the release of GPT:

Why is it so hard to scale up inference-time compute for language agents? What makes language agents special?

To this end, he broke down the problem.

Unlike tasks such as mathematical reasoning, a key characteristic of language agents is interaction: each action they take triggers new observations from the environment, which inform their next decision.

This interactivity makes exploring the search space complicated, because:

1. Interaction with the environment is expensive

2. Many operations change state and are irreversible (such as confirming a purchase on a shopping site), making backtracking during tree search infeasible on real-world websites.

Can we use LLM as a world model to predict the results of interactions on websites? (e.g. "What will happen if I click this button?")

This way, we can achieve effective search space exploration and reduce the cost of actual interactions.

The answer is yes!

Yu Gu and his team found that GPT-4o effectively encodes a wide range of knowledge about websites and serves as the foundation for the model-based planning framework WebDreamer.

Equipped with the LLM-simulated world model, WebDreamer has demonstrated strong effectiveness and efficiency.

First, it has excellent performance: it far outperforms reactive baselines on VisualWebArena and Mind2Web-live.

In terms of efficiency, it requires only half the number of interactions compared to tree search.

Additionally, the LLM-based world model simulation provides two extra advantages.

One is better safety: it reduces risk by minimizing real-world interactions.

The other is versatile integration: it works seamlessly as a plug-in for various agents and complements tree-search agents.

02 The core of WebDreamer is "dreaming"

Do intelligent agents need to dream too?

Unlike tasks such as mathematical reasoning, a key characteristic of language agents is interaction: each action they take changes the environment, which in turn complicates their subsequent decision-making.

Constant interaction makes exploring the search space exceptionally difficult: the computational cost of interacting with the environment is high; many state-changing operations are irreversible; and letting agents interact with real websites carries safety risks, such as information leakage and accidental financial losses.

How to explore the search space effectively while reducing the cost of real interactions and keeping agents safe and reliable has become an urgent problem.

In short, the core of WebDreamer is the concept of "dreaming": before committing to any action, the agent uses LLM to imagine and predict the results of each possible step, and describes the changes in natural language.

Then it evaluates these simulated outcomes by how close they bring the agent to the task goal, and executes the simulated action most likely to achieve it. This process repeats until the LLM determines that the goal has been reached.

Figure 1 illustrates the different strategies of a web agent, represented as a search problem in which each node is a web page.

For clarity, only the results of one-step simulation are described. Faded nodes represent unvisited pages, and green checkmarks and red crosses indicate successful and unsuccessful results, respectively.

Figure 1(a) Reactive: the agent always chooses the locally optimal action without lookahead planning, which often leads to suboptimal results.

Figure 1(b) Tree search with real interactions: the agent actively explores multiple paths by navigating the website and is allowed to backtrack (dashed arrows). On real-world websites, however, backtracking is often infeasible because irreversible operations are so common.

Figure 1(c) Model-based planning: Before actual execution, the agent simulates potential results (as shown by the cloud-like nodes) to determine the best action, thereby maintaining effectiveness while minimizing actual website interactions.

In summary, with the support of the LLM-simulated world model, WebDreamer has demonstrated excellent performance, efficiency, and strong extensibility:

Performance: It significantly outperforms reactive baseline models on VisualWebArena and Mind2Web-live.

Efficiency: Compared to tree search, it requires only half the number of interactions.

Safety: by reducing real-world interactions, it effectively lowers safety risks.

Integration: It can seamlessly run as a plugin for various agents and complement the functionality of tree search agents.

03 Preparation

Task Formulation

For the target task of real-time automated interaction on websites, web agents face a vast and complex search space.

Formally, each task with an instruction I can be viewed as a partially observable Markov decision process (POMDP): (S, A, O, T, R, Ω).


In the environment, S represents the set of all possible states, A represents the set of all possible actions that the agent can take, O represents the set of all possible observations, T: S × A → S represents the state transition function, R is a binary reward that indicates whether the task I has been completed, and Ω: S → O is a deterministic function that projects the state to an observation.

The goal of the task is to execute a series of actions to obtain a reward of 1.

In real scenarios, the web environment is complex, involving server-side variables, dynamically loaded content, and hidden UI elements, and it is affected by network conditions and browser limitations, so the agent can perceive the environment only through a limited view (i.e., o ∈ O).

This limited observation also determines the corresponding action space A, which contains the interactive operations executable in o, such as clicking, text input, and URL navigation.

Table 1: The web navigation action space defined in VisualWebArena
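Since the table itself is not reproduced here, the following is a minimal sketch of what such an action space can look like in code. It is loosely modeled on the kinds of actions VisualWebArena defines (click, type, scroll, URL navigation, stop); the exact class names and fields are illustrative assumptions, not the benchmark's actual API:

```python
from dataclasses import dataclass
from enum import Enum

class ActionType(Enum):
    """Representative web navigation actions (illustrative subset)."""
    CLICK = "click"        # click a page element
    TYPE = "type"          # type text into an input field
    HOVER = "hover"        # hover over an element
    SCROLL = "scroll"      # scroll the page up or down
    GOTO_URL = "goto_url"  # navigate directly to a URL
    GO_BACK = "go_back"    # browser back
    STOP = "stop"          # end the episode, optionally with an answer

@dataclass
class Action:
    type: ActionType
    target: str | None = None  # element id, URL, or scroll direction
    text: str | None = None    # text to type, or the final answer for STOP

# Example: the candidate action "Type 'Disk' in the text box" seen later in Figure 2
action = Action(ActionType.TYPE, target="search_box", text="Disk")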

Planning through Simulation

Planning the optimal action sequence by running tree search through real interactions governed by the state transition function T is costly and carries irreversible risks. Model-based planning addresses these challenges by using a computational representation of the environment to simulate interaction outcomes.

A prominent method is Model Predictive Control (MPC), which iteratively simulates future trajectories to select actions.

For each state s, MPC uses a simulation function sim(s, a) to roll out a trajectory for each possible action a ∈ A over a finite horizon H, and evaluates each trajectory with a scoring function score(τ). It then executes the action corresponding to the most promising trajectory:

a* = argmax_{a ∈ A} score(sim(s, a))

This process repeats each time a new state is observed, allowing the agent to adjust its plan based on actual outcomes while avoiding costly real-world exploration. In practice, because of partial observability, the true state is not accessible, so the observation o = Ω(s) is used and sim(o, a) is computed instead.
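To make the control flow concrete, here is a minimal sketch of this MPC-style planning step in Python. The function names and signatures (mpc_step, sim, score) are assumptions for illustration, not the authors' actual code; sim and score stand in for the LLM-backed components described in the next section:

```python
def mpc_step(observation, goal, candidates, sim, score, horizon=2):
    """One MPC planning step: simulate a trajectory for each candidate
    action, score the imagined trajectories, and return the action
    whose trajectory looks most promising."""
    best_action, best_score = None, float("-inf")
    for action in candidates:
        trajectory = sim(observation, action, horizon)  # imagined rollout, no real clicks
        s = score(trajectory, goal)                     # estimated progress toward the goal
        if s > best_score:
            best_action, best_score = action, s
    return best_action
```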

04 Model-Based Planning for Web Agents

The authors use LLMs as the world model and propose a pioneering method, WebDreamer, to achieve efficient planning in complex web environments.

This method is inspired by the observation that although web interfaces are complex, their design is predictable for human users.

When browsing websites, humans can effectively predict the consequences of actions based on visual cues and common design patterns - clicking the "Submit" button will submit the form, and selecting a product image will navigate to its detail page.

Given that LLMs are trained on a large amount of web-related data, the authors hypothesize that they have acquired sufficient knowledge to simulate the consequences of user actions, making them suitable as world models for effective planning.

Core Design

The core of WebDreamer is to use LLMs to implement the simulation function sim and the scoring function score.

The figure below shows WebDreamer using an LLM to simulate the outcomes of three candidate actions: it simulates a two-step trajectory for each action, selects the trajectory with the highest score, and executes the corresponding initial action.

The figure shows the trajectories the LLM simulates for three candidate actions, each described in natural language:

(1) Click "Office Products"

(2) Click "Electronics"

(3) Type "Disk" in the text box

Through these simulations, the trajectories are scored to determine the most likely successful action. In this case, the LLM selects clicking "Electronics" as the best step and executes it. Each dashed box represents the state description generated by the LLM after each simulated operation.

Implementing sim

The simulation function sim is implemented with two modules: one predicts the state change after an action is executed, approximating the state transition function T; the other imagines possible next actions based on the predicted state.

These two modules jointly generate a trajectory of length H, where H is a configurable simulation depth parameter.

Specifically, to represent the state changes, the researchers prompt the LLM to generate a concise natural language description focusing only on the effects of the action.

For example, in Figure 2, when prompted to predict the effect of clicking "Electronics", the LLM outputs a short natural-language description of the resulting page.

Based on this predicted state, the LLM will then imagine the next action (e.g., clicking "Computers & Accessories"), which will lead to another state change and further prediction.

This process generates a trajectory with a simulation depth of H=2.
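A minimal sketch of how such a two-module sim function might be implemented, assuming a generic llm(prompt) helper that returns a text completion and treating states and actions as natural-language strings. The prompt wording is illustrative, not the paper's actual prompt:

```python
def sim(observation: str, action: str, horizon: int, llm) -> list[str]:
    """Imagine a trajectory of `horizon` steps entirely in natural language,
    alternating state-change prediction (module 1) and next-action
    proposal (module 2)."""
    trajectory = [f"Action: {action}"]
    state = observation
    for step in range(horizon):
        # Module 1: predict the effect of the action on the current state.
        state = llm(
            f"Current page: {state}\n"
            f"Action taken: {action}\n"
            "Describe concisely, in natural language, only how the page changes."
        )
        trajectory.append(f"Predicted state: {state}")
        if step < horizon - 1:
            # Module 2: imagine a plausible next action on the predicted page.
            action = llm(
                f"Predicted page: {state}\n"
                "Propose the single most promising next action."
            )
            trajectory.append(f"Action: {action}")
    return trajectory
```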

Implementing score

After using sim to simulate a trajectory τi for each candidate action ai, the researchers further use the LLM as the scoring function for each simulated trajectory.

They prompt the LLM to evaluate each simulated trajectory based on three scoring criteria - completed (1.0), in progress (0.5), or incorrect (0) - to indicate the progress of task completion.

The final score is calculated by averaging multiple sampled evaluations (a sketch follows below).
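Under the same assumptions, the scoring function might look like the sketch below: the three-way rubric and the averaging over multiple samples follow the description above, while the llm helper and the prompt wording remain illustrative:

```python
RUBRIC = {"completed": 1.0, "in progress": 0.5, "incorrect": 0.0}

def score(trajectory: list[str], goal: str, llm, n_samples: int = 5) -> float:
    """Sample several LLM judgments of the imagined trajectory's progress
    toward the goal, map them to numeric scores, and average them."""
    prompt = (
        f"Task: {goal}\n"
        "Imagined trajectory:\n" + "\n".join(trajectory) + "\n"
        "Answer with exactly one of: completed / in progress / incorrect."
    )
    samples = [llm(prompt).strip().lower() for _ in range(n_samples)]
    return sum(RUBRIC.get(s, 0.0) for s in samples) / n_samples  # unparseable -> 0.0
```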

In addition to sim and score, a prerequisite for planning is the generation of candidate actions. The researchers adopted a two-stage approach: first sample the top-k actions, then use the LLM to self-refine the candidate list, removing actions that do not need to be simulated.

The motivation for this self-refinement step is the observation that a fixed k introduces varying amounts of irrelevant actions at different steps: some steps can be handled with fewer effective actions than others.

Algorithm 1 presents the pseudocode for the overall design of WebDreamer (a sketch of this loop follows below). The termination check verifies whether the model has output a stop action; the algorithm also stops when it reaches the maximum number of steps or repeats the same action three times in a row.
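Putting the pieces together, here is a hedged sketch of the overall loop in the spirit of Algorithm 1, reusing the Action types and mpc_step sketched earlier. The env_reset, env_step, and propose_actions callables are hypothetical stand-ins for the browser environment and the two-stage candidate generation; sim and score are assumed to have the LLM client already bound, e.g. via functools.partial:

```python
def webdreamer(goal, env_reset, env_step, propose_actions,
               sim, score, max_steps=30, horizon=2):
    """Overall loop: plan by simulation at each step, execute the chosen
    action for real, and stop on a stop action, a step budget, or three
    identical consecutive actions."""
    observation = env_reset()
    history = []
    for _ in range(max_steps):
        candidates = propose_actions(observation, goal)  # top-k sampling + self-refinement
        action = mpc_step(observation, goal, candidates, sim, score, horizon)
        if action is None:
            break  # no viable candidate survived refinement
        if action.type is ActionType.STOP:
            return action.text  # the model decided the task is done
        history.append(action)
        if len(history) >= 3 and history[-1] == history[-2] == history[-3]:
            break  # stuck: same action three times in a row
        observation = env_step(action)  # one real interaction per planned step
    return None
```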

The complete system prompts are provided in the paper.

05 Experimental Results

Effectiveness

As shown in Table 2, WebDreamer demonstrates significant improvements over the reactive agent on the VWA and Mind2Web-live datasets:

On the VWA dataset, it achieved a 33.3% relative performance improvement.

On the Mind2Web-live dataset, it improved by 2.9 percentage points over the reactive paradigm (a 13.1% relative gain).

Although the tree-search-based solution still achieves a higher overall success rate, it is not actually applicable to real-world web scenarios. WebDreamer, by contrast, provides a more flexible and adaptive alternative.

Table 2: Results on VisualWebArena and Mind2Web-live

Furthermore, the researchers compared WebDreamer with the reactive paradigm along multiple dimensions on the VWA dataset.

Table 3 shows that the model-based planning method consistently outperforms the reactive method across all websites and task difficulty levels.

On tasks officially annotated as medium difficulty in VWA, model-based planning even surpassed the tree-search solution (24.1% vs. 22.2%).

A metric is defined to measure the relative performance of the model-based planning and tree-search solutions.

Table 3: Success rate corresponding to different dimensions

Efficiency

Another key advantage of model-based planning is its high efficiency relative to tree search when executing tasks.

As shown in Table 4, tree search requires about three times as many action steps as the baseline in all environments, while WebDreamer keeps its action steps close to the baseline.

It is worth noting that, because of the extra actions and backtracking, tree search introduces roughly a tenfold increase in wall-clock time, whereas WebDreamer's simulation overhead is small and can be further reduced through parallelization.

Table 4: Action steps and total time consumption on VWA

Case Study

To illustrate the role of simulation in planning, the researchers presented case studies with one positive and one negative example, showing how simulation helps the agent explore the environment and how inaccurate simulation leads to wrong predictions.

A failure case caused by inaccurate world-model simulation: the researchers instructed the agent, "Find me a printer of the same brand as the product in the picture. It must be white and have at least 11 reviews with an average rating greater than 4."

In the positive case, which benefited from the world-model simulation, the agent correctly found two shirts with birds printed on the front.

06 Author Introduction

Yu Gu

Yu Gu is a doctoral student at The Ohio State University, having previously obtained bachelor's and master's degrees in computer science from Nanjing University.

Boyuan Zheng

Boyuan Zheng is currently a first-year doctoral student at The Ohio State University, supervised by Professor Yu Su.

Prior to this, he obtained a bachelor's degree in software engineering from Northeastern University and a master's degree in computer science from Johns Hopkins University, where he collaborated with Professor Benjamin Van Durme.

His main research focus is developing language-based agents that free humans from tedious tasks and assist with decision-making, especially in web environments. His other interests include multimodality, grounding, planning and reasoning, synthetic data, and agent safety.

References:

https://arxiv.org/pdf/2411.06559

This article is from the WeChat public account "New Intelligence", authored by New Intelligence, and published with authorization from 36Kr.
