When robots face complex tasks in real-world environments, how can they go beyond executing single, simple instructions to autonomously reason through the multiple steps required to reach the goal, and complete the task as well as a human would?
To this end, the answer from the American embodied-intelligence startup Physical Intelligence is to teach robots to think with System 2 cognition.
The famous American psychologist Daniel Kahneman described two modes of human problem-solving as "System 1" and "System 2". System 1 is intuitive, instinctive and automatic; System 2 is deliberate and conscious.
For example, when a person makes a new dish, they look at the recipe, prepare the ingredients, and carefully think through each step of the cooking process. This is the System 2 mode of thinking. But when someone does the same task for the hundredth time, they become so practiced that they can do it almost without thinking, completing the task nearly mechanically. This is the System 1 mode.
Yesterday, Physical Intelligence released the Hierarchical Interactive Robot (Hi Robot) system, which incorporates a vision-language-action (VLA) model such as π0 into a hierarchical reasoning process. π0 acts as the instinctive "System 1" that executes practiced skills, while a higher-level semantic vision-language model (VLM) serves as "System 2", reasoning about complex tasks and language interactions through "inner speech". This high-level System 2 policy enables the robot to decompose complex tasks into intermediate steps.
According to the introduction, the high-level policy is itself a VLM that uses the same VLM backbone as π0. After training, it can handle complex prompts, observe the scene, and decompose the task into small, easy-to-execute steps, passing each step (such as "pick up a slice of whole wheat bread") to the π0 VLA model for execution while incorporating real-time contextual feedback.
For example, if the robot is clearing the table and the user says "that's not trash", the model understands the remark, associates the referent ("that") with the object the robot is currently manipulating in the image, and infers the implied instruction (that object should not go into the trash can and should instead be placed elsewhere), then passes the corrected intermediate steps back to the π0 model for execution.
Figure | The high-level policy processes open-ended instructions and images from the base and wrist cameras, generating low-level language instructions. The low-level policy uses these instructions, the images, and the robot state to generate actions and optional language responses.
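The sketch below illustrates, in plain Python, the two-level interface described in the caption above: a System 2 policy that turns an open-ended prompt and camera images into short language commands, and a System 1 policy that executes each command. It is a minimal, hypothetical sketch of the control loop, not Physical Intelligence's implementation; `HighLevelVLM`, `LowLevelVLA`, and `run_task` are illustrative names.

```python
# A minimal, hypothetical sketch of the hierarchical loop, NOT the official
# Hi Robot implementation. HighLevelVLM stands in for the System 2 policy and
# LowLevelVLA for a VLA model such as pi0; all names here are illustrative.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Observation:
    base_image: bytes          # frame from the base camera
    wrist_image: bytes         # frame from the wrist camera
    robot_state: List[float]   # joint positions, gripper state, etc.


@dataclass
class HighLevelVLM:
    """System 2: turns an open-ended prompt plus images into small language steps."""
    context: List[str] = field(default_factory=list)

    def next_command(self, prompt: str, obs: Observation) -> str:
        # In the real system this would be a VLM forward pass over the prompt,
        # the interaction history, and both camera images. Here we only show
        # the interface: text and images in, one short atomic instruction out.
        self.context.append(prompt)
        return "pick up a slice of whole wheat bread"  # example intermediate step


class LowLevelVLA:
    """System 1: executes one atomic language instruction as motor actions."""

    def execute(self, command: str, obs: Observation) -> bool:
        # A VLA model maps (command, images, robot state) to action chunks;
        # here we just log the command and report success.
        print(f"executing: {command}")
        return True


def run_task(prompt: str, get_obs, system2: HighLevelVLM, system1: LowLevelVLA,
             max_steps: int = 20) -> None:
    """Alternate between the two levels: System 2 proposes a step, System 1 executes it."""
    for _ in range(max_steps):
        obs = get_obs()
        command = system2.next_command(prompt, obs)
        system1.execute(command, obs)


if __name__ == "__main__":
    dummy = Observation(b"", b"", [])
    run_task("make me a sandwich", lambda: dummy, HighLevelVLM(), LowLevelVLA(), max_steps=3)
```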
The related research paper is titled "Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models" and has been published on the preprint website arXiv.
Paper link: https://www.arxiv.org/abs/2502.19417
What are the advantages of hierarchical reasoning?
If the high-level Hi Robot policy and the low-level π0 model are both based on the same VLM, why does this hierarchical reasoning process offer any advantage?
Just as language models can "think" by generating additional text to work through complex problems, Hi Robot handles complex prompts and feedback better by first decomposing them into simple steps and then handing those steps to the π0 model to execute. There is also a more technical reason: the web-scale pre-training used to initialize the VLM trains the model to generate text answers to prompts and questions involving both image and text context. This means these models are already very good at answering questions like "In this image, which object should the robot grab next to clean the table?"
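As a hedged illustration (the article does not show Hi Robot's actual prompt format), the snippet below shows the kind of image-plus-text question a web-pretrained VLM is already good at answering, which is essentially the form the high-level policy's job takes. The wording and function name are assumptions.

```python
# Hypothetical prompt construction; the template below is an assumption for
# illustration, not the paper's actual prompt format.
def build_high_level_query(user_prompt: str) -> str:
    return (
        "You are controlling a two-armed robot. "
        f"The user said: '{user_prompt}'. "
        "Given the attached base- and wrist-camera images, which object should "
        "the robot manipulate next, and what short command should be sent to "
        "the low-level policy?"
    )


print(build_high_level_query("clean up the table, but leave the mug"))
```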
Hi Robot can therefore better inherit the knowledge the VLM accumulated during large-scale web pre-training. This is very similar to how you think when making that new dish: you may be drawing on things you learned from the recipe, things your friends told you, or things you picked up from cooking shows, that is, knowledge gained from other sources rather than just personal experience.
Robots learn to "talk to themselves"
The Physical Intelligence team says that by examining Hi Robot's internal "thinking" when it faces a complex prompt, they can understand how the system carries out the task the user asked for.
In this case, π0 is trained simply to clear the table, throwing all the trash into the trash can and putting all the dishes into the bin. Left to its own devices, π0 will just execute that task, much like the "autopilot" experience of completing a practiced task without conscious thought, even forgetting what you originally set out to do. Under Hi Robot's control, however, π0 can be steered by a more complex prompt: based on the user's command, Hi Robot reasons out modified instructions to feed to π0. Because these instructions are generated in natural language, they can be inspected, letting us observe how the robot "talks to itself" while executing the task.
Interpreting contextual user feedback is a similar problem: just as Hi Robot can parse complex prompts, it can also incorporate feedback in real time while executing a task.
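Below is a minimal sketch of how such feedback might be folded in, assuming (consistent with the description above) that user interjections are appended to the high-level policy's running context and the current step is re-planned. The class and method names are illustrative only, not part of Hi Robot's actual code.

```python
from typing import List, Optional, Tuple


class InteractionContext:
    """Running transcript the high-level policy could condition on when re-planning."""

    def __init__(self, task_prompt: str) -> None:
        self.turns: List[Tuple[str, str]] = [("user", task_prompt)]

    def add_feedback(self, utterance: str) -> None:
        # e.g. the user interjects "that's not trash" while the robot is mid-task
        self.turns.append(("user", utterance))

    def add_robot_step(self, command: str, response: Optional[str] = None) -> None:
        # record the command sent to the low-level policy and any spoken reply
        text = command if response is None else f"{command} ({response})"
        self.turns.append(("robot", text))


ctx = InteractionContext("clean up the table")
ctx.add_robot_step("put the wrapper in the trash")
ctx.add_feedback("that's not trash")
ctx.add_robot_step("put the wrapper back on the table", "got it, I'll leave it out")
print(ctx.turns)
```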
Training the high-level policy with synthetic data
Training robots to follow complex, open-ended prompts requires more than demonstration data labeled with atomic instructions; such data alone is unlikely to provide enough rich, multi-step interaction examples. To fill this gap, the Physical Intelligence team proposed a synthetic annotation approach: pairing the robot's observations and human-labeled skills with hypothetical prompts and human interjections. This simulates real-world interactions and helps the model learn to interpret and respond to complex instructions.
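A hedged sketch of what such synthetic annotation could look like: segments of robot data that were labeled with an atomic skill are paired with hypothetical user prompts and interjections, producing (context, next command) training examples for the high-level policy. The field names and prompt templates below are assumptions for illustration, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TrainingExample:
    images: List[str]            # paths to base/wrist camera frames for the segment
    prompt: str                  # hypothetical open-ended user prompt
    interjection: Optional[str]  # optional mid-task user feedback
    target_command: str          # the human-labeled atomic skill for this segment


def make_examples(segment_images: List[str], labeled_skill: str) -> List[TrainingExample]:
    # Hypothetical prompts that the labeled skill could plausibly be a step of.
    hypothetical_prompts = [
        "clean up the table",
        "make me a ham and cheese sandwich",
    ]
    examples = []
    for prompt in hypothetical_prompts:
        examples.append(TrainingExample(segment_images, prompt, None, labeled_skill))
        # Also simulate an interjection that leaves the labeled skill unchanged.
        examples.append(TrainingExample(segment_images, prompt, "please hurry up", labeled_skill))
    return examples


print(make_examples(["frame_000.jpg"], "pick up a slice of whole wheat bread"))
```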
The Physical Intelligence team evaluated Hi Robot on real-world tasks such as clearing tables, making sandwiches, and grocery shopping, and compared it with previous methods. The results show that Hi Robot outperforms both GPT-4o and flat VLA policies. In the team's quantitative evaluation, Hi Robot's instruction-following accuracy is 40% higher than GPT-4o's, indicating stronger alignment with user prompts and real-time observations. Hi Robot also outperforms the flat VLA policy at handling multi-step instructions, adapting to real-time corrections, and adhering to constraints.
Reasoning like humans
Intelligent and flexible robot systems need not only to execute dexterous tasks, but also to understand their environment and reason about complex, multi-step problems. On the surface, Hi Robot is about interacting with users through prompts and feedback, but the deeper goal of the system is to give robots an "inner voice" like the one you hear when working through a hard problem such as cooking a new dish.
Robots that can think through complex problems and apply knowledge learned from large-scale web pre-training will be more flexible, exhibit significantly better commonsense reasoning, and, in the long run, provide more natural assistance in open-world environments. They will understand what it means when someone writes "do not erase" on a whiteboard, know not to disturb someone who is sleeping, and recognize that fragile items should be handled carefully. These are the kinds of inferences we make every day, based not only on personal experience but also on what we have learned from others.
LLMs and VLMs give us powerful tools for learning this kind of knowledge from the internet, but the huge technical challenge is to connect that knowledge seamlessly to physical systems such as robots. The Physical Intelligence team hopes that Hi Robot is an important step in this direction.
Reference link: https://www.pi.website/research/hirobot
This article is from the WeChat public account "Academic Headlines", compiled by Chen Xiaoyu, and published with authorization from 36Kr.




