Every wave of AI innovation has lifted robotics along with it, and the current AI boom is no exception. However, most of the robots now emerging are specialized, and their intelligence is likewise confined to a single field. The limitation of this approach is that R&D results cannot be reused: the models and hardware a robot relies on are suitable only for a very narrow domain.
Recently, a prototype of a general robot brain has emerged. A robotics company called Physical Intelligence has trained a general robot foundation model called π0 (pi-zero), whose intelligence is applicable to essentially any robot application. This means that once this type of general model matures, developing a specialized robot for a given field will, at least for the "brain", require only fine-tuning on industry data, much as a software entrepreneur targeting a niche market can simply fine-tune GPT-4.
Physical Intelligence closed two rounds of financing in 2024. In March, Thrive Capital led its $70 million seed round, with participation from Khosla Ventures, Lux Capital, OpenAI, and Sequoia Capital; in November, Jeff Bezos, OpenAI, Thrive Capital, Lux Capital, Bond Capital, Khosla Ventures, and Sequoia Capital jointly invested in a new $400 million round, bringing its valuation to $2.4 billion.
Earlier, Jeff Bezos backed Figure AI's $675 million financing and Skild AI's $300 million Series A, and Amazon absorbed the Covariant AI team. OpenAI participated in Figure AI's round and also led an early-stage investment of $23.5 million in the humanoid robot maker 1X. Investment institutions and tech giants are broadly optimistic about robotics.
A group of scientists comes together to build a general robot brain
The core team of Physical Intelligence comes from universities such as UC Berkeley and Stanford University, as well as top tech companies such as Tesla, Google DeepMind, and Stripe.
Karol Hausman, co-founder and CEO, is an adjunct professor at Stanford University and was previously a robotics research scientist at Google Brain, with over 13,000 citations. Co-founder Sergey Levine is an associate professor at UC Berkeley and a leading expert in robotics, with roughly 150,000 citations. Co-founder Chelsea Finn is an associate professor at Stanford University, with 63,000 citations.
The founding team also includes former Google research scientist Brian Ichter and former Stripe executive and well-known tech investor Lachy Groom.
Physical Intelligence's vision: users will be able to have a robot carry out any task they want, just as they use a chat assistant backed by a large language model.
What does a general robot foundation model mean for the industry?
Currently, AI applications can be roughly divided into two types: those that interact with humans in virtual space, and those that interact directly with the physical world. The first category includes chatbots, enterprise AI search and agents, legal AI, coding AI, and other vertical-industry AI.
Interacting directly with the physical world is mainly done through robots and autonomous vehicles. Robot applications, in turn, can also be divided into specialized and general types.
Today, most robots are of the "specialized" type: they can adapt to a small number of changes within a constrained environment, but they struggle to cope with messier, more complex real-world environments such as homes. There are also some general-purpose robots, for example certain humanoid robots, which are designed to handle most of the things humans can do rather than being confined to a single narrow scenario.
The structure of a robot can be roughly divided into a "brain", a "cerebellum", "eyes" and "limbs". The "brain" is the robot's central nervous system, responsible for understanding external instructions and making decisions, generally using a general or specialized model; the "cerebellum" is the control system, translating the brain's decisions into commands for the "limbs" and controlling their movement; the "limbs" are the parts that interact directly with the physical world, which may be humanoid, dog-like, a mechanical arm, or even a vehicle; and the "eyes" are the sensors through which the "brain" perceives the outside world.
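To make this division of labor concrete, here is a minimal sketch, in illustrative Python, of how a "brain", "cerebellum", "eyes" and "limbs" might hand data to one another. All class and method names are assumptions for illustration, not any real robot's API.

```python
# Illustrative only: a toy "brain -> cerebellum -> limbs" pipeline.
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:            # what the "eyes" (sensors) report each tick
    camera_rgb: np.ndarray    # e.g. an HxWx3 image
    joint_angles: np.ndarray  # current state of the "limbs"

class Brain:
    """High-level model: turns an instruction plus observations into a target."""
    def decide(self, instruction: str, obs: Observation) -> np.ndarray:
        # a real system would run a general or specialized model here;
        # we return a placeholder target pose of the same shape as the joints
        return np.zeros_like(obs.joint_angles)

class Cerebellum:
    """Control system: turns the brain's target into motor commands."""
    def control(self, target: np.ndarray, obs: Observation) -> np.ndarray:
        # simple proportional controller as a stand-in for a real control stack
        return 0.5 * (target - obs.joint_angles)

def step(brain: Brain, cerebellum: Cerebellum, instruction: str, obs: Observation) -> np.ndarray:
    target = brain.decide(instruction, obs)    # "brain": understand and decide
    command = cerebellum.control(target, obs)  # "cerebellum": convert to motor commands
    return command                             # sent to the "limbs" (actuators)

obs = Observation(camera_rgb=np.zeros((224, 224, 3)), joint_angles=np.ones(7))
print(step(Brain(), Cerebellum(), "pick up the cup", obs))
```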
All of these parts are being advanced by large companies and top startups, but while the "cerebellum", "eyes" and "limbs" gradually matured over previous waves of robotics, the robot "brain" is still at an early stage.
Vertical-scenario robots such as cleaning robots, spraying robots, delivery robots, and warehouse handling robots have only the specialized intelligence their scenario demands; their "brain" models can understand and handle situations only within that limited scope. Even earlier specialized robots could only execute fixed actions and required extensive manual programming.
A general robot brain model can change this situation to a certain extent: robots can learn to follow user instructions, programming new behaviors becomes very simple, and robots can adjust their behavior to suit the environment.
For any vertical robot startup, having a general robot brain model plus its own industry-specific data is enough to fine-tune a robot brain adapted to its particular application scenario. The logic is exactly the same as large language model + industry-specific data = powerful industry model.
At a deeper level, a general robot foundation model would also help in the pursuit of artificial general intelligence (AGI). AI researchers have found that the returns from scaling are weakening because models have hit a "data wall": almost all existing high-quality data has already been used for training, and the models lack more and better data. A general robot model that constantly interacts with the physical world, continually encountering and solving complex situations, would keep generating high-quality data and move us ever closer to AGI.
What new methods are needed to train a general robot foundation model?
Physical Intelligence's current prototype of a general robot foundation model is called π0 (pi-zero). It is trained on a wide variety of data and can execute diverse text instructions. Unlike a large language model, it integrates images, text, and actions, acquires physical intelligence from experience accumulated through real robot operation, and outputs low-level motor commands. It can control many types of robots, and it can either be prompted to perform a required task or be fine-tuned for complex application scenarios.
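As a rough illustration of what "integrating images, text and actions" means at the interface level, here is a hedged Python sketch of the inputs and outputs such a vision-language-action policy might expose. The class name, method names, and shapes are assumptions, not Physical Intelligence's actual API.

```python
# Illustrative only: the rough shape of a vision-language-action (VLA) policy.
from typing import List
import numpy as np

class VLAPolicy:
    def __init__(self, action_dim: int, horizon: int = 50):
        self.action_dim = action_dim  # motor commands per timestep (robot-specific)
        self.horizon = horizon        # how many future timesteps are predicted at once

    def act(self, images: List[np.ndarray], instruction: str,
            proprio: np.ndarray) -> np.ndarray:
        """Map camera images + a text instruction + joint state to a chunk of
        low-level motor commands that could be streamed at up to 50 Hz."""
        # a real model would run a VLM backbone plus an action head here
        return np.zeros((self.horizon, self.action_dim))

# The same policy class could drive different robot bodies simply by setting
# action_dim to match each robot's actuators.
policy = VLAPolicy(action_dim=7)
chunk = policy.act(images=[np.zeros((224, 224, 3))],
                   instruction="fold the shirt on the table",
                   proprio=np.zeros(7))
print(chunk.shape)  # (50, 7)
```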
Physical Intelligence used some special training strategies when training the π0 model.
The first is cross-robot hybrid training: π0 combines internet-scale vision-language pre-training, open-source robot manipulation datasets, and Physical Intelligence's own precision-task data collected on 8 different robots, so that it can perform a wide variety of tasks via zero-shot prompting or fine-tuning.
These datasets contain diverse tasks, each exhibiting rich motion primitives, different objects, and varied scenes, and together they cover different dimensions of robot dexterity. Physical Intelligence's goal in selecting these tasks was not to solve one specific application but to give the model a general understanding of physical interaction, laying an initial foundation for physical intelligence.
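A simple way to picture cross-robot hybrid training is as weighted sampling from a mixture of heterogeneous datasets. The Python sketch below, with made-up dataset names, weights, and placeholder records, only illustrates the bookkeeping, not Physical Intelligence's actual data pipeline.

```python
# Illustrative only: each training batch is drawn from a weighted dataset mixture.
import random

# placeholder examples; in practice these would be (image, text, action) records
datasets = {
    "web_vision_language": ["vl_0", "vl_1", "vl_2"],    # internet-scale VLM pre-training data
    "open_source_robot":   ["oss_0", "oss_1"],          # public robot manipulation data
    "in_house_precision":  ["robot_a_0", "robot_b_0"],  # data collected on several robot types
}
mixture_weights = {"web_vision_language": 0.4,
                   "open_source_robot": 0.3,
                   "in_house_precision": 0.3}

def sample_batch(batch_size: int = 4):
    """Draw a batch whose composition follows the mixture weights."""
    names = list(datasets)
    weights = [mixture_weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        source = random.choices(names, weights=weights, k=1)[0]
        batch.append((source, random.choice(datasets[source])))
    return batch

print(sample_batch())  # e.g. [('web_vision_language', 'vl_1'), ...]
```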
The second is internet-scale semantic understanding: the starting point of training is a vision-language model (VLM). VLMs transfer semantic knowledge from the web effectively, but they can only output discrete language tokens, whereas precise robot manipulation requires π0 to output motor commands at high frequency (up to 50 times per second).
To achieve this flexibility, Physical Intelligence augments the VLM with flow matching (a variant of diffusion models) so that it can output continuous action commands. The result is a vision-language-action flow matching model, which is then further trained on high-quality robot data to solve a range of downstream tasks.
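To show what "flow matching for continuous actions" looks like mechanically, here is a toy Python sketch that integrates a velocity field from Gaussian noise to a chunk of continuous commands. The velocity field below is a closed-form stand-in that pulls samples toward a fixed target trajectory; in π0 this field would be a learned network conditioned on images and text, so everything here is an illustrative assumption.

```python
# Illustrative only: generating an action chunk by integrating a flow-matching ODE.
import numpy as np

HORIZON, ACTION_DIM, STEPS = 50, 7, 10  # e.g. 50 future commands of 7 joints each

# assumed "expert" action chunk that the flow should reproduce
target = np.tile(np.linspace(0.0, 1.0, HORIZON)[:, None], (1, ACTION_DIM))

def velocity_field(x: np.ndarray, t: float) -> np.ndarray:
    # In flow matching, a network v_theta(x, t) is trained so that integrating it
    # from t=0 (noise) to t=1 yields a data sample. Here we use the closed-form
    # straight-line field that transports any x toward `target` in the remaining time.
    return (target - x) / (1.0 - t + 1e-3)

actions = np.random.randn(HORIZON, ACTION_DIM)  # start from Gaussian noise
for i in range(STEPS):                          # Euler integration of the ODE
    t = i / STEPS
    actions = actions + (1.0 / STEPS) * velocity_field(actions, t)

print(np.abs(actions - target).max())  # the chunk converges toward the target trajectory
```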
The final step is fine-tuning for precision operations: more complex, precise tasks require fine-tuning the model on high-quality data, much like the fine-tuning stage of large language models. Pre-training gives the model broad knowledge of the physical world, while fine-tuning makes it excel at specific tasks.
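As a rough analogy to LLM fine-tuning, the sketch below (using PyTorch) continues training a placeholder pre-trained policy on a small set of high-quality demonstrations for one task. The model, data, loss, and hyperparameters are all assumptions for illustration, not π0's actual recipe.

```python
# Illustrative only: fine-tuning a pre-trained policy on task-specific demonstrations.
import torch
import torch.nn as nn

pretrained_policy = nn.Sequential(      # stand-in for a pre-trained backbone + action head
    nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 7)
)
optimizer = torch.optim.AdamW(pretrained_policy.parameters(), lr=1e-5)  # small LR for fine-tuning

# assumed high-quality demonstrations for one task: (observation features, expert action)
obs = torch.randn(64, 128)
expert_actions = torch.randn(64, 7)

for epoch in range(10):
    pred = pretrained_policy(obs)
    loss = nn.functional.mse_loss(pred, expert_actions)  # imitate the expert actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(float(loss))  # training loss after the last epoch
```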
Of course, π0 is not the only general robot foundation model. Physical Intelligence evaluated π0 alongside other general robot foundation models under zero-shot conditions on real-world tasks such as folding clothes, taking toast out of a toaster, and packing miscellaneous items, to compare their problem-solving ability. The results show that both π0 and the smaller π0-small significantly outperform existing models such as OpenVLA.
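For context, a zero-shot comparison like this can be organized as a simple evaluation harness: each model attempts each task several times and success rates are tallied. In the Python sketch below, run_episode is a stub with random outcomes, purely to show the bookkeeping; no real results are implied.

```python
# Illustrative only: tallying zero-shot success rates across models and tasks.
import random
from collections import defaultdict

MODELS = ["pi0", "pi0-small", "OpenVLA"]  # model names from the text
TASKS = ["fold clothes", "take toast out of toaster", "pack items"]

def run_episode(model: str, task: str) -> bool:
    """Stub: would prompt `model` with `task` on a real robot and check success."""
    return random.random() < 0.5  # placeholder outcome, not a real result

def evaluate(trials: int = 10):
    success = defaultdict(dict)
    for model in MODELS:
        for task in TASKS:
            wins = sum(run_episode(model, task) for _ in range(trials))
            success[model][task] = wins / trials  # zero-shot success rate
    return success

print({model: rates for model, rates in evaluate().items()})
```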
For example, in tasks such as folding clothes, clearing a dining table, and assembling boxes, robots powered by π0 can separate tangled garments and fold them neatly, place utensils or cups in a cleaning tray and drop trash into the bin, and pick up a flat cardboard box, fold it into shape, and tuck in the folded edges. These are not simple single-step actions but complex household or production activities.
As of now, however, π0 is still only a prototype, and general robot foundation models remain at an early stage. Physical Intelligence says it will continue collecting data and training the model to reach new levels of dexterity and physical capability.
In terms of commercialization, Physical Intelligence has made no obvious moves yet.
China's robot industry needs core technology and more applications
Why are top investment institutions and tech giants like Jeff Bezos betting on robots? The answer is likely the one given earlier: robots combined with AI can explore the physical world, generate a large amount of real, high-quality data, and ultimately help achieve AGI.
In fact, they are not just investing but also building. Besides Tesla's Optimus, Nvidia has NVIDIA Project GR00T, a suite of general robot model tools, and Amazon has Sparrow (a warehouse robotic arm system) and deploys Agility Robotics' bipedal robot Digit in its warehouses.
Among startups, Figure AI mainly develops the Figure 01 and Figure 02 humanoid robots, whose "brains" use models customized in partnership with OpenAI. They are highly versatile, able not only to handle everyday skills like making coffee but also to do factory work such as driving screws.
Skild AI mainly develops Skild Brain and mobile manipulation platforms; Skild Brain is a general robot brain similar to π0.
1X develops NEO Beta, a bipedal humanoid robot designed for home use, while Vayu Robotics makes the Vayu One delivery robot along with Vayu Drive, a foundation model for robot mobility.
Currently, China still lags behind the US in core algorithms and advanced motion control for robots, but across the "brain", "cerebellum", "eyes" and "limbs", and across both specialized robots and general-purpose humanoid and quadruped robots, large companies and top startups are working hard to innovate and explore. They include Alibaba, Xiaomi, Xpeng, DJI, and Unitree (Yushu), among others.
China not only has a huge market and rich application scenarios; its robot density is still relatively low, which points to huge latent demand. For robot entrepreneurs, the domestic market alone offers enough room to grow, and after "winning the domestic market" they can expand internationally.
In terms of entrepreneurial direction, breakthroughs are needed in foundational, core areas such as the "brain" and "cerebellum", but even more innovators are needed across application scenarios. Applications and foundational technologies reinforce each other and can keep the whole robot innovation and entrepreneurship ecosystem healthy. As an angel investment institution, Alpha Community hopes to discover extraordinary entrepreneurs in intelligent robotics and help the next world-class robot company grow and thrive.
This article is from the WeChat official account "Alpha Community" (ID: alphastartups), author: Discovering Extraordinary Entrepreneurs, published by 36Kr with authorization.