OpenAI places a big bet: the NEO robot's world model makes its debut. Is this the ChatGPT moment for robotics?

36kr
2 days ago

[Introduction] Just now, 1X, a humanoid robot startup heavily backed by OpenAI, finally revealed the "world model" behind NEO: trained on real data, it can generate behavioral predictions for different scenarios. The ChatGPT moment for robotics may really be coming.

Earlier this month, 1X, the humanoid robot startup in which OpenAI has invested heavily, released the official announcement video for NEO.

Its first appearance astonished everyone.

Not only was it nicknamed "a man in a suit" for its appearance; in ability, it helped the protagonist carry her bag and cooked alongside her, a truly general-purpose household robot.

It is designed to take over the household tasks we would rather not do, such as cleaning and tidying up.

Half a month later, 1X finally released the "world model" behind NEO.

With this virtual world simulator, NEO can predict useful object interactions.

In short, it can generate predicted video of the robot acting in a variety of environments.

For example, deformable objects such as T-shirts being folded or curtains being drawn are ubiquitous in our homes, yet they are difficult to reproduce in a virtual world simulator.

Interestingly, Eric Jang, VP of AI at 1X, said they placed a full-length mirror in the office so that the model could learn to recognize itself in the mirror.

NEO can now observe itself in a mirror, though no self-awareness has awakened yet.

By understanding and interacting with the world, the 1X "world model" can generate high-fidelity video and, within a neural network, re-plan, simulate, and evaluate those futures.

This is why world models matter so much for robots.

1X founder and CEO Bernt Bornich said that this first demonstration with humanoid robot data is a significant advance for the Scaling Law.

Ted Xiao, a senior researcher at Google DeepMind Robotics, said 1X's "learned" world model can be continuously improved with remarkable physical-interaction data.

- World models are likely the only way forward for repeatable, scalable evaluation in multi-agent environments. (See the successful case of world-model evaluation in autonomous driving.)

- Building a world model with 2024 AI technology will be easier than it was with last year's.

- Once a world model is good enough for evaluation, at least 90% of the training work is likely already done.

The robot "world model" is here!

To put it simply, a world model is a computer program.

It can imagine how the world evolves in response to an agent's actions.
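The idea can be reduced to a tiny sketch: given a current observation and a candidate action sequence, roll the model forward to "imagine" the resulting futures. A real system like 1X's predicts video frames with a large neural network; here a hand-written toy dynamics function stands in for the learned model, purely for illustration.

```python
def toy_world_model(state, action):
    """Predict the next state from the current state and an action.
    In a learned world model this would be a neural network; here it
    is a stand-in: the action nudges the state by half its magnitude."""
    return state + 0.5 * action

def rollout(initial_state, actions):
    """Imagine how the world evolves under a sequence of actions."""
    states = [initial_state]
    for a in actions:
        states.append(toy_world_model(states[-1], a))
    return states

# Two different action sequences from the same start yield different
# imagined futures, e.g. "go to the left door" vs. "go to the right door".
futures_left = rollout(0.0, [-1, -1, -1])
futures_right = rollout(0.0, [+1, +1, +1])
```

The key property this toy preserves is that the same initial state branches into different predicted futures depending only on the actions fed in.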

Building on video generation and on world-model research for autonomous vehicles, 1X trained its own world model as a virtual simulator for NEO.

Starting from the same initial image sequence, the 1X world model can predict multiple possible futures depending on the robot's actions.

Left: Go to the door on the left; Middle: Play air guitar; Right: Go to the door on the right

For an embodied robot, nothing is more essential than the ability to interact with the physical world.

In a world this complex, interacting effectively is the hard part.

The world model helps NEO perform precise interactions involving rigid bodies, the effects of falling objects, partially visible objects (cups), deformable objects (curtains, clothes), and articulated objects (doors, drawers, chairs).

It can place dishes in the drying rack.

It can also draw the curtains.

Take things out of drawers, etc.

The embodied-robot problem: evaluation

Additionally, world models address a very practical but often overlooked challenge when building general-purpose robots: evaluation.

If you train a robot to perform 1,000 unique tasks, it would be hard to tell whether a new model actually improves on the previous model on all of them.

Even more troubling is that even if the model weights are the same, performance can degrade in just a few days due to slight changes in the environmental background or ambient lighting.

Researchers trained a model for a robot to fold T-shirts and found that its performance gradually degraded over 50 days.

Furthermore, if the environment is constantly changing, the reproducibility of the experiment becomes a problem.

In particular, the problem becomes more difficult when evaluating multitasking systems in environments such as homes and offices.

Due to these factors, it becomes extremely difficult to start rigorous robotics research in the real world.

With precise measurements, one can predict how an AI system's capabilities will grow as data, compute, and model size are scaled up.

The Scaling Law has become strong support for improving general-purpose AI systems such as ChatGPT.

Therefore, if robotics is to have its own "ChatGPT moment", it must first establish its own "Scaling Law".
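Concretely, "establishing a scaling law" means measuring performance at several resource scales, fitting a power law, and extrapolating. The sketch below uses entirely synthetic numbers (not 1X data) to show the mechanics: a power law L = a * N^b is a straight line in log-log space, so an ordinary linear fit recovers its parameters.

```python
import numpy as np

data_sizes = np.array([1e3, 1e4, 1e5, 1e6])  # e.g. hours of robot data (synthetic)
losses = 2.0 * data_sizes ** -0.25           # synthetic "measured" losses

# log L = b * log N + log a, so a degree-1 fit in log space recovers b and a.
b, log_a = np.polyfit(np.log(data_sizes), np.log(losses), 1)
a = np.exp(log_a)

# Extrapolate to a scale that has not been measured yet.
predicted_loss_at_1e7 = a * 1e7 ** b
```

The value of such a fit is exactly the predictive power described above: once b and a are pinned down from small-scale runs, you can forecast what ten times more data should buy before collecting it.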

Learn from raw data and predict future scenarios

Physics-based simulation engines such as Bullet, MuJoCo, Isaac Sim, and Drake have become a reasonable way to quickly test robot policies.

Moreover, these simulators can be reset and reused, allowing researchers to carefully compare different control algorithms.

However, these simulators are designed primarily for rigid-body dynamics and require a great deal of manual asset creation.

So, how do you get a simulated robot to open a box of coffee filters, cut fruit with a knife, unscrew a jar of jam, or interact with humans or other AI agents?

In a home environment, everyday objects and pets are difficult to simulate, and realistic scenes for training robots are in extremely short supply.

Therefore, small-scale real or simulated evaluations on a limited number of tasks cannot accurately predict a robot's performance in the real world.

In other words, robots trained this way struggle to generalize broadly in the real world.

The 1X research team took a new approach to robot evaluation:

Learn simulations directly from raw sensor data and use it to evaluate robot policies across millions of scenarios.
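The evaluation loop this enables can be sketched in a few lines: roll each candidate policy through many sampled scenarios inside the learned simulator and score the outcomes. Everything below is an illustrative toy, not 1X's actual system; the one-step `learned_simulator` stands in for a neural world model, and the policies and goal are invented for the example.

```python
import random

def learned_simulator(state, action):
    # Stand-in for the learned world model's one-step prediction
    # (a real system would predict video frames with a neural network).
    return state + action + random.gauss(0, 0.01)

def evaluate_policy(policy, n_scenarios=1000, horizon=10, goal=5.0, seed=0):
    """Average final distance-to-goal for a policy across many
    rollouts inside the learned simulator."""
    random.seed(seed)
    total = 0.0
    for _ in range(n_scenarios):
        state = random.uniform(-1.0, 1.0)  # a sampled initial scenario
        for _ in range(horizon):
            state = learned_simulator(state, policy(state, goal))
        total += abs(goal - state)
    return total / n_scenarios

def greedy(state, goal):
    # Move toward the goal with a bounded step.
    return max(-1.0, min(1.0, goal - state))

def lazy(state, goal):
    # A baseline policy that does nothing.
    return 0.0
```

Because the simulator can be reset and reseeded at will, two policies can be compared on exactly the same distribution of scenarios, which is precisely the reproducibility that real-world home and office evaluation lacks.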

The advantage of this "world model" approach is that all the rich complexity of the real world comes directly from the data, with no need to manually create assets.

Over the past year, the 1X team has collected over 5,000 hours of EVE humanoid robot data.

The data includes scenarios where the robot performs various mobile manipulation tasks and interacts with people in home and office environments.

They then combined the video and motion data to train a world model.

This model is powerful: conditioned on what it observes and the actions it takes, it can generate video that predicts future scenes.

Controllable motion: playing air guitar in its head

The 1X world model can generate diverse outputs based on different action instructions.

The following figure shows the various results generated based on four different action sequences. These action sequences all start from the same initial frame.

As before, the examples shown were held out of the training data.

The main value of a world model is its ability to simulate interactions between objects.

In subsequent simulations, the researchers gave the model the same initial scene and three different sets of box-grasping actions.

In each simulated scenario, the grasped box is lifted and moves with the robot, while the ungrasped boxes remain in place.

Even without specific action instructions, the world model can generate videos that look plausible.

For example, it can avoid pedestrians and obstacles while moving forward, which is a very reasonable behavior.

Simulating T-shirt folding, even for long-horizon tasks

In addition, the 1X world model can generate long videos.

As shown in the example at the beginning, NEO simulates a complete T-shirt folding demonstration.

It is worth mentioning that deformable objects such as T-shirts are notoriously difficult to model in rigid-body simulators.

Current Problems

However, 1X's world model also has some problems.

Object consistency

For example, when the model interacts with an object, it may fail to keep the object's shape and color consistent.

Especially when objects are occluded or seen from unfavorable angles, their appearance may become distorted in the generated video.

Sometimes, objects disappear completely.

For example, while picking up a red ball and placing it on a plate, the ball inexplicably disappeared mid-action.

Laws of physics

Moreover, the model does not fully understand the basic laws of the physical world.

Sometimes it shows a natural grasp of physical properties, such as a spoon falling to the table once the robotic gripper releases it.

But in many cases the generated results violate the laws of physics, as in the example below, where a plate hangs suspended in mid-air.

This shows that the world model does not yet understand that all objects are subject to the downward pull of gravity.

Self-awareness

In addition, the researchers had the EVE robot walk in front of a mirror to see whether the model would generate a reflection matching its movements.

Unexpectedly, when it raised one arm, the mirror image did not move in sync.

It can be seen that the current 1X model has no self-awareness.

References:

https://x.com/ericjang11/status/1836096888178987455

https://x.com/1x_tech/status/1836094175630200978

This article comes from the WeChat public account "Xinzhiyuan" , edited by Taozihaokun, and published by 36Kr with authorization.
