How can embodied intelligence reach its "ChatGPT moment"? The Director of the Beijing Academy of Artificial Intelligence, a Tsinghua University professor, and three startup founders discussed this question.


Text | Fuchong

Edited by Su Jianxun

Embodied intelligence is awaiting its "ChatGPT moment." However, there is still much disagreement within the industry regarding the specific definition of this moment.

Recently, at the Force Intelligence Technology Open Day roundtable forum, five leading figures from AI industry, academia, and research shared their insights on this issue. They are:

Wang Yu, tenured professor in the Department of Electronic Engineering, Tsinghua University

Wang Zhongyuan, Director of Beijing Academy of Artificial Intelligence

Jiang Daxin, Founder & CEO of Jieyue Xingchen

Gao Jiyang, Founder & CEO of Xinghai Chart

Tang Wenbin, co-founder & CEO of Yuanli Lingji

Jiang Daxin, founder and CEO of Jieyue Xingchen, was the first to propose a defining standard for the "ChatGPT moment": zero-shot generalization — given an instruction it has never seen before, the AI can still answer the question or complete the task. This is precisely the capability that large language models demonstrated.

However, Jiang Daxin immediately pointed out that because the generalization of embodied intelligence involves more dimensions such as scenarios, tasks, and manipulated objects, it is still very difficult for robots to reach this standard.

As the CEO of a robotics startup, Gao Jiyang further explained the difficulties in commercializing embodied intelligence: large language models can treat the "model as the product," with phones and computers as terminals and the Internet as the channel; embodied intelligence, however, must travel a much longer industrial chain — complete machines, the supply chain, real-device data, and offline delivery, none of which can be skipped.

Given these unsolved problems, Tang Wenbin, co-founder and CEO of Yuanli Lingji, proposed a more attainable "ChatGPT moment of embodied intelligence": first close the loop on every problem within a limited scenario, and make the ROI add up.

His reasoning is simple: ChatGPT demonstrated that language models are usable as tools; for the same shift to occur, embodied intelligence must likewise transform from a toy and research project into something useful.

The roundtable therefore reached a preliminary consensus on the current direction for embodied intelligence: before pursuing stronger generalization, first validate a vertical scenario, let robots generate a real data flywheel through actual work, and then feed that data back into model and system iteration.

This line of thinking also explains the path chosen by the roundtable's organizer, Force Intelligence: before the data flywheel can turn, there must be a unified standard for evaluating performance on real devices. So before releasing its own model, Force Intelligence first partnered with Hugging Face to launch the real-device evaluation benchmark RoboChallenge.

Founded in March 2025, Yuanli Lingji was established by Tang Wenbin, a former co-founder of Megvii Technology. The company's core team also includes several former core members of Megvii. In less than a year, Yuanli Lingji has raised nearly 1 billion yuan in funding, with shareholders including Alibaba, NIO Capital, and Lenovo Capital.

On February 10th, this capital-market favorite submitted its first model, DM0, a 2.4B-parameter model that topped the RoboChallenge leaderboard. Naturally, questions arose: can the party that initiates a benchmark also compete on it? Tang Wenbin addressed these questions at the roundtable, explaining why the benchmark was released before the model, why real-device testing matters, and responding to industry doubts.

The following is the content of this roundtable discussion, compiled by the author:

△Roundtable forum guests, photo: Force Intelligence

Host: From a global perspective, what are the mainstream technical approaches to embodied intelligence models, and what stage are we currently at?

Wang Zhongyuan: Behind the hype surrounding embodied intelligence, I see many hidden concerns. Although the hardware itself is advancing rapidly, there are still a series of issues to be resolved, such as continuous and stable operation, security, and battery life.

Regarding models, although we have released a series of embodied models over the past year, we feel we are still far from the "ChatGPT moment" for embodied intelligence. Especially once embodied models and hardware were deployed on real devices, we found a significant gap remains between them and the large-scale applications we truly hope for.

The technical route for embodied models is still taking shape. Commonly discussed approaches include modular systems (a VLM plus a controller), end-to-end VLA, and the currently popular world models. But I believe we are far from the point where we can proudly say embodied intelligence has achieved a complete breakthrough.

Therefore, it's highly likely that what we'll see next is that we'll solve each scenario one by one using VLA and reinforcement learning. We'll start by doing real work, accumulating more data on real devices to form a data loop, and then finally address the generalization problem.

Wang Yu: I work more on the hardware side, including computing power, frameworks, edge computing, and infrastructure. From my perspective, while current robotic applications have made great progress, they are still limited to a single workbench. It remains quite difficult to coordinate the "brain" (task planning) and the "cerebellum" (motor control) to complete a slightly longer task spanning multiple modalities.

In our group we often discuss how a robot should actually carry out its work. Take cleaning a house: it isn't just folding clothes. The robot has to observe the overall state of the house, figure out how it should be cleaned, and then work through it step by step until the entire house is thoroughly clean — a very difficult task.

Of course, the model needs breakthroughs, but I also wonder whether buildings themselves will need to change to support such complex tasks. Coming from a hardware background, I sometimes think about whether architecture should be adapted for a future with robots, since it was originally designed only for human life. Just as vehicle-to-everything (V2X) systems support autonomous driving, we can also build infrastructure to help robots.

Host: Professor Wang is talking about how our next-generation housing standards might incorporate robotics. Since we've touched on infrastructure, Professor Wang, what are your thoughts on the current strengths and weaknesses of China and Silicon Valley in the field of embodied intelligence?

Wang Yu: The United States started earlier on models and data, and has made some investments and breakthroughs in applications. But when it comes to deployment, I firmly believe China can catch up quickly, especially since China has already invested more heavily than the United States in embodied intelligence.

Many people say that embodied technology is a bubble, but personally, I think it's a good thing that we've finally found a direction, and China's investment intensity is greater than that of the United States. This is because China's entire industrial and supply chains are complete. If we open up more applications and increase investment in models and applications, it's possible that we can achieve breakthroughs in the field of embodied technology faster than the United States.

Furthermore, I feel that there is gradually more collaboration between academia and industry in China now, as I am doing here myself. It's not that professors are sitting in their offices reading papers and doing research, but rather that industry encounters problems and then collaborates with research institutions. This collaborative approach, I personally think, is gradually aligning with the US model, where industry, academia, and research work together to advance embodied technologies.

Host: We've observed a phenomenon. The Super Bowl, often called the American equivalent of the Spring Festival Gala, featured a lot of LLM advertising. At China's Spring Festival Gala, however, it's robots that take the stage. Director Wang Zhongyuan, do you have any comments on this?

Wang Zhongyuan: Let me share two short stories I heard.

The first is a tale told to me by an investor. American investors in embodied intelligence often look for Chinese members on startup teams; they see Chinese team members as a signal that the startup has a real chance of succeeding in embodied intelligence.

Another anecdote is that when we were iterating on our embodied intelligence model, a very painful aspect was the frequent hardware failures. When hardware broke down, the repair process often took two weeks. However, we heard that in the US, robot hardware could take three months to repair, which instantly made us feel much more at ease.

Therefore, on the one hand, we can see that China does have an advantage in manufacturing, which is an advantage for us in the field of embodied intelligence. On the other hand, the entire industry is still in its early stages, and everyone is in a phase of rapid development and iteration, so it is far from time to determine who is superior or inferior.

Host: We've discussed the "Chinese content" metric for embodied startups in the US. Looking at the AI industry as a whole, a crucial milestone is the "ChatGPT moment." So what do you think constitutes the "ChatGPT moment for embodied intelligence"? Mr. Jiang Daxin, you've had remarkable success with large models — do you have a deeper insight into the "ChatGPT moment"?

Jiang Daxin: Let's start by defining the "ChatGPT moment." I think its most iconic feature is zero-shot generalization: given any instruction, even one it has never seen before, the AI can answer the question. This is completely different from traditional natural language processing, which is why the "ChatGPT moment" was so exciting.

However, if we compare natural language and embodied intelligence, I think the "ChatGPT moment of embodied intelligence" will be more difficult.

First, regarding the definition of the problem itself, I think the generalization of embodied intelligence can be defined from different dimensions. Different dimensions of generalization lead to a lack of consensus among different people regarding the "ChatGPT moment of embodied intelligence".

The first dimension is scenario generalization — whether the setting is closed, semi-closed, or fully open. The second is task generalization — navigation, grasping, household chores. The third is object generalization — even for a simple grasping action, the target object can be rigid or deformable.

Secondly, from a technical perspective, embodied intelligence involves computer vision, where some fundamental questions still lack consensus. For example: how exactly should vision be encoded, how should self-supervised pre-training be done, and how should reasoning be conducted in 3D space? I think these questions require breakthroughs before we can reach the ChatGPT moment.

Host: The definition is crucial for the "ChatGPT moment of embodied intelligence." So how do our two guests who are actually building embodied intelligence companies define it?

Gao Jiyang: I think this is a question particularly worth discussing. We may have a more fundamental issue: although embodied intelligence and language models both originate from breakthroughs in AI technology, the two industries look quite different once you examine them closely.

Embodied intelligence has a longer chain from technology development to product planning and commercialization. It involves upstream and downstream component supply chains and data — and embodied data simply didn't exist before. Then comes algorithm development. Beyond that, the channels and terminals differ from those of large language models, which are distributed through phones and computers, with the Internet as the channel.

So you'll find that the scarcest — indeed the only missing — link in the entire industry chain for large language models is the model itself. The model is the product; once the model is good, the whole commercialization and industrialization chain takes shape.

Embodied intelligence, by contrast, faces challenges across all the areas just mentioned: the supply chain and component manufacturing; a lack of reliable real-world data without complete machines; and, since the terminal is the robot itself, the need to build offline channels.

Returning to the previous question, regarding the definition of the "ChatGPT moment of embodied intelligence," I believe that from the perspective of business production lines, it should be a moment when we truly see that it has commercial value within certain limited scopes.

I think 2026 will be a year of change, because complete machines and the supply chain have matured considerably after two years of preparation. We also have much more data, and advances in models and algorithms — reinforcement learning in post-training, VLA in pre-training, and the recent world models — have improved both pre-training generalization and post-training success rates.

So I believe this is the year applications need to close the loop. In the first half of 2025 we clearly saw embodied intelligence begin to take off, and in the second half it accelerated significantly — the number of open-source models in the community is one useful indicator.

2026 will be a year of explosive growth for embodied intelligence. That growth will inevitably spill over into certain application areas, and in turn drive the supply chain and complete-machine manufacturing. China in particular is significantly stronger than the United States here, with iteration cycles 5 to 10 times faster and costs 5 to 10 times lower, as mentioned earlier.

Tang Wenbin: I think Jiang Daxin's "ChatGPT moment" has very high requirements; this is already an AGI moment. Today, let's think about what the biggest shock ChatGPT brought us was. We used to treat it as a toy, but at that moment, we realized it was a tool; it became something usable.

So my definition of the "ChatGPT moment of embodied intelligence" is the moment it becomes useful and trustworthy. This also ties back to our company's mission.

Our definition of "useful" is very simple: it can be used in a limited scenario. But to truly close the loop on every problem, the ROI has to add up — only when the ROI clearly adds up can it be deployed at scale.

Only when that definition of "useful" is met can we truly transform a toy, or a research project, into a tool. That, I believe, is the "ChatGPT moment" of embodied intelligence. Current model capabilities have progressed very significantly, so I don't think it's far off.

Of course, after the ChatGPT moment, there will be the DeepSeek moment, which means when it will truly go mainstream. Today, embodied intelligent robots can tighten screws in warehouses and factories, but I think the general public can't really perceive it. Perhaps the DeepSeek moment will be one where everyone feels it. As for how to move from industrial logistics to commercial applications and to consumers, that moment will come a little later, but I don't think it's too far off.

Host: During your time at Megvii, the core team behind Force Intelligence experienced the 1.0 era of AI. Now, we've entered the era of embodied intelligence. Instead of releasing a model at the beginning, you first released RoboChallenge as a benchmark. So, how did you approach this issue?

Tang Wenbin: "Model as product" — but the model, algorithm, architecture, and data behind it are all still changing rapidly. What's badly missing right now is a complete technical foundation, whether that's data, the user-friendly hardware Director Wang Zhongyuan mentioned, or evaluation standards.

In today's embodied intelligence industry, everyone who works on algorithms knows that if you can't evaluate something, you certainly can't improve it. The evaluation standards available today — LIBERO, SimplerEnv, RoboTwin — are relatively small in scale. Many of these benchmarks have already been maxed out, but does a score of 99-point-something represent true current capability? Obviously not.

Therefore, we believe that we desperately need large-scale, real-world evaluations based on the physical world to guide us forward.

Force Intelligence has invested a lot of effort in building infrastructure on our Dexbotic embodied framework, and we hope to open-source parts of it as a contribution to the industry. Although we initiated RoboChallenge, everyone — including Director Wang Zhongyuan, Gao Jiyang, and Professor Wang from Tsinghua — is working on this evaluation together, and we hope more people in the industry will join in promoting it.

Host: Several of today's guests are RoboChallenge partners. As one of the first companies to join, Xinghai Chart donated hardware to RoboChallenge. What was the rationale behind this?

Gao Jiyang: Truly application-oriented, practical evaluation standards must be based on real devices.

I think the entire development of ChatGPT and language models has been driven by commercial demand — there is huge demand across the three major verticals of agents, coding, and chatbots.

Looking at embodied intelligence, vertical categories will form in the future, and they must come from real needs. Those real needs have to be reflected in real-device evaluation in order to create a fair, iterable environment for R&D companies and future customers.

AI is still largely an experimental science. It has principles and mathematical grounding, but ultimately many things must be tested. Testing requires feedback, and feedback requires evaluation. A very important factor in whether a company or organization — in AI or elsewhere — succeeds is its iteration efficiency, so we try every means to improve that efficiency and the quality of feedback. This is why I strongly agreed with and supported RoboChallenge when my senior colleague proposed it.

We already have our own internal benchmark system, in which the team iterates across ten different scenarios. But I think the industry should also have a universally applicable standard — one that involves academia as well, to better connect industry and research.

Host: RoboChallenge is very important, but its initial format was a bit strange. It was like high-achieving students creating their own test questions and then taking the exam themselves. How does Professor Wang evaluate this behavior of the students? (Editor's note: This refers to Yuanli Lingji, which is both the initiator of the Benchmark evaluation standard and a company that participated in the evaluation and achieved good results.)

Wang Yu: I think the learning model may change in the future — it won't necessarily be teachers teaching; students may learn on their own. This is something we've been discussing with colleagues at the university recently. In the future, universities may not really be about teachers lecturing; teachers may just set the exams, and the inspiration for the exam questions can come from the students. There's no problem with that.

Coming back to the point: Beijing has actually done well here with the Yizhuang robotics events — "two conferences and one competition," comprising a robot marathon, a robotics conference, and a robot sports meet. Originally the focus was on testing robots' physical capabilities, but intelligence-related tests are now gradually being added.

However, such events are infrequent — perhaps once or twice a year. So I really value the possibility of real-device testing that can be run anytime, anywhere, in a relatively fair environment and test scenario.

Making this a high-frequency, online, always-available activity is definitely worth continuing to develop.

More than a dozen "good students" (participants) are actually building the RoboChallenge platform together. Everyone on the platform approaches it with a public-interest mindset, and they compete within that environment.

When it can take on a more fully public-interest form is something we can keep discussing — building a public-interest organization inherently takes time. But from its inception, to high-frequency real-device testing, to everyone contributing scenarios — industry, robotics companies, and academia jointly defining them — and then building a fully open-source ecosystem: this whole process will be a huge boost to the industry, and it is definitely worth continuing to pursue.

Tang Wenbin: I'd like to interject. We did debate this internally when we released the DM0 model. RoboChallenge was released jointly with Hugging Face, and although many peers participated, we were still the initiator. So we debated for a while whether Force Intelligence should submit its own model and whether it should publish the results. We had a heated discussion with differing opinions.

Wang Yu: OpenAI also has its own benchmarks, and they publish their own results after testing. I don't think there's any contradiction in that.

Tang Wenbin: Because OpenAI did the same thing, we were quite comfortable with it. This time, our requirement for the team was that the open-source work must be very thorough. We want to ensure that everyone who downloads our code, DM0 model, and Dexbotic (development framework) can directly submit it to RoboChallenge and receive the current score. This is a very transparent matter for us, so everyone should just do it openly and honestly.

Host: Let's end with some predictive questions. Looking towards 2026, what are the most anticipated developments or challenges in the field of embodied intelligence, and what will be the most anticipated outcomes?

Wang Yu: From the perspective of the Department of Electronic Engineering, I really hope to develop a cloud-edge-device collaborative system that can transform architecture and build infrastructure for a symbiotic environment between machines and humans. I think a prototype of this solution might emerge this year, and then we can discuss it together.

Wang Zhongyuan: Although I have high expectations for hardware and models, what I am most looking forward to in 2026 is probably the standards.

Because I think that the current ecosystem, including hardware standards, data standards, and model output standards, is very fragmented, I am really looking forward to some breakthroughs in standards in 2026, which could greatly promote the development of the entire industry.

Because our Academy participated in RoboChallenge, I was deeply impressed. When I was talking with Wenbin, we discussed how everyone collects their own data, with inconsistent formats and even inconsistent code. This makes models very hard to reproduce and validate. To be honest, we recently tried to download and validate many models released at home and abroad, and found them quite difficult to deploy — largely because standards are not unified.

In 2026, since our Academy also sits on the relevant standards committee, there is a high probability we will take the lead in developing standards for embodied intelligence.

Jiang Daxin: I was very inspired by Wenbin's sharing. If we can achieve zero-shot generalization in any scenario, any task, and any goal, that will be the "AGI moment".

In 2026, what I most look forward to is collaboration between Force Intelligence and Jieyue Xingchen to achieve the ChatGPT moment Wenbin described: completing tasks reliably and effectively.

If Wenbin feels that this task is not challenging enough, then we will achieve the ChatGPT moment in the first half of the year and the DeepSeek moment in the second half of the year.

Gao Jiyang: I think we still expect to see a clear growth path in terms of productivity by 2026. Then, within two years, we hope to see a single scenario achieve shipments of tens of thousands of units. I think this is something the entire industry urgently needs.

Tang Wenbin: My goal is a little smaller than Gao Jiyang's. I hope to see a thousand units running continuously in one scenario.

What I want to stress is that continuous operation is the most crucial thing — it shouldn't come from piling on more scenarios. If a thousand units run continuously in one scenario, then to some extent we have already closed the loop at scale for that scenario. I think we have a chance of that in 2026.

Cover image source | AI generated
