According to Beating, Xiaomi Auto has officially released a new framework for its assisted-driving world model, Xiaomi EV World Model, achieving deep coupling between the 3D reconstruction and video generation modules for the first time. Autonomous driving simulation has traditionally treated reconstruction and generation separately: a reconstruction module can restore a scene but cannot predict how it will change, while a generation module can predict the future but is prone to distortion and drift over long horizons. The team's JointWM architecture uses 3D geometric structure as a physical skeleton to anchor the scene, then uses the generation module to fill in visual detail and predict unobserved areas. It set multiple best results on mainstream benchmarks such as Waymo and nuScenes.
On the mechanism side, the reconstruction module, WorldRec, abandons the traditional per-pixel paradigm and represents the scene with sparse 3D query points, incrementally fusing them into a cross-view 4D Gaussian spatial skeleton; it reconstructs a 10-second video in 10 seconds. The generation module, WorldGen, builds on the geometric priors from the reconstruction module: it is constrained by the physical boundaries of the skeleton and is responsible only for generating plausible lighting and textures. For content outside those boundaries, in future frames and blind spots, it performs physical prediction through two-stage temporal training and a distribution matching distillation mechanism.
The full architecture reaches a generation speed of 0.19 seconds for a single view and 0.46 seconds for three views on an H20 GPU, and supports video generation up to one minute long. It achieved a PSNR of 28.48 in the Waymo reconstruction accuracy test and maintained a leading position in nuScenes zero-shot generalization.
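For readers unfamiliar with the metric, the PSNR figure quoted above can be read against the standard definition, PSNR = 10 · log10(MAX² / MSE). The sketch below is illustrative only: the function name `psnr`, the flat pixel lists, and the 8-bit peak value of 255 are assumptions for the example, not details of Xiaomi's evaluation, and the article does not state the unit (decibels is the usual convention).

```python
import math


def psnr(reference, rendered, max_value=255.0):
    """Peak signal-to-noise ratio (in dB) between two equal-length pixel sequences.

    Hypothetical helper for illustration; assumes 8-bit intensities by default.
    """
    if len(reference) != len(rendered) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error between the ground-truth and rendered pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Example: every pixel is off by 10 intensity levels on a 0-255 scale.
score = psnr([100, 100, 100, 100], [110, 110, 110, 110])
print(round(score, 2))
```

Because PSNR = 20 · log10(MAX / RMSE), a score of 28.48 dB corresponds to a root-mean-square pixel error of roughly 3.8% of the peak value, whatever intensity scale is used.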
In generation efficiency, the solution is 5.6 times faster than the autoregressive baseline Epona, and its spatiotemporal coherence ranks among the best of comparable algorithms. The research has already been deployed in three major Xiaomi Auto scenarios: delivering more than 100,000 high-quality synthetic data clips for perception model training, building a highly realistic closed-loop simulation environment that reproduces long-tail road conditions, and launching an assisted-driving training program that guides user operation with generated videos.
Xiaomi releases a world model framework that integrates reconstruction and generation, setting performance records on mainstream benchmarks.