Can GPT-4o build Lego? The first multi-step spatial reasoning benchmark is here: closed-source models lead, but still far behind humans

36kr
04-23

GPT-4o can draw in the Ghibli style and take "selfies", but can it assemble Lego correctly?

Have you ever wondered:

Do multimodal large language models truly have the ability to understand and reason about spatial structures?

How do existing MLLMs actually perform on multi-step spatial reasoning tasks?

In recent years, with the rapid development of multimodal large language models, capabilities such as visual understanding, text-image alignment, and language generation have seen continuous breakthroughs, making human-like assistants seem within reach.

But what about scenarios that require multi-step spatial perception and logical reasoning?

For example, in robot assembly, autonomous driving decisions, and 3D object understanding, how much real "spatial intelligence" do multimodal large models actually have?

To this end, the Shanghai Artificial Intelligence Laboratory, in collaboration with Tongji University and Tsinghua University, proposed a new benchmark, LEGO-Puzzles, which uses Lego assembly as its task medium to systematically evaluate how existing multimodal large language models (MLLMs) actually perform on multi-step spatial reasoning tasks.

The evaluated models include GPT-4o, Gemini-2.0-Flash, and open-source models with image generation capabilities such as Emu2, GILL, and Anole.

The results show that, on the image generation tasks, only Gemini-2.0-Flash reached a medium or better level on both metrics (appearance, App: 2.15; instruction following, IF: 1.17), maintaining a good balance between structural fidelity and instruction execution.

In comparison, GPT-4o's generation process is more like scene reconstruction based on instruction semantics, rather than step-by-step editing of input images. This strategy allows it to perform reasonably well in instruction understanding, but shows significant shortcomings in structural restoration, with generated images often deviating from the original image in details and overall structure, resulting in a significantly lower appearance score compared to Gemini-2.0-Flash.

It should be noted that this evaluation used the version of GPT-4o available before March 6, 2025; the team is also testing the new GPT-4o's image generation capabilities and will update the evaluation accordingly.

Emu2's generated images are highly similar to the original in appearance but barely reflect any of the requested operations, showing typical "image reconstruction" behavior with little response to the task instructions.

GILL and Anole are essentially ineffective in all subtasks, with generated results unrelated to the target structure, and IF scores close to 0, indicating they lack effective capabilities in spatial understanding and execution.

Correct in one step, but confused in five? Multi-step reasoning makes models "blank out"

To probe the reasoning ability of MLLMs on complex spatial sequence tasks more deeply, the team introduced an extended experiment: Next-k-Step. Building on the original single-step task "Next-Step", it requires the model to identify the correct final assembly state after several consecutive assembly operations, simulating multi-step spatial construction reasoning closer to real-world scenarios.

In the experimental setup, the team controlled the number of assembly operations k from 1 to 5, gradually increasing the reasoning chain length and placing higher demands on the model's coherent modeling and state memory capabilities. The input includes the current LEGO state, the next k component images, the corresponding target image, and candidate options; the model needs to determine which is a reasonable assembly result. The team also introduced Chain-of-Thought (CoT) prompts to explore whether "step-by-step thinking" could improve reasoning performance in visual scenarios.
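To make the setup concrete, here is a minimal sketch in Python of how such a Next-k-Step query could be assembled. The function and file names are hypothetical illustrations of the protocol described above, not the team's actual evaluation code.

```python
# Hypothetical sketch of building a Next-k-Step multiple-choice query.
# Names (build_next_k_step_query, file paths) are illustrative only.
from typing import List


def build_next_k_step_query(state_img: str, piece_imgs: List[str],
                            option_imgs: List[str], use_cot: bool) -> dict:
    """Pack the current LEGO state, the next k piece images, and the
    candidate result images into a single multimodal query."""
    instruction = (
        f"Starting from the current assembly state, add the following "
        f"{len(piece_imgs)} pieces in order. Which candidate image shows "
        f"the correct final state?"
    )
    if use_cot:  # Chain-of-Thought variant of the prompt
        instruction += " Think step by step before giving your final answer."
    return {
        "images": [state_img, *piece_imgs, *option_imgs],
        "text": instruction,
    }


# Example: a k = 3 query with Chain-of-Thought prompting enabled.
query = build_next_k_step_query(
    state_img="state.png",
    piece_imgs=["piece_1.png", "piece_2.png", "piece_3.png"],
    option_imgs=["opt_a.png", "opt_b.png", "opt_c.png", "opt_d.png"],
    use_cot=True,
)
# answer = ask_mllm(query)  # hypothetical call to the model under evaluation
```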

The results show that most models still have some reasoning ability when k=1, with GPT-4o reaching 75% (using CoT), and Gemini-2.0-Flash as high as 85%.

However, as k increases, accuracy drops sharply: GPT-4o almost completely fails at k=4 and k=5, with accuracy falling to 0-5%.

Even with CoT prompts, most models cannot maintain an effective reasoning path once k > 2, indicating that the CoT technique commonly used in language models offers only very limited help on visual multi-step spatial tasks.

Notably, Qwen2.5-VL-72B performs relatively stably across different step counts, maintaining an accuracy of around 65% and demonstrating some structural memory capability, while InternVL-2.5-78B's accuracy is close to random in most settings.

These experiments reveal that current mainstream MLLMs suffer from obvious "reasoning decay" when handling multi-step spatial logic.

Summary

LEGO-Puzzles is a new benchmark designed specifically to evaluate the capabilities of multimodal large models on complex spatial reasoning tasks, covering 1,100+ task instances across 11 subtask categories ranging from static structure recognition to multi-step sequential reconstruction. The dataset supports both VQA and image generation tasks, giving models a complete evaluation pipeline with multimodal inputs and diverse outputs.
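As a rough illustration, the sketch below shows the kind of information a single VQA or generation instance would need to carry. The field names are hypothetical and do not claim to match the released data format.

```python
# Hypothetical schema for one LEGO-Puzzles task instance (illustrative only).
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LegoPuzzleTask:
    subtask: str                         # one of the 11 categories, e.g. "Next-Step"
    images: List[str]                    # input image paths (states, pieces, candidates)
    question: str                        # VQA question or generation instruction
    answer: Optional[str] = None         # ground-truth option for VQA subtasks
    target_image: Optional[str] = None   # ground-truth image for generation subtasks
```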

The team conducted a systematic evaluation of 20+ current mainstream multimodal large models, comprehensively revealing their performance bottlenecks in three-dimensional space understanding, multi-step spatial reasoning, and instruction-driven image generation. The experiments further introduced mechanisms like Next-k-Step and CoT reasoning to probe the stability and generalization ability of models as the reasoning chain deepens.

LEGO-Puzzles has been integrated into VLMEvalKit, supporting one-click evaluation to quickly locate the spatial reasoning capability shortcomings of models.
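For reference, launching such an evaluation from a VLMEvalKit checkout typically looks like the sketch below. The dataset and model identifiers ("LEGO-Puzzles", "GPT4o") are assumptions; check the names registered in your VLMEvalKit version before running.

```python
# Minimal sketch of a "one-click" run via VLMEvalKit's standard run.py entry
# point, assuming the toolkit is installed and this is executed from its
# repository root. Identifiers below are assumptions, not confirmed names.
import subprocess

subprocess.run(
    ["python", "run.py",
     "--data", "LEGO-Puzzles",   # assumed dataset name in VLMEvalKit
     "--model", "GPT4o",         # assumed model name in VLMEvalKit
     "--verbose"],
    check=True,
)
```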

Paper:

https://arxiv.org/abs/2503.19990

Github:

https://github.com/Tangkexian/LEGO-Puzzles

HomePage:

https://tangkexian.github.io/LEGO-Puzzles

This article is from the WeChat public account "Quantum Bit", author: Focusing on Cutting-Edge Technology, published by 36Kr with authorization.
