DeepSeek gave AI a cyber finger, so it could see.

MarsBit
05-01

Text | Alphabet AI

On the eve of the May Day holiday, DeepSeek suddenly released a report on visual multimodal technology.

Before opening it, I had a rough expectation: the report would be about how far the model could see, and how clearly.

After all, over the past year, multimodal models have basically been moving in this direction. OpenAI talks about thinking with images, allowing models to crop, enlarge, and rotate images during inference; Gemini and Claude are also trying to enable models to handle higher resolution and more complex visual inputs.

The common assumption is that the more detailed the model's view, the stronger the visual reasoning will become.

But after reading DeepSeek's report, you'll find that they've taken a completely different path.

DeepSeek didn't focus on "making the model see more pixels"; they focused on a more fundamental problem.

Even if the model has clearly seen the target, how can you guarantee that the model and you are referring to the same thing during the reasoning process?

This is actually the most easily overlooked fatal flaw in multimodal reasoning.

When humans look at images, they can use their fingers to mark objects. For example, they can say "this person is so-and-so" or "that person is so-and-so." But how can a model know which one you're referring to?

The model can only use language to say "the one on the left," "the one above," or "this line." Once the visuals become more complex, the language references will drift, and the reasoning will fall apart.

So DeepSeek suggested, why not just give the model a "finger"?

It turns points and bounding boxes into the basic units of model thinking, allowing the model to reason while pointing at objects with this cyber finger.

01 From Continuous Vision to Discrete Symbols

In this technical report, DeepSeek raised a very interesting question. They believe that the real challenge of multimodal models is not seeing images, but rather consistently pointing to the same visual object during continuous reasoning.

For example, you tell your friend, "In the market, Grandma Zhang's stall sells the freshest vegetables." But there are so many elderly people in the market, which one is Grandma Zhang?

But if you simply point and say "that's it," your friend will immediately understand.

DeepSeek named this problem the "Reference Gap".

Over the past year, almost all cutting-edge multimodal models have been addressing the "perception gap" problem.

Imagine a photograph placed in front of you. If it is too blurry or its resolution is too low, you might not be able to make out small print or distant details. The same applies to AI: if the input image quality is insufficient or the image is processed poorly, the model "cannot see clearly." That is the perception gap.

Models like GPT, Claude, and Gemini continuously improve resolution by introducing high-resolution cropping, dynamic segmentation, and multi-scale processing, all in order to allow the model to see more details.

This direction is certainly valuable, but DeepSeek points out in its report that even if the model sees things very clearly, logical breakdowns can still occur in complex spatial reasoning tasks.

The problem lies in natural language itself.

There are more than a dozen dogs in the photo. If you say "the dog on the left", the model can't understand which one you're referring to.

Worse still, if you ask the model to count the dogs in a photo, it can easily lose track during reasoning of which dogs it has already counted and which it has not.

The report also mentions extreme cases such as maze navigation, where pure language simply cannot accurately describe irregularly shaped paths and complex topological relationships.

Language, as a referential tool, is inherently ambiguous in continuous visual space. It excels at abstract concepts and causal relationships, but its expressive power is fundamentally limited in terms of spatial positioning and topological relationships.

But DeepSeek's model is, at its core, a general-purpose language model, so how can it get around this limit of language?

Thus, the "finger" mentioned at the beginning of the article came about.

Their core concept is "Visual Primitives," which specifically elevates the two most basic spatial markers in computer vision—bounding boxes and points—to the "smallest units of thought."

Older multimodal models could also draw bounding boxes and label objects, but they only showed you the final result as proof that "I found it." It's like handing in the answer to an exam question without writing out the solution.

Some studies have also had the model draw boxes during its thinking process, but only in order to "see more accurately"; the boxes are just an auxiliary tool. It's like scratch paper in a math problem: it helps you calculate more clearly, but it is not part of the solution itself.

DeepSeek does something completely different.

They embedded these spatial markers directly into the model's reasoning process, making them an integral part of the reasoning. When the model thinks, it not only describes "I saw a dog" in words, but also outputs "I saw a dog, it is here: [[x1,y1,x2,y2]]".

DeepSeek calls this mechanism "point while it reasons."


Every step of the model's thinking is anchored to the specific coordinates of the image.
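To make the idea of anchored reasoning concrete, here is a minimal sketch of how boxes written inline in a reasoning trace could be read back and reused, for instance to avoid counting the same dog twice. It assumes the textual format [[x1,y1,x2,y2]] from the example above; the function names and the sample trace are hypothetical and are not taken from DeepSeek's report.

```python
import re

# Assumed inline format, copied from the article's example: [[x1,y1,x2,y2]]
BOX_PATTERN = re.compile(
    r"\[\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]\]"
)


def extract_boxes(trace):
    """Return every bounding box written in a reasoning trace, in order."""
    return [tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(trace)]


def count_distinct(boxes):
    """Count unique boxes, so an object cited twice is only counted once."""
    return len(set(boxes))


# Hypothetical trace: the first dog is referred to twice.
trace = (
    "I see a dog, it is here: [[12,40,118,160]]. "
    "A second dog is here: [[130,52,240,170]]. "
    "The first dog [[12,40,118,160]] is lying down."
)

boxes = extract_boxes(trace)
print(boxes)                  # three references
print(count_distinct(boxes))  # two distinct objects
```

Because each reference carries its own coordinates, the second mention of the first dog resolves to the same anchor rather than to an ambiguous phrase such as "the dog on the left."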

The technical report gives a maze example: the model sets out from the starting point, explores, backtracks, and tries again, and finally outputs a complete coordinate path in which each coordinate corresponds to a point traversed in the maze.

In this way, the model won't "get lost" during inference. It won't be confused about what it's saying or referring to. Each visual object has a clear spatial anchor point, making the inference process traceable and verifiable.

This technological approach presents an interesting contrast to OpenAI's direction.

OpenAI explicitly mentions the concept of "thinking with images" in the official introductions of o3 and o4-mini, meaning that the model can incorporate images into the inference chain and process them through methods such as cropping, scaling, and rotation. The focus of this approach is to make the image itself part of the thought process, allowing the model to generate new images and to modify or otherwise manipulate existing ones during inference.

OpenAI's approach emphasizes general-purpose capabilities, with vision, code, search, documentation, and tool usage working together. The model possesses a powerful "vision workbench" that can flexibly handle various vision tasks.

DeepSeek's approach is more "symbolic." It incorporates coordinates into the thought process. The model explicitly writes the coordinates of bounding boxes and points in the inference text, turning visual objects into reusable anchor points for inference.

As a result, OpenAI's visual reasoning happens largely inside the model: users see only the final answer and any necessary explanation, while the intermediate visual processing remains a black box. DeepSeek, by contrast, deliberately makes the intermediate visual anchors explicit, so the reasoning process is transparent and can be followed step by step.

The advantage of DeepSeek's approach is that the reasoning process is easier to train, inspect, and score, and it is easier to design format, quality, and task-level rewards for it. In tasks like mazes and path tracing especially, it allows more granular feedback on path validity, trajectory coverage, and similar properties.
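As an illustration of why an explicit coordinate path is easy to score, the sketch below checks a predicted maze path for validity and measures how much of a reference path it covers. This is a simplified stand-in, not DeepSeek's reward design: it assumes a grid maze with cells instead of image coordinates, and the function names, grid, and paths are all hypothetical.

```python
def path_is_valid(grid, path, start, goal):
    """True if path runs from start to goal through open cells,
    moving one step up, down, left, or right at a time.
    grid[r][c] == 1 marks a wall, 0 an open cell."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    rows, cols = len(grid), len(grid[0])
    for r, c in path:
        if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] == 1:
            return False
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:
            return False
    return True


def coverage(path, reference):
    """Fraction of the reference path's cells that the predicted path visits."""
    visited = set(path)
    return sum(cell in visited for cell in reference) / len(reference)


grid = [
    [0, 0, 1],
    [1, 0, 1],
    [1, 0, 0],
]
reference = [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]
through_walls = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]

print(path_is_valid(grid, reference, (0, 0), (2, 2)))      # valid path
print(path_is_valid(grid, through_walls, (0, 0), (2, 2)))  # crosses walls
print(coverage(through_walls, reference))                  # partial credit
```

A text-only answer such as "go right, then down" offers nothing comparable to check cell by cell, which is the practical point behind the claim that coordinate paths support finer-grained feedback.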

The model not only learns to output the correct answer, but also learns how to reason using visual primitives.

02 Efficiency is the key

There is an easily overlooked but extremely important detail in DeepSeek's report: their model uses far fewer tokens than other cutting-edge models when processing images.

The report includes a comparison chart showing the number of tokens consumed by different models when processing an 800×800 resolution image.

Gemini-3-Flash uses approximately 1,100 tokens, Claude-Sonnet-4.6 approximately 870, GPT-5.4 approximately 740, and Qwen3-VL approximately 660. DeepSeek uses approximately 361, of which only about 90 entries are retained in the KV cache.

The difference is significant. DeepSeek uses only about one-third as many tokens as Gemini, and the number of entries it keeps in the KV cache is less than one-tenth of Gemini's token count.

How is this extreme efficiency achieved?

DeepSeek uses a mechanism called "Compressed Sparse Attention" (CSA).

You can think of it this way: if you show a family photo to a friend, you wouldn't say, "There's a red area starting from the 237th pixel from the left...", you would directly say, "My mom is on the left, and my dad is on the right."

DeepSeek-ViT first compresses the image into fewer visual tokens, and CSA then further compresses the representation of these visual tokens in the KV cache.

This mechanism was used in the DeepSeek-V4-Flash model and is now being applied to visual multimodal models.

The specific compression process is as follows. A 756×756 image contains 571,536 pixels. The ViT first divides the image into 14×14-pixel patches, producing a 54×54 grid of 2,916 patch tokens. A 3×3 spatial compression then merges every 9 adjacent tokens into 1 along the channel dimension, leaving 324 visual tokens.

These 324 tokens are prefilled into the large language model. Finally, the CSA mechanism compresses these visual tokens in the KV cache by a factor of 4, ultimately retaining only 81 entries.

From 571,536 pixels to 81 KV cache entries, the overall compression ratio reached 7,056 times.
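The arithmetic of this pipeline can be reproduced in a few lines. The sketch below uses only the figures quoted in the article (14-pixel patches, 3×3 merging, 4× KV compression); the function name and the use of floor division for sizes that do not divide evenly are assumptions, not details confirmed by the report.

```python
def visual_token_budget(width, height, patch=14, spatial=3, kv_factor=4):
    """Return (patch_tokens, visual_tokens, kv_entries) for one image.

    Floor division is an assumption for image sizes that are not
    exact multiples of the patch size.
    """
    patch_tokens = (width // patch) * (height // patch)
    visual_tokens = patch_tokens // (spatial * spatial)
    kv_entries = visual_tokens // kv_factor
    return patch_tokens, visual_tokens, kv_entries


patch_tokens, visual_tokens, kv_entries = visual_token_budget(756, 756)
pixels = 756 * 756

print(patch_tokens, visual_tokens, kv_entries)  # 2916 324 81
print(pixels // kv_entries)                     # 7056
```

Under the same floor-division assumption, an 800×800 input gives 361 visual tokens and 90 KV entries, which lines up with the approximate figures quoted for the comparison chart.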

Most large AI companies use brute force and pile on computing resources, while DeepSeek makes its trade-offs at the level of information theory, keeping only the information that matters most.

The most direct result is that inference becomes much faster.

The number of image tokens directly affects the model's inference latency. During autoregressive generation, for each new token generated, the model needs to perform attention calculations over the key-value cache of all previous tokens. If an image occupies 1,000 tokens, then attention must be applied to all 1,000 of them each time a new token is generated. If it occupies only 90 entries, the computational load is significantly reduced.

For applications requiring real-time response, such as robot vision, autonomous driving, and real-time video analytics, the improvement in inference speed plays a decisive role.

And it also uses less memory.

The key-value (KV) cache is a memory bottleneck for large model inference. Especially with long contexts or batched inference, the KV cache consumes a significant amount of GPU memory. DeepSeek compresses the KV cache for an image's visual tokens to about 90 entries, meaning it can process more images or handle longer multi-turn dialogues on the same hardware.
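A rough back-of-the-envelope estimate shows how cache entries translate into memory. The sketch below uses the standard sizing for a conventional attention layout (one key and one value vector per layer, per KV head, per cached entry). The layer count, head count, head dimension, and 2-byte precision are hypothetical placeholders, not DeepSeek's configuration, and CSA may store its compressed entries differently.

```python
def kv_cache_bytes(entries, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Memory for cached keys and values of `entries` positions.

    All default dimensions are illustrative assumptions.
    The leading factor 2 accounts for storing both K and V.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * entries


for entries in (1000, 90):
    mib = kv_cache_bytes(entries) / 2**20
    print(f"{entries:>5} entries -> {mib:.2f} MiB per image")
```

Whatever the real dimensions are, the cost in this layout is linear in the number of cached entries, so cutting an image from roughly 1,000 entries to 90 shrinks its cache footprint by the same factor, and the attention reads per generated token shrink along with it.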

This is crucial for practical deployment. Many companies' multimodal models perform well in the lab, but encounter cost issues when deployed in real-world scenarios. The more tokens consumed per image, the higher the inference cost, and the fewer concurrent users can be supported. DeepSeek's efficiency advantages are amplified during large-scale deployments.

This also indirectly increases the model's context capacity.

If an image requires 1,000 tokens, then only about 128 images fit in a 128k context window. If it requires only 300 tokens, more than 400 images fit. This is crucial for scenarios requiring multi-image dialogue, long video analysis, and large-scale document understanding.
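The capacity estimate is plain integer division. A minimal sketch, assuming a 128k window means 128,000 tokens; the function name and the optional reserved_text_tokens parameter (room set aside for the prompt and the reply) are illustrative additions.

```python
def images_per_context(context_tokens, tokens_per_image, reserved_text_tokens=0):
    """How many images fit once some tokens are set aside for text."""
    return (context_tokens - reserved_text_tokens) // tokens_per_image


CONTEXT = 128_000  # assumed meaning of "128k"

print(images_per_context(CONTEXT, 1000))  # 128
print(images_per_context(CONTEXT, 300))   # 426
print(images_per_context(CONTEXT, 300, reserved_text_tokens=8_000))  # 400
```

In practice the text of the conversation also consumes part of the window, so the real number of images is somewhat lower than the upper bound, but the ratio between the two cases stays the same.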

DeepSeek's models can process more images in a single conversation, compare and analyze dozens or even hundreds of images, and track long-term changes in videos.

The most critical factor is the training cost.

While the report primarily focuses on inference efficiency, this compression mechanism is equally effective during the training phase. Fewer visual tokens mean a smaller computation graph, faster training speeds, and lower hardware requirements.

DeepSeek has always been known for "achieving better results with fewer resources." From reinforcement learning training in R1 to the MoE architecture in V4, and now to visual multimodal learning, this philosophy of prioritizing efficiency has been consistently applied.

But here's a crucial question: Will compression result in information loss?

DeepSeek does not deny that compression leads to information loss. Its argument is that, for this set of spatial reasoning and counting tasks, the compressed representations remain sufficiently effective.

Each step of compression preserves the information most important to inference while discarding redundancy and noise.

In fact, the visual primitive mechanism of DeepSeek mentioned earlier is itself a form of information compression. A bounding box can accurately locate an object with just four numbers, and a point can be marked with just two numbers. The information density carried by these discrete symbols is far higher than that of the original pixels.

The experimental results show that this compression does not impair performance; on the contrary, it improves performance in certain tasks.

This suggests that for many visual reasoning tasks, the bottleneck is not that the vision is not clear enough, but that a suitable representation method has not been found.

This efficiency advantage also proves that multimodal intelligence does not necessarily require larger models, more computing power, or higher costs.

Since its inception, DeepSeek has always had an underlying principle: "True intelligence lies not in computing power, but in understanding the essence of a problem."

Once you truly understand what visual reasoning requires, you won't need so many tokens. Once you find the right representation method, you won't need such a large model.

From this perspective, DeepSeek's extreme efficiency is not the goal, but a byproduct. The real goal is to find the correct paradigm for visual reasoning. Efficiency merely proves that this paradigm is correct.

03 Unfinished business

In the limitations section of its report, DeepSeek candidly listed several issues with its current approach. These are not minor technical flaws, but rather point to the next stage of visual reasoning.

The first problem is trigger word dependency.

The report explicitly states that the current ability to "think with visual primitives" requires explicit trigger words to activate. In other words, the model cannot yet naturally and autonomously decide "when to draw a box or place a point."

This means that the model has not yet truly learned to determine when to use visual primitives and when language is sufficient.

Ideally, the model should be able to make autonomous decisions based on the nature of the task. For example, when a user asks, "Count how many dogs are in the picture," the model should automatically switch to visual primitive mode and use bounding boxes to assist in counting.

Technically, this requires building a metacognitive layer within the model. This metacognitive layer can assess the complexity of the current task, determine whether pure language reasoning is sufficient, and decide whether to invoke visual primitives.

DeepSeek has not yet implemented this metacognitive layer, but they have identified the direction. Future versions may allow the model to learn to autonomously determine its inference strategy, rather than relying on external triggers.

The second problem is resolution limitations.

The report mentions that, due to limitations in input resolution, the model does not perform well enough in fine-grained scenarios, and the output visual primitives are sometimes not accurate enough.

This issue is related to DeepSeek's efficiency-first strategy. To control the number of tokens, they limit the number of visual tokens to between 81 and 384. Images that would fall outside this range are rescaled.

This design is reasonable in most scenarios, but it encounters bottlenecks in some tasks that require extremely high precision. For example, medical image analysis needs to identify tiny lesions, and industrial quality inspection needs to detect minute flaws; these scenarios have very high resolution requirements.

DeepSeek mentions in its report that this problem can be solved by integrating existing high-resolution methods. In other words, their visual primitive framework and traditional high-resolution cropping methods are not contradictory, but complementary.

I think DeepSeek could come up with a hybrid solution.

Specifically, for most routine tasks, compressed visual representations and visual primitive inference are used to maintain high efficiency. For local regions requiring fine-grained analysis, high-resolution cropping is dynamically invoked to extract more detailed visual information. This maintains overall efficiency while meeting local accuracy requirements.

The key to this hybrid approach is teaching the model to determine which regions require high-resolution processing. This brings us back to the earlier question of metacognition.

The third issue is cross-scenario generalization.

The report mentions that using points as visual primitives to solve complex topological reasoning problems remains difficult, and the model's cross-scene generalization ability is limited.

This problem is particularly evident in maze navigation and path tracing tasks. Although DeepSeek achieved 66.9% and 56.7% accuracy on its self-built test sets, surpassing other models, these figures are still far from sufficient.

More importantly, these tasks were all trained and tested on synthetic data. The mazes were generated algorithmically, and the path-tracing curves were also plotted procedurally. When the model encounters real-world topological reasoning problems, such as planning paths on real maps or tracing connections in complex pipeline graphs, its performance may degrade.

DeepSeek's approach leverages large-scale, highly diverse data to enhance generalization capabilities. They crawled 97,984 data sources, rigorously filtering them to retain 31,701, ultimately obtaining over 40 million samples. For maze and path-tracing tasks, they also designed various topologies, visual styles, and difficulty levels to cover as many variations as possible.

However, data diversity is only one aspect of generalization ability. Does the model truly understand the essence of topological reasoning, or has it merely memorized patterns from the training data?

Furthermore, DeepSeek's visual primitives are a new representation system, requiring specialized data formats, training processes, and evaluation methods. This is not fully compatible with the existing multimodal ecosystem.

Most multimodal datasets and benchmarks are designed based on the traditional "image + text" paradigm, without considering visual primitives. To evaluate DeepSeek models on these benchmarks, either the visual primitive feature needs to be disabled, or the evaluation method needs to be redesigned.

If other researchers want to reproduce or improve this work, they need to rebuild the entire data and training process, which is quite challenging.

The fact that DeepSeek was able to discuss these issues in its report demonstrates that they have a clear understanding of their work.

This may be more valuable than providing a perfect answer. Because what truly drives social progress is often not the answer, but the question.
