Strong generation, weak reasoning: GPT-4o’s visual shortcomings

36kr · 04-21

If you ask an AI to draw a dog on the "left" side after telling it in advance that "left" now means "right", do you think it would understand?

Recently, a new study from UCLA used a series of carefully designed experiments to reveal GPT-4o's shortcomings in image understanding and reasoning: it can draw beautifully, but it may not truly understand what you mean.

The paper's main thrust is straightforward: GPT-4o's drawing ability is indeed impressive, but when it comes to understanding images, contextual reasoning, and multi-step logical chains, it still has obvious limitations.

It captures the subtle awkwardness of an AI that "looks capable, but is actually still lacking".

As usual, I'll explain the three major experimental parts one by one, hoping to help you fully understand what this research discovered.

01. Failure in Following Global Rules

This part is quite interesting, similar to joking with a friend: "From now on, when I say left, I actually mean right", and then asking them to "take a step to the left" to see if they'll actually go right.

UCLA researchers set a similar trap for GPT-4o: "From now on, 'left' means 'right'", "all numbers should be reduced by 2", and then asked it to "draw a dog on the left side" and "draw 5 birds".

They expected the AI to infer and adapt. Instead:

The dog was still on the left, there were still 5 birds, completely ignoring the previously redefined rules.

What does this indicate?

GPT-4o still interprets instructions literally when generating images; global redefinitions set up earlier in the conversation never penetrate its "drawing brain".

You want it to be flexible, but it only faithfully executes the surface meaning of each instruction, far from human cleverness.
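The rewriting step the researchers expected GPT-4o to internalize is mechanically simple. Here is a minimal Python sketch of it; the function name and the word-level substitution are my own illustration, paraphrasing the rules as described in the article rather than the paper's actual setup:

```python
def apply_global_rules(instruction: str, count: int) -> tuple[str, int]:
    """Rewrite a drawing request under the study's redefinitions:
    'left' now means 'right' (and vice versa), and all numbers
    should be reduced by 2."""
    swap = {"left": "right", "right": "left"}
    words = [swap.get(w, w) for w in instruction.split()]
    return " ".join(words), count - 2


# "Draw a dog on the left side" plus "draw 5 birds" should become
# a dog on the right and 3 birds once the rules are applied.
print(apply_global_rules("draw a dog on the left side", 5))
# → ('draw a dog on the right side', 3)
```

GPT-4o's failure is that it never performs this kind of rewrite: the redefinition sits in the context, but the drawing request is executed verbatim.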

02. Image Editing: Shallow Semantic Understanding Exposed

The second part of the test was more challenging, with researchers asking GPT-4o to edit images.

For example,

"Only change the horse's reflection in the water to a lion, without touching the horse itself."

As a result, both the horse and its reflection changed.

Another example,

"Only delete the sitting person in the picture."

The result was that standing background people were also removed.

These examples directly exposed a problem:

GPT-4o fails to grasp the nuances of "local modification" and "semantic scoping".

It cannot precisely distinguish "reflection" from "entity", or "sitting" from "standing", often over-editing and modifying the wrong regions.

In plain terms, AI's image editing understanding is far from the fine-grained human ability to "look at an image and understand the scene".

It's like asking a Photoshop novice to edit an image: no plan, purely guesswork.
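What these prompts ask for amounts to filtering the edit targets by a semantic predicate before touching anything else. A hedged sketch of that selection step, with an invented object list (the attributes and IDs are illustrative, not taken from the study):

```python
def select_targets(objects: list[dict], predicate) -> list[dict]:
    """Return only the objects a scoped edit should touch,
    leaving everything else in the scene untouched."""
    return [o for o in objects if predicate(o)]


# A hypothetical scene with two people.
people = [
    {"id": 1, "pose": "sitting"},
    {"id": 2, "pose": "standing"},
]

# "Only delete the sitting person": the edit set should contain id 1 only,
# yet GPT-4o's edits effectively swept up id 2 as well.
to_delete = select_targets(people, lambda o: o["pose"] == "sitting")
```

The point of the sketch is the separation of concerns: identify the scoped target first, then edit. GPT-4o's behavior suggests it skips the first step and edits everything that loosely matches the instruction.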

03. Multi-step Reasoning and Conditional Logic: Completely Falling Short

The most critical weakness appeared in "multi-step reasoning" and "conditional judgment".

For instance,

First, ask GPT-4o to draw a dog and a cat, then tell it: "If there's no cat, replace the dog with a cat and move it to the beach."

But in fact, a cat was already present in the first image.

Logically, AI should do nothing in this case.

However, it still replaced the dog with a cat and moved the entire scene, completely misreading the condition and producing an illogical result.

Similar examples abound, with AI often unable to understand complex conditions or simply "executing every instruction" regardless of potential conflicts.

This confirms a core issue:

GPT-4o lacks context-sensitive reasoning ability and cannot make intelligent judgments in complex image editing tasks.

In the chain of "understand the premise, judge the logic, then act", it clearly falls well behind.

Overall, current AI is more like a "sophisticated instruction machine": it draws whatever you ask, but making it "understand rules, comprehend scenes, and reason by analogy" will take several more rounds of evolution.

This reminds me of when AI first learned to generate text, and everyone thought it "could write and speak", but when asked for details, to tell stories, or to maintain logical coherence, it would still produce bugs of various sizes.

The predicament GPT-4o faces in the image domain is almost identical to the early text AI:

It can draw, but does not necessarily "understand"; it can modify, but not necessarily "precisely"; it can follow instructions, but not necessarily "reason by analogy". This may be the gap between us and an AI that truly "understands the world", one that warrants both caution and anticipation.

Perhaps the next technological breakthrough will start from here. But for now, we're not quite there yet.

via https://the-decoder.com/gpt-4o-makes-beautiful-images-but-fails-basic-reasoning-tests-ucla-study-finds/

This article is from the WeChat public account "Big Data Digest" (ID: BigDataDigest), authored by Digest Bacteria, published by 36kr with authorization.
