Two years ago, you asked the most powerful AI image model of the day to generate a restaurant menu.
The result came back: the layout was beautiful and the color scheme was right, but every dish name was garbled.
Two years later, the same prompts went to ChatGPT Images 2.0, and the resulting menu was ready to print. Not only were the dish names accurate and the prices correct, but the layout and spacing looked as if a real designer had done them.
What happened in those two years? OpenAI's answer is that the long-unsolved problem has a name: the "intent gap," the distance between what a user pictures in their mind and what ultimately appears on screen.
The newly released ChatGPT Images 2.0 is aimed squarely at this gap. It is not a complete solution, but it narrows the gap enough for some users to start relying on it.
01 How does OpenAI define this update?
The official feature list for ChatGPT Images 2.0 includes faster generation, more accurate text rendering, multilingual support, and a new Thinking mode. But calling Images 2.0 simply a "better image generator" clearly underestimates OpenAI's ambitions.
OpenAI internally positions this product as specifically designed to bridge the "intent gap" in AI image generation. The intent gap refers to the long-standing divide between what the user wants and what is ultimately generated.
There is a fundamental shift behind this:
Previously: You describe → AI generates
Images 2.0: You describe → AI interprets your true intent → AI autonomously researches and plans the layout → AI generates images and checks them itself before delivery.
The two extra steps in the middle are the real focus of this release.
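To make the contrast concrete, here is a minimal Python sketch of the two pipelines. It is conceptual only: OpenAI has not disclosed how Images 2.0 works internally, and every function here (infer_intent, research_references, plan_layout, passes_self_check) is a hypothetical stub marking where a described step would sit, not a real API.

```python
# Conceptual sketch only. OpenAI has not disclosed Images 2.0's internals;
# every function below is a hypothetical stub.

def infer_intent(prompt: str) -> str:
    return prompt             # stub: stands in for intent interpretation

def research_references(intent: str) -> list[str]:
    return []                 # stub: stands in for web search / retrieval

def plan_layout(intent: str, references: list[str]) -> str:
    return "two-column grid"  # stub: stands in for layout planning

def render(prompt: str, layout: str = "") -> str:
    return f"<image: {prompt} | {layout}>"  # stub: stands in for the image model

def passes_self_check(image: str, intent: str) -> bool:
    return True               # stub: stands in for the pre-delivery review

def generate_previously(prompt: str) -> str:
    # Old pipeline: the prompt goes straight to the image model.
    return render(prompt)

def generate_images_2(prompt: str, n: int = 8) -> list[str]:
    # New pipeline, as OpenAI describes it: two extra steps in the middle,
    # plus a quality gate before delivery.
    intent = infer_intent(prompt)
    references = research_references(intent)
    layout = plan_layout(intent, references)
    drafts = [render(prompt, layout) for _ in range(n)]
    return [d for d in drafts if passes_self_check(d, intent)]
```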
02 Thinking Mode: What is it doing?
According to OpenAI, Thinking mode gives the model three new capabilities:
Online search: Upon receiving a task, the model can proactively retrieve relevant reference material instead of relying solely on its training data. That means it can handle visual tasks tied to brand guidelines, up-to-date product information, and current events.
Parallel generation of multiple candidates: From a single prompt, the model can generate up to eight coherent images that maintain character and object consistency. For producing comic storyboards, social media series, and brand assets at volume, this is a substantial workflow change.
Pre-delivery self-check: The model reviews its own draft against the requirements before producing the final output. This step simply did not exist before: whatever the AI generated was what you got, with no quality-control pass.
Combined, these three capabilities make the whole workflow feel less like a mechanical tool that "takes an instruction and emits an output" and more like an assistant designer; a sketch of how this might look through an API follows below.
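For developers, the natural access path would be OpenAI's existing Images API. The sketch below uses the real openai Python SDK and its images.generate endpoint, but the model name "chatgpt-images-2" is a placeholder invented for illustration: OpenAI has not published an API identifier for this release, and whether Thinking-mode features such as web search surface through this endpoint is an open question.

```python
# Hedged sketch: "chatgpt-images-2" is a placeholder model name; OpenAI has
# not published an API identifier for ChatGPT Images 2.0. The call pattern
# mirrors the existing Images API (e.g. gpt-image-1), which returns
# base64-encoded image data.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="chatgpt-images-2",  # hypothetical identifier
    prompt=(
        "A bilingual English/Chinese restaurant menu, two columns, "
        "eight dishes with prices, clean typographic hierarchy"
    ),
    n=8,                       # request several coherent candidates at once
    size="1024x1536",          # portrait layout suits a menu
)

# Save each candidate locally for review.
for i, image in enumerate(result.data):
    with open(f"menu_candidate_{i}.png", "wb") as f:
        f.write(base64.b64decode(image.b64_json))
```

If the menu needed current prices or brand assets, the Thinking-mode search described above would in principle handle that step before rendering, though OpenAI has not said how, or whether, API callers can control it.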
Thinking mode is currently available only to ChatGPT Plus, Pro, and Business users. Free users get the basic mode, whose generation logic and results differ. Many reviews have missed this distinction, which is why their comparative conclusions diverge so sharply.
03 Text Rendering: Why is this the most underestimated progress?
AI image generation has been developing for years, but text rendering has always been its most visible weakness. The reason lies in the architecture itself: traditional diffusion models treat an image as a field of pixels to denoise, with no notion of characters as discrete symbols, and text occupies only a tiny fraction of the training data, so the model has had almost no chance to "learn" how text works.
The significant advancement of Images 2.0 lies in its ability to handle tasks that were previously virtually impossible:
• Restaurant menus with completely correct dish names, prices, and layout
• Dense UI-screenshot reproductions with a clear text hierarchy
• Multilingual infographics, including Chinese, Japanese, Korean, Hindi, and Bengali
This last point is of paramount importance to Chinese users. A hidden language gap has long existed in AI visual content production: English-speaking users can use AI to create precise marketing posters and brand materials, while non-English-speaking users often encounter typos and garbled characters, forcing them to give up or seek human assistance.
If Images 2.0 truly and stably solves this problem, it will essentially be delivering industrial-grade visual production capabilities more equitably to non-English speaking users worldwide. For design professionals and SMEs in Southeast Asia, South Asia, and East Asia, this will represent a real-world workflow change.
Of course, there is still a gap between "significant progress" and "complete resolution." Test results show that rendering in non-English languages remains unstable, with a higher error rate under complex layouts compared to English.
04 Architectural Issues: Why Doesn't OpenAI Answer This?
At the media briefing before launch, OpenAI declined to answer questions about the underlying model architecture of Images 2.0, including whether it is a diffusion model or an autoregressive one.
The question matters: traditional diffusion models have a structural ceiling on text rendering, yet Images 2.0's text understanding and instruction following demonstrably exceed what that ceiling would predict.
One reasonable speculation is that Images 2.0 is far more deeply integrated with the GPT-4o language-model architecture than the DALL-E-era systems were, and that its visual output is closer to an "extension" of the language model than to a standalone image generation system.
But this is ultimately speculation. OpenAI's decision not to disclose this information may be due to considerations of commercial competition, or it may be because the model is still iterating. The only thing we can be sure of is that its performance on certain tasks has exceeded the boundaries that existing architecture classifications can predict.
05 Staged-rollout details: codename "duct tape"
Prior to its official release, Images 2.0 quietly launched on the third-party AI testing platform LM Arena under the codename "duct tape," where it ran publicly for several weeks to collect real user feedback.
This detail reflects a shift in OpenAI's release strategy: from withholding a big reveal until launch day to letting real users try the model anonymously before the official announcement. It is a more engineering-driven, risk-controlled approach.
The codename "duct tape" is intriguing in itself; tape implies a temporary connection, forcibly gluing together two mismatched parts. This might just be an arbitrary internal name, but it could also suggest that OpenAI still holds a certain humility towards this current version: it's a phased solution, not the final destination.
06 Competitive Landscape: The Real Competitor is Not Midjourney
In the market, Google's Gemini 3 Pro Image, released in February 2026, can also embed text in images and is on par with Images 2.0 on some tasks. Midjourney still holds its distinct advantage in artistic style generation.
However, describing this competition as a "battle between image generation models" is a complete misunderstanding.
The market space Images 2.0 is really squeezing belongs to a different class of tools: Canva's template editor, Adobe Express's rapid-design features, and the low-complexity production work of small design studios. OpenAI itself has named its target scenarios as localized advertising, infographics, educational content, and brand materials: the daily bread of commercial design, not the periphery of artistic creation.
This positioning means that its potential users are not primarily creative designers, but rather people who need to produce a large amount of visual materials every day but do not have dedicated design resources: brand operators, marketing specialists, content editors, and independent entrepreneurs.
07 Unresolved Issues
Rendering stability: Text rendering in non-English languages remains unstable, and the error rate for complex layouts is still higher than in English. There is a real gap between "significant improvement" and "complete resolution."
Data cutoff: The model's training data runs through December 2025. Thinking mode can search the web, but how search quality feeds into final image quality remains opaque, so results may be unreliable for visuals that depend on the latest events or data.
Content safety: OpenAI specifically emphasized image watermarking and real-time content monitoring, because AI-generated visuals have already been used for political propaganda and disinformation. Stronger generation capabilities and harder-to-detect misuse are two sides of the same coin, and technical iteration alone cannot solve this.
08 Conclusion
Since the release of ChatGPT Images 2.0, the most widely circulated images on social media have been stunning demos: perfect menus, accurate multilingual posters, coherent storyboards. Most were generated under ideal conditions by experienced users; in everyday use, results will not always be this consistent or polished.
The image below was generated by the author from a photo of a kitten, with Chinese text added. The model even gave the kitten its own Chinese name, Xiao Jin. The text in the image is correct, with no typos, but the image quality is clearly a notch below the official demos.
OpenAI is solving a difficult but correct problem. Text rendering has gone from being "basically useless" to "readily usable," crossing a real-world usage threshold.
The "intention gap" hasn't completely disappeared. But it has narrowed, narrowed enough that some people can begin to rethink their workflows.
This article is from the WeChat public account "Emphasis Next" (ID: leo89203898) , author: Xinjian, published with authorization from 36Kr.