How GPT-4o took away Midjourney's job

36kr
04-11
On March 26, 2025, OpenAI officially announced that GPT-4o's native multimodal image generation was live. Users no longer need to go through OpenAI's standalone text-to-image model DALL-E; they can call 4o directly inside the ChatGPT app to generate and modify images. Overnight, AI-generated Ghibli-style images spread like wildfire across social platforms like X, flooding the internet with a wave of gentle anime imagery. People uploaded selfies, typed "Ghibli style," and within seconds the warmth and fantasy of Miyazaki's animation flowed through strands of hair and the edges of clothing. Even OpenAI CEO Sam Altman shared a Ghibli-style avatar of himself, adding fuel to the trend.

The Ghibli boom, however, is only one facet. More critically, **GPT-4o's image generation has broken the existing landscape of text-to-image tools, challenging leading applications like Midjourney for the first time.** When people generated images with Midjourney, they still faced a fatal problem: high randomness, with details degrading sharply once prompts became complex. GPT-4o's leap in controllability lets people experience, for the first time, the appeal of precise image editing through multi-round dialogue with an AI artist.

Second, control over generating and editing images has improved: GPT-4o faithfully carries out the instructions you give it. Given the instruction "generate a scene of a cat and a dog playing on the grass," GPT-4o produces exactly one cat and one dog playing on the grass, with no unexpected elements, whereas Midjourney might add an extra park or building to the lawn. Put simply, GPT-4o is better at understanding human language: like a diligent assistant, it does exactly what you ask, nothing more and nothing less, with greater precision.

As a result, GPT-4o opens up a general-purpose track and enters real work scenarios. Previously, ordinary users turned to Midjourney mostly out of interest: it was strong as entertainment but weak as a tool. The oil-painting, anime, and other stylized images it generated looked good, but they neither improved work efficiency nor generated income, serving mainly a stylistic purpose.

GPT-4o's ability to edit images through natural language broadens the range of industries where AI image generation applies, moving it from entertainment and artistry toward professionalism and productivity in fields such as e-commerce, education, architecture, and design. For instance, if your child struggles with homework, you previously had to hire a tutor or download a homework-help app: tutoring is expensive, and the apps offer only dry, confusing text explanations. GPT-4o, by contrast, can generate a step-by-step explanatory diagram showing how a function is constructed and how the answer is derived, with a smooth, natural progression.

Take promotional posters in e-commerce as another example. Suppose a client needs an English-language poster for the European and American markets, with both the design elements and the copy localized. Previously, this meant coordinating with designers to adjust elements, refining the text with translation software, and then making changes in Photoshop - time-consuming and labor-intensive. Now a single instruction to GPT-4o, such as "restyle this poster for a European and American audience and switch the text to English," can quickly produce a poster that meets the brief, demonstrating strong cross-domain, cross-disciplinary integration.

03 Beyond Drawing, the Next Stop for Large Models is an Integrated Platform

After discussing GPT-4o's image generation breakthrough, let's explore what other potential this underlying model might have.

We know that Midjourney is an application built on a model, but GPT-4o itself is a model, with image generation being just one of its capabilities. When ChatGPT first emerged in 2022, it was merely a text conversation assistant, then gained voice call abilities, and now can draw images, continuously iterating and upgrading across different dimensions.

GPT-4o's breakthrough on the image generation track comes down to the emergent capabilities of its native multimodal model. Unlike Midjourney, GPT-4o has more technological paths to explore. Most current text-to-image applications use diffusion models, which start from noise and progressively denoise a rough image into a sharp one, like painting in snow or seeing through fog, which limits how faithfully fine details can be restored. GPT-4o instead uses an autoregressive text-to-image model, extending the familiar next-token prediction and reasoning logic to images: it generates the image piece by piece, predicting each new pixel or image token from those already generated, much as a person paints stroke by stroke. This means that, unlike vertical applications, a general large model can choose a different technological path at the architectural level, and architectural upgrades typically bring leaps in performance, so native models like GPT-4o have more room for functional growth.
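The autoregressive idea described above can be illustrated with a toy sketch. To be clear, this is not OpenAI's implementation: the conditional "model" below is a hand-written stand-in that simply leans toward agreeing with already-generated neighbors. What the loop does show is the key structural point: every pixel (token) is sampled conditioned only on pixels produced before it, in a fixed raster order.

```python
# Toy autoregressive image generation: produce an H x W binary image one
# "token" (pixel) at a time, each conditioned on pixels generated so far.
# The conditional distribution is a trivial stand-in, not a learned network.
import random

def predict_next(canvas, row, col):
    """Return P(pixel = 1) given the previously generated pixels.
    Stand-in for a learned conditional model: favor the left/top neighbors."""
    left = canvas[row][col - 1] if col > 0 else 0
    top = canvas[row - 1][col] if row > 0 else 0
    # Smoothed toward the already-generated context, bounded in [0.1, 0.9].
    return 0.1 + 0.8 * (left + top) / 2

def generate(h, w, seed=0):
    rng = random.Random(seed)
    canvas = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):  # strict raster order: condition only on the past
            p = predict_next(canvas, r, c)
            canvas[r][c] = 1 if rng.random() < p else 0
    return canvas

img = generate(4, 4)
```

A diffusion model, by contrast, would start from a full grid of noise and refine every pixel jointly over many denoising steps; the autoregressive factorization sketched here is what makes next-token prediction on text and on images the same underlying operation.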

Second, multimodal integration will bring cross-domain comprehension. As a general-purpose large model, GPT-4o can integrate information in different formats - text, audio, and images - and can already hold voice conversations and generate and edit images. Directly generating music and video in the future is worth anticipating. In fact, GPT-4o's current image generation grew out of OpenAI's text-to-image work on DALL-E; perhaps OpenAI's text-to-video model Sora will likewise be folded into the GPT line through some technique. At that point, processing multiple modalities within a single model will no longer be distant.

This multimodal integration further illustrates that as models become multi-functional, their ability to handle varied tasks grows stronger, lowering the overall cost of using AI. A foreseeable trend is that large models are trying to become a one-stop shop, bundling tasks like coding, design, music, and data processing. A future model like ChatGPT might become so capable that it ranks among the top three in any given domain, removing the need to download specialized applications such as Midjourney for images, Coze for building bots, or Suno for music; a single ChatGPT-like app could handle everything. That would free up phone storage, improve performance, and could save roughly a hundred dollars a month in specialized subscription fees - a better value overall.

In short, GPT-4o's image generation breakthrough reveals the underlying large model's capacity to absorb multiple applications. The vision that follows is that, in the future, we might use drawing, music, coding, and other capabilities within a single one-stop model. And its barrier to entry would be extremely low - accessible to anyone, with no technical background or even prior AI knowledge required.

And perhaps this is the ultimate goal of humans inventing AI - to make technology universally accessible.

This article is from the WeChat public account "Brain Extreme" (ID: unity007), authored by Coral, published by 36Kr with authorization.
