Google unveils a new model in the early morning, and OpenAI rushes out GPT-4o image generation, which can "Photoshop" pictures just by being told what you want. Netizens: thank DeepSeek again

36kr
03-26

In the early morning of March 26th, Beijing time, Google released Gemini 2.5 Pro, billed as its most capable reasoning model. Shortly before that, OpenAI got ahead of it with a livestream and released GPT-4o image generation, an image generation model. Interestingly, over the past six months, nearly every Google release has collided with an OpenAI livestream.

OpenAI releases GPT-4o, native multimodal image generation capabilities

“Starting today, OpenAI is integrating new image generation capabilities directly into ChatGPT — the feature is called ‘Images in ChatGPT’. Users can now generate images inside ChatGPT using GPT-4o,” OpenAI said.

This initial release is focused solely on image creation and will be available in the ChatGPT Plus, Pro, Team, and Free subscription tiers.

Notably, the GPT-4o tokenizer vocabulary (effectively the number of unique integers used to represent text) has increased from roughly 100k in GPT-4 and GPT-3.5 to roughly 200k. Gujarati input now uses 4.4x fewer tokens, Japanese 1.4x fewer, and Spanish 1.1x fewer. Previously, languages other than English paid a substantial price in how much text could fit in a prompt.
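For readers who want a rough feel for that change, here is a minimal sketch (ours, not from the announcement) using the open-source tiktoken library, which exposes both the older cl100k_base encoding used by GPT-4/GPT-3.5 and the newer o200k_base encoding used by GPT-4o. The sample sentences are illustrative only, so the counts will not match OpenAI's published per-language figures.

```python
# Compare the older and newer tokenizers described above with tiktoken.
# The sample sentences are illustrative; counts will differ from OpenAI's.
import tiktoken

old_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-3.5 (~100k vocab)
new_enc = tiktoken.get_encoding("o200k_base")   # GPT-4o (~200k vocab)

samples = {
    "English":  "Hello, how are you today?",
    "Spanish":  "Hola, ¿cómo estás hoy?",
    "Japanese": "こんにちは、今日はお元気ですか？",
}

for lang, text in samples.items():
    before = len(old_enc.encode(text))
    after = len(new_enc.encode(text))
    print(f"{lang}: {before} -> {after} tokens ({before / after:.2f}x fewer)")
```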

Also worth noting is the price. OpenAI says GPT-4o is 50% cheaper than GPT-4 Turbo. For a more intuitive comparison, GPT-4o costs exactly 10 times as much as GPT-3.5: GPT-4o is $5 per million input tokens and $15 per million output tokens, while GPT-3.5 is $0.50 per million input tokens and $1.50 per million output tokens.
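To make that comparison concrete, here is a small worked example (ours, not an official calculator) that applies the per-million-token rates quoted above to a hypothetical request size.

```python
# Worked example of the cost arithmetic above, using the per-million-token
# rates quoted in the article (USD). The request size is hypothetical.
PRICES = {
    "gpt-4o":        {"input": 5.00, "output": 15.00},
    "gpt-3.5-turbo": {"input": 0.50, "output": 1.50},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single request with the given token counts."""
    rate = PRICES[model]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000

# A 2,000-token prompt that produces a 500-token reply:
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 500):.5f}")
# gpt-4o:        $0.01750
# gpt-3.5-turbo: $0.00175   (10x cheaper at these rates)
```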

The price drop is particularly notable because OpenAI has promised to make the model available to free ChatGPT users as well — the first time they’ve made their “best” model available directly to non-paying customers.

“This model is a big improvement over previous models,” OpenAI research head Gabriel Goh told CNBC, adding that the team built the feature on GPT-4o, an “omnimodal” model that can generate any type of data, such as text, images, audio, and video.

OpenAI stated in the announcement that GPT-4o’s image generation function has the following features:

  • Accurately renders text within images, making it possible to create logos, menus, invitations, infographics, and more;
  • Precisely executes complex instructions, even in highly detailed compositions;
  • Builds on previous images and text to keep visuals consistent across multiple interactions;
  • Supports a variety of art styles, from photorealism to illustration and beyond.

Let's start by looking at the quality of the generated images.

During its official demo, OpenAI showed a picture of a woman writing on a whiteboard with her back to the camera.

The picture looks like an ordinary everyday photo, but it is in fact an AI image generated by GPT-4o. The prompt OpenAI used is as follows:

“A wide image captured with a cell phone of a glass whiteboard in a room overlooking the Bay Bridge. A woman is writing in the picture, wearing a T-shirt with a prominent OpenAI logo. The handwriting is natural and slightly messy, and the photographer’s figure is projected on the whiteboard.”

In the second picture, the camera angle was changed: from the photographer's selfie perspective, the woman in the picture turns to high-five him, and the generated image still does not look like it was produced by AI.

It can also generate four-panel comic strips, taking care to leave blank space between the panel borders and the edge of the image. The prompt is as follows:

"A small snail is sitting on a fancy car showroom counter, and the salesman has to lean over to see him. In one particular shot, the snail has a serious expression and says, 'I want your fastest sports car... with a capital 'S' painted on the doors, hood, and roof.'

The salesman scratched his head, "Uh...of course it's okay. But why is it "S"? "

The scene cuts to a red car whizzing down the highway with a giant “S” written all over it. People on the side of the road are pointing and laughing and saying, ‘WOW! LOOK AT THAT S‑CAR GO!’”

Generate an infographic that explains Newton's prism experiment in detail.

Then, in the same conversation: generate a first-person view of a person sitting at a coffee table in Washington Square Park, drawing this graphic in a notebook.

Then, still in the same scene: show an excited young Newton sitting at the table, holding up a prism to demonstrate the result of the experiment, and be careful not to show the notebook in the picture.

Multiple improvements for better image generation

According to OpenAI's official statement, GPT-4o has been improved in many aspects compared to previous models:

  • Better text integration: Unlike past AI models that struggled to generate clear, well-placed text, GPT-4o can now accurately embed text into images;
  • Enhanced contextual understanding: GPT-4o lets users continuously refine images across the conversation by leveraging chat history, and maintains visual consistency across multiple rounds of generation;
  • Improved multi-object binding: While past models had difficulty correctly localizing multiple different objects in a scene, GPT-4o can now handle up to 10-20 objects at a time;
  • Diverse style adaptation: The model can generate or convert images into a variety of styles, supporting the conversion from hand-drawn sketches to high-definition realistic styles.

OpenAI said that from the first cave paintings to modern infographics, humans have always used visual imagery to communicate, convey ideas, and analyze. Today's generative models can render surreal, stunning scenes, but they struggle with the practical images people use to share and create information. In fact, from logos to charts, images built on shared language and symbols tied to common experience can often convey meaning precisely.

GPT-4o image generation is good at rendering text accurately, following prompts faithfully, and drawing on 4o's built-in knowledge and chat context, including directly transforming uploaded images or using them as visual inspiration. These capabilities make it easy to create the images people imagine, help users communicate through visuals, and turn image generation into a practical tool with real precision and real-world usefulness.

By training the model on online images and text, GPT-4o image generation learns not only the associations between images and language but also how the two correspond in practice. Combined with aggressive post-training, the resulting model shows surprising visual fluency and can generate images that are highly practical, consistent, and context-aware.

A picture is worth a thousand words, but sometimes just a few words in the right place can significantly enhance the expression of an image. 4o combines precise symbols with images to make image generation truly visually communicative.

OpenAI released some official examples.

Create a realistic image of two witches in their 20s (one with gray highlights and the other with long wavy auburn hair) reading a street sign.

Prompt:

On a street in Williamsburg, New York, the street signs display a large amount of detailed signage (street-sweeping times, parking-permit requirements, vehicle classifications, and towing regulations), along with some whimsical additions presented as legitimate street signs, such as "No parking of witches' brooms in Zone C", "Magic carpet unloading only (no more than 15 minutes)", and "Reindeer parking by permit only (December 24-25); violators will be put on the naughty list." The signs are on the right side of the street, their content must not repeat, and they must be rendered authentically.

Characters: One witch holds a broom, the other a rolled-up magic carpet. They are in the foreground, facing away from the camera, heads slightly tilted as they study the street signs. Composition from background to foreground: street + parked cars + buildings -> street sign -> witches. The characters must be in the position closest to the camera.

Multiple rounds of generation

Image generation is now a native capability of GPT-4o, so users can refine images through natural conversation. GPT-4o can build on the images and text already in the chat context, keeping the content consistent throughout. For example, if a user is designing a video game character, the character's appearance stays consistent across multiple iterations as the user keeps refining and experimenting.

In the video game scene, refer to the input cat image and add a detective hat and a monocle to the cat.

Convert the image into AAA-video-game-style graphics rendered with a 4K game engine, and add UI elements for an RPG-style overlay: a health bar and a mini-map at the top, with spell icons in the same style below.

Update the image to a 16:9 landscape format, add more spell elements to the UI, and zoom out so the generated cat is seen from a third-person perspective as it walks through the streets of a steampunk Manhattan. Use the cool colors, strong contrast, and lighting effects common in AAA games.

Create an interface that shows the kitten's character profile and equipment when the player opens the menu, plus another page showing the current quest (the quest should fit the world depicted in the image).

Instruction following

GPT-4o's image generation follows detailed prompts with close attention to detail. While other systems often struggle with images containing 5 to 8 objects, GPT-4o can handle 10 to 20 distinct objects, with tighter control over how each object, its attributes, and its relationships to the others are bound together.

Generate a square image consisting of a 4-row, 4-column grid containing 16 objects on a white background. From left to right and top to bottom, the objects are:

  1. A blue star
  2. A red triangle
  3. A green square
  4. A pink circle
  5. An orange hourglass
  6. A purple infinity symbol
  7. A black-and-white polka dot bow tie
  8. A tie-dye textured number 42
  9. An orange cat wearing a black baseball cap
  10. A map with a treasure chest
  11. A pair of big eyes
  12. A thumbs-up emoji
  13. A pair of scissors
  14. A blue and white giraffe
  15. The word "OpenAI" written in cursive
  16. A rainbow lightning bolt
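Prompts like this are easier to keep consistent when they are assembled programmatically. The sketch below is our illustration, not OpenAI's code: it rebuilds the same 16-object grid prompt from a plain Python list so the object order and count stay easy to verify and edit.

```python
# Illustration only: assemble the 16-object grid prompt above from a list.
objects = [
    "a blue star", "a red triangle", "a green square", "a pink circle",
    "an orange hourglass", "a purple infinity symbol",
    "a black-and-white polka dot bow tie", "a tie-dye textured number 42",
    "an orange cat wearing a black baseball cap", "a map with a treasure chest",
    "a pair of big eyes", "a thumbs-up emoji", "a pair of scissors",
    "a blue and white giraffe", 'the word "OpenAI" written in cursive',
    "a rainbow lightning bolt",
]
assert len(objects) == 16  # 4 rows x 4 columns

prompt = (
    "Generate a square image consisting of a 4-row, 4-column grid containing "
    "16 objects on a white background. From left to right and top to bottom, "
    "the objects are:\n"
    + "\n".join(f"{i}. {item}" for i, item in enumerate(objects, start=1))
)
print(prompt)
```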

Realism and Graphic Style

By incorporating imagery in a wide range of visual styles into its training data, the 4o model is able to generate or transform images realistically.

A paparazzi-style photo of Karl Marx hurrying through the Mall of America parking lot, looking back with a horrified expression, not wanting to be harassed by the cameras. He clutches several shiny shopping bags filled with luxury goods. His coat flaps in the wind, and one of the bags swings with his stride. The background, the cars, and the glowing mall entrance are blurred to emphasize motion. The camera flash is partially overexposed, giving the shot a tabloid feel.

Although the generated images are vivid and realistic, OpenAI admits that the model is not perfect and has identified a number of limitations, which it will continue to work on after the initial release.

In a media interview, Goh also said: “Ultimately, no system is perfect, but we are constantly improving our safeguards, and we think this is a starting point. All images generated in ChatGPT have one thing in common: users own them and can use them as they wish within the scope of our usage policies.”

Additionally, OpenAI allows the generation of images of public figures, as well as historically inaccurate images when users explicitly request them.

With this update, OpenAI is paying more attention to safety than ever before.

OpenAI said, "According to the model specification, we hope to maximize creative freedom by supporting use cases with real value such as game development, historical exploration, and education, while maintaining strict safety standards. In other words, blocking illegal requests is a necessary prerequisite for ensuring the implementation of the system. We are working hard to ensure safe and highly useful content through the following means, while supporting users to express their inspiration and ideas widely through creativity."

First, traceability through C2PA and an internal reverse search tool. All generated images currently carry C2PA metadata indicating that they come from GPT-4o, to ensure openness and transparency. In addition, OpenAI has built an internal search tool that uses properties of the generation process to help verify whether a piece of content came from its model.
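The public half of that traceability story can be inspected with open tooling. The sketch below is not OpenAI's verifier; it assumes the Content Authenticity Initiative's c2patool CLI is installed and prints the manifest store as JSON, and the file name is hypothetical.

```python
# A minimal sketch for inspecting C2PA metadata on a downloaded image.
# Assumption: the open-source `c2patool` CLI is installed and prints the
# manifest store as JSON. This is not OpenAI's internal search tool.
import json
import subprocess

def read_c2pa_manifest(image_path: str) -> dict:
    """Return the C2PA manifest store of an image as a Python dict."""
    result = subprocess.run(
        ["c2patool", image_path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

if __name__ == "__main__":
    manifest = read_c2pa_manifest("chatgpt_image.png")  # hypothetical file
    # Inspect the claim data for the generator that produced the image.
    print(json.dumps(manifest, indent=2))
```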

Second, OpenAI says it will firmly block harmful content. It will continue to block requests to generate images that may violate its content policies, such as child sexual abuse material and deepfake pornography. For images of real people, it will tighten the restrictions on what can be created and take especially strict measures against nudity and graphic violence. Safety upgrades will never be finished and will remain an area of continued investment.

Third, reasoning is used to strengthen safety. OpenAI has trained a reasoning model that identifies and resolves ambiguities in policy based on interpretable safety specifications written by humans. Combined with the multimodal safety techniques used in ChatGPT and Sora, this allows input text and output images to be handled flexibly under existing policies.

However, although 4o image generation surpasses DALL·E 3 in gender diversity, its outputs still skew toward male subjects. OpenAI says future work will focus on improving data balance and making the model fairer.

Access and availability

As the default image generator in ChatGPT, 4o image generation is now available to Plus, Pro, Team, and Free users, with Enterprise and Edu access coming later. Sora also benefits from the upgrade. Users who wish to keep using DALL·E can do so through the dedicated DALL·E GPT.

Developers will soon be able to use GPT-4o’s image generation capabilities through an API, with access set to open in the coming weeks.

OpenAI says the whole image creation and customization process is as easy as chatting with GPT-4o: just describe what you need, including details such as the aspect ratio, exact colors as hexadecimal codes, or a transparent background. Because the model generates more detailed images, rendering can take longer, up to about a minute.
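For developers waiting on the API access mentioned above, here is a speculative sketch of what a call might look like. It reuses the existing openai Python SDK images endpoint; the model identifier "gpt-4o" is an assumption (the announcement names no API model), and the aspect-ratio, hex-color, and transparency details are carried in the prompt text, as the article describes for the chat experience.

```python
# Speculative sketch only: what developer access might look like once the
# API opens, reusing the existing `openai` Python SDK images endpoint.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="gpt-4o",  # assumed identifier, not confirmed by the announcement
    prompt=(
        "A flat logo of a paper crane in 16:9 aspect ratio, primary color "
        "#1E90FF, on a transparent background."
        # aspect ratio, hex color, and transparency are described in plain
        # language, as the article says for the chat experience
    ),
    n=1,
)
print(result.data[0].url)
```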

Reference Links:

https://openai.com/index/introducing-4o-image-generation/

This article comes from the WeChat public account "AI Frontline" , compiled by Dongmei, and published by 36Kr with authorization.
