OpenAI Releases Improved Image Generation in GPT-4o

OpenAI released a new version of GPT-4o with native image generation capability. The model can modify uploaded images or create new ones from prompts. It keeps images consistent across multiple turns of refinement and renders text within images more accurately.

OpenAI's CEO Sam Altman announced the release in a recent livestream. Unlike the previous iteration of the chat model, which invoked an external model like DALL-E to generate images, the new model is trained to handle image output as a native modality. It uses an autoregressive generation method, while models like DALL-E and Stable Diffusion use a diffusion method. According to OpenAI:

GPT‑4o image generation excels at accurately rendering text, precisely following prompts, and leveraging 4o’s inherent knowledge base and chat context—including transforming uploaded images or using them as visual inspiration. These capabilities make it easier to create exactly the image you envision, helping you communicate more effectively through visuals and advancing image generation into a practical tool with precision and power.
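
The difference between autoregressive and diffusion generation mentioned above can be illustrated with a toy sketch. OpenAI has not published the model's architecture, so nothing here reflects how GPT-4o actually works. The function names, the 4x4 grid, the 8-token vocabulary, and the stand-in "models" are all hypothetical, chosen only to show the shape of the two sampling loops: one emits discrete tokens sequentially, each conditioned on the tokens before it, while the other starts from noise and refines the whole image at every step.

```python
import random

VOCAB_SIZE = 8  # hypothetical number of discrete image tokens
GRID = 4        # hypothetical 4x4 grid of tokens or pixels


def toy_next_token(prefix, rng):
    """Stand-in for a learned model: choose the next token given the prefix."""
    # Half the time, repeat the previous token so the output has some structure.
    if prefix and rng.random() < 0.5:
        return prefix[-1]
    return rng.randrange(VOCAB_SIZE)


def autoregressive_generate(seed=0):
    """Emit tokens one at a time, each conditioned on all earlier tokens."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(GRID * GRID):
        tokens.append(toy_next_token(tokens, rng))
    # Reshape the flat token sequence into rows of the image grid.
    return [tokens[r * GRID:(r + 1) * GRID] for r in range(GRID)]


def diffusion_generate(steps=10, seed=0):
    """Start from noise and refine every pixel of the image at each step."""
    rng = random.Random(seed)
    target = 0.5  # stand-in for what a learned denoiser would predict
    image = [[rng.random() for _ in range(GRID)] for _ in range(GRID)]
    for _ in range(steps):
        # Each step moves all pixels halfway toward the denoiser's prediction.
        image = [[p + 0.5 * (target - p) for p in row] for row in image]
    return image
```

In the first loop the image is built up piece by piece in a fixed order; in the second, the full image exists from the start and is cleaned up over repeated passes.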

OpenAI trained the new model on a combination of image and text data and then applied what it called "aggressive post-training." While OpenAI did not release technical details about the model or its performance on benchmarks, they released several sample images and the prompts used to generate them. OpenAI claims that the model can generate images with "up to 10-20 different objects," although it may "struggle to accurately render more."

As a safety feature, the images generated by GPT-4o include C2PA tags showing that they were generated by AI. OpenAI also built an internal tool to help determine if an image was generated by their models. OpenAI will block generation of images that violate their content policies, but Kevin Weil, CPO of OpenAI, wrote on X that:

If you explicitly ask for something edgy (within reason), the model should respect your intent. As we said in our model spec, giving users creative control matters, and we'll continue listening and adapting based on feedback.

OpenAI updated the 4o model's system card to describe its potential risks and the mitigations taken, including extensive red-teaming exercises. The system card also lists cases where the model will refuse to generate images: for example, it will refuse prompts that ask for images in the style of a living artist. However, in a change to previous policy, the model will generate images of a public figure, as long as the images do not otherwise violate OpenAI policy.

Hacker News users commented on the quality of the generated images, particularly mentioning the model's ability to correctly render text in images. One user wrote:

It very much looks like a side effect of this new architecture. In my experience, text looks much better in recent DALL-E images (so what ChatGPT was using before), but it [was] still noticeably mangled when printing more than a few letters. This model update seems to improve text rendering by a lot, at least as long as the content is clearly specified.

OpenAI noted that the model "struggles" with rendering languages that use non-Latin characters and might produce text that is "inaccurate or hallucinated."

About the Author

Anthony Alford
