OpenAI recently announced GPT-4o, the latest version of their GPT family of AI foundation models. GPT-4o is faster than previous GPT-4 models and has improved capabilities in handling speech, vision, and multilingual tasks, outperforming all models except Google's Gemini on several benchmarks.
The "o" in GPT-4o stands for "omni," reflecting the model's multi-modal capabilities. While previous versions of ChatGPT supported voice input and output, this used a pipeline of models: a distinct speech-to-text to provide input to GPT-4, followed by a text-to-speech model to convert GPT-4's text output to voice. The new model was trained end-to-end to handle audio, vision, and text, which reduces latency and gives GPT-4o access to more information from the input as well as control over the output. OpenAI evaluated the model on a range of benchmarks, including common LLM benchmarks as well as their own AI safety standards. The company has also performed "extensive external red teaming" on the model to discover potential risks in its new modalities. According to OpenAI:
We recognize that GPT-4o’s audio modalities present a variety of novel risks. Today we are publicly releasing text and image inputs and text outputs. Over the upcoming weeks and months, we’ll be working on the technical infrastructure, usability via post-training, and safety necessary to release the other modalities. For example, at launch, audio outputs will be limited to a selection of preset voices and will abide by our existing safety policies. We will share further details addressing the full range of GPT-4o’s modalities in the forthcoming system card.
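For developers, the difference between the pipeline approach described above and a single end-to-end model is visible at the API level. The following is a minimal sketch, not OpenAI's internal implementation, of the older three-model voice pipeline using the OpenAI Python SDK; the file names and the choice of whisper-1, gpt-4-turbo, and tts-1 are illustrative assumptions:

```python
# Sketch of the pre-GPT-4o voice pipeline: three separate model calls.
# File names and model IDs are illustrative, not OpenAI's internal setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Speech-to-text: transcribe the user's spoken question
with open("question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Text reasoning: send the transcript to a text-only chat model
completion = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = completion.choices[0].message.content

# 3. Text-to-speech: synthesize the reply with a preset voice
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply_text,
)
with open("reply.mp3", "wb") as out:
    out.write(speech.content)
```

Because tone, emphasis, and other audio cues are discarded at the transcription step and reintroduced only generically at the synthesis step, this pipeline loses information that a single end-to-end model can retain.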
OpenAI gave a demo of GPT-4o and its capabilities in their recent Spring Update livestream hosted by CTO Mira Murati. Murati announced that the new model will be rolled out to free users, along with access to features such as custom GPTs and the GPT Store which were formerly available only to paid users. She also announced that GPT-4o would be available via the OpenAI API and claimed the model was 2x faster than GPT-4 Turbo, with 5x higher rate limits.
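Calling the new model through the API follows the same chat completions interface as earlier GPT-4 models. The snippet below is a minimal sketch using the OpenAI Python SDK with the gpt-4o model name and a placeholder image URL, reflecting the text and image inputs that are publicly available at launch:

```python
# Minimal sketch of a GPT-4o call with mixed text and image input.
# The image URL is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is shown in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```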
In a Hacker News discussion about the release, one user noted:
The most impressive part is that the voice uses the right feelings and tonal language during the presentation. I'm not sure how much of that was that they had tested this over and over, but it is really hard to get that right so if they didn't fake it in some way I'd say that is revolutionary.
OpenAI CEO Sam Altman echoed this sentiment in a blog post:
The new voice (and video) mode is the best computer interface I’ve ever used. It feels like AI from the movies; and it’s still a bit surprising to me that it’s real. Getting to human-level response times and expressiveness turns out to be a big change. The original ChatGPT showed a hint of what was possible with language interfaces; this new thing feels viscerally different. It is fast, smart, fun, natural, and helpful.
Along with GPT-4o, OpenAI released a new macOS desktop app for ChatGPT. This app supports voice mode for conversing with the model, with the ability to add screenshots to the discussion. OpenAI has also launched a simplified "look and feel" for the ChatGPT web interface.