BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage News OpenAI Releases GPT-4o mini Model with Improved Jailbreak Resistance

OpenAI Releases GPT-4o mini Model with Improved Jailbreak Resistance

This item in japanese

OpenAI released GPT-4o mini, a smaller version of their flagship GPT-4o model. GPT-4o mini outperforms GPT-3.5 Turbo on several LLM benchmarks and is OpenAI's first model trained with an instruction hierarchy method that improves the model's resistance to jailbreaks and system prompt extraction.

GPT-4o mini supports the same languages and modalities as the full GPT-4o model, although currently the OpenAI API only allows text and vision, with audio and video input/output "coming in the future." The model also has the same context window, 128k tokens, and the same October 2023 training knowledge cutoff. It has the same built-in safety mitigations as GPT-4o, and in addition was trained using OpenAI's instruction hierarchy training method which gives models up to 30% better robustness against jailbreaks and 60% improved defense against system prompt extraction. On LLM benchmarks such as MMLU and HumanEval, GPT-4o mini outperforms comparable small LLMs such as Gemini Flash and Claude Haiku as well as GPT-3.5. According to OpenAI:

Over the past few years, we’ve witnessed remarkable advancements in AI intelligence paired with substantial reductions in cost...We’re committed to continuing this trajectory of driving down costs while enhancing model capabilities. We envision a future where models become seamlessly integrated in every app and on every website. GPT-4o mini is paving the way for developers to build and scale powerful AI applications more efficiently and affordably. The future of AI is becoming more accessible, reliable, and embedded in our daily digital experiences, and we’re excited to continue to lead the way.

While OpenAI has not published many technical details of the model, the company did recently publish a research paper on training models to follow an instruction hierarchy. The key idea is that many attack vectors against LLMs use the fact that "LLMs often consider system prompts to be the same priority as text from untrusted users and third parties." To address this, OpenAI developed a training dataset that teaches LLMs to ignore "lower-privileged" instructions when they conflict with higher ones.

To evaluate this method, the researchers first fine-tuned a model on the dataset then tested it on a set of both open-source attack benchmarks and proprietary ones. The fine-tuned model showed improved robustness on all benchmarks. The team did notice, however, that the model tended to "over-refuse" on some benchmarks, but they said they do not expect this "to cause noticeable degradations in model behavior" for real-world use cases.

OpenAI CEO Sam Altman posted on X that the company's best model in 2022, text-davinci-003, was "much, much worse" than GPT-4o mini. Also on X, the LMSYS team revealed that:

GPT-4o mini's early version "upcoming-gpt-mini" was tested in Arena in the past week. With over 6K user votes, we are excited to share its early score reaching GPT-4-Turbo performance, while offering significant cost reduction.

However, Wharton professor Ethan Mollick wrote:

First impressions with GPT-4o-mini (what a name) is that it is impressive for a small model but no replacement for a frontier model. When given  complex education prompts it can’t follow instructions as well & misses nuance GPT-4o nails.

GPT-4o mini is available via the OpenAI API as well as in ChatGPT.

About the Author

Rate this Article

Adoption
Style

BT