Meta AI Introduces Thought Preference Optimization Enabling AI Models to Think Before Responding

Researchers from Meta FAIR, the University of California, Berkeley, and New York University have introduced Thought Preference Optimization (TPO), a new method aimed at improving the response quality of instruction fine-tuned LLMs. Unlike traditional training, which focuses solely on final answers, this approach enables LLMs to generate and refine an internal thought process, producing more accurate and coherent responses.

The new technique incorporates a modified Chain-of-Thought (CoT) reasoning method. This approach encourages models to "think before responding" during training, helping them prepare structured internal thoughts before delivering a final answer. While direct CoT prompting can sometimes lower accuracy and is challenging to train due to the lack of explicit thought steps in instruction datasets, TPO addresses these limitations by allowing models to optimize and streamline their thought processes without exposing intermediate steps to users.
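To make the idea concrete, the sketch below shows what a generic "think before responding" prompt might look like, separating a hidden thought section from the user-visible answer. The template wording and the <thought>/<response> tags are illustrative assumptions, not the prompts published with the TPO paper.

```python
# Hypothetical thought-prompt template: wording and tags are assumptions
# for illustration, not Meta's published TPO prompts.

THOUGHT_PROMPT = (
    "Respond to the instruction below. First write your internal thoughts "
    "and a draft answer, then write the final response.\n"
    "Use exactly this format:\n"
    "<thought> ...private reasoning, not shown to the user... </thought>\n"
    "<response> ...the final answer shown to the user... </response>\n\n"
    "Instruction: {instruction}"
)


def build_thought_prompt(instruction: str) -> str:
    """Wrap a plain instruction so the model thinks before it responds."""
    return THOUGHT_PROMPT.format(instruction=instruction)


if __name__ == "__main__":
    print(build_thought_prompt("Write a tagline for a home-baking blog."))
```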

The diagram shows the Thought Preference Optimization (TPO) process, which starts by prompting a large language model (LLM) to generate various thoughts before crafting a response. Outputs are sampled and evaluated by a judge model to identify the best and worst responses. These outputs are then used as chosen and rejected pairs for Direct Preference Optimization (DPO). This iterative training method enhances the model's capacity to produce more relevant and high-quality responses, thereby improving its overall effectiveness.
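A minimal sketch of that sampling-and-judging step is shown below, assuming a policy model that emits <thought>/<response> tags and a judge that returns a numeric score for the final answer only. The callables sample_with_thoughts and judge_score are hypothetical placeholders standing in for the policy and judge models, not an actual Meta API.

```python
import re
from typing import Callable, List, Tuple


def split_thought_and_response(output: str) -> Tuple[str, str]:
    """Separate the hidden thought from the user-visible response."""
    thought = re.search(r"<thought>(.*?)</thought>", output, re.DOTALL)
    response = re.search(r"<response>(.*?)</response>", output, re.DOTALL)
    return (
        thought.group(1).strip() if thought else "",
        response.group(1).strip() if response else output.strip(),
    )


def build_dpo_pairs(
    prompts: List[str],
    sample_with_thoughts: Callable[[str, int], List[str]],  # policy model (placeholder)
    judge_score: Callable[[str, str], float],                # judge model (placeholder)
    num_samples: int = 4,
) -> List[dict]:
    """Build chosen/rejected pairs for DPO from judge scores.

    The judge sees only the final response, but the full output
    (thought + response) is stored as the completion, so DPO also
    shapes the hidden thinking.
    """
    pairs = []
    for prompt in prompts:
        outputs = sample_with_thoughts(prompt, num_samples)
        scored = sorted(
            outputs,
            key=lambda out: judge_score(prompt, split_thought_and_response(out)[1]),
        )
        pairs.append({"prompt": prompt, "chosen": scored[-1], "rejected": scored[0]})
    return pairs
```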

In this method, training prompts are adjusted to encourage the model to think internally before responding. This sequence guides the LLM to refine its responses for greater clarity and relevance. Responses are then evaluated by an LLM-based judge model that scores only the final answers, allowing the model to enhance response quality based on effectiveness alone, independent of the hidden thought steps. TPO also uses Direct Preference Optimization (DPO) by creating preferred and rejected response pairs, including the hidden thoughts, to refine the model’s internal processes over multiple training cycles.
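For reference, the sketch below shows the standard DPO objective that such chosen/rejected pairs would feed into, computed from sequence log-probabilities under the policy being trained and a frozen reference model. The tensor names and the beta value are illustrative; this is the generic DPO loss, not code released with the paper.

```python
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logp: torch.Tensor,    # log-prob of chosen (thought + response) under the policy
    policy_rejected_logp: torch.Tensor,  # log-prob of rejected output under the policy
    ref_chosen_logp: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logp: torch.Tensor,
    beta: float = 0.1,                   # illustrative regularization strength
) -> torch.Tensor:
    """Standard DPO loss: prefer the judge's chosen output over the rejected one."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```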

The reported benchmark win rates (%) cover AlpacaEval (length-controlled, LC) and Arena-Hard. TPO is compared against the direct-response baseline, Llama-3-8B-Instruct, and against Llama-3-8B-Instruct with thought prompting. The latter does not perform well on its own but serves as the initialization for the first iteration of TPO training. Through iterative training, TPO optimizes thought generation and ultimately outperforms both baselines. Several well-known LLMs, typically larger than the TPO model, are also included as references.

The TPO method goes beyond logic and math tasks, proving beneficial for diverse instruction-following tasks, including creative areas such as marketing and health.

AI & Robot doctor Karan Verma shared the following post on X:

I'm intrigued by the concept of Thinking LLMs and its potential to revolutionize AI technology. As a digital health enthusiast, I'm curious to see how this innovation can be applied to healthcare applications and improve patient outcomes.

Structured internal thought processes allow the model to handle complex instructions more effectively, potentially extending its reach to tasks that require layered reasoning and nuanced understanding, without any human-provided thought data. The research suggests that TPO could make LLMs more adaptable and effective across varied contexts, particularly in applications that demand both flexibility and depth in response generation.
