
Google Publishes LLM Self-Correction Algorithm SCoRe

Researchers at Google DeepMind recently published a paper on Self-Correction via Reinforcement Learning (SCoRe), a technique for improving LLMs' ability to self-correct when solving math or coding problems. Models fine-tuned with SCoRe achieve improved performance on several benchmarks compared to baseline models. 

Unlike previous self-correction methods that rely on prompt engineering or separate "teacher" models, SCoRe uses the LLM's own output to generate self-correction traces: synthetic dialogues in which the LLM gives an incorrect response, receives a correction prompt, and then gives a correct response. This data is used in a two-stage RL process to fine-tune the LLM. When evaluated against baseline Gemini 1.0 models, the fine-tuned LLM improved by 15.6 percentage points on the MATH benchmark and by 9.1 percentage points on HumanEval. According to Google,
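
The self-correction traces are two-turn dialogues produced entirely by the model being trained. The following minimal sketch shows what collecting one such trace might look like; the `model.generate` helper, the `is_correct` checker, and the wording of the correction prompt are illustrative assumptions, not details taken from the paper.

```python
# Sketch of collecting one self-correction trace (illustrative, not the paper's code).
from dataclasses import dataclass

CORRECTION_PROMPT = (
    "There may be an error in the solution above. "
    "Please review it and provide a corrected solution."
)

@dataclass
class SelfCorrectionTrace:
    problem: str
    first_attempt: str       # model's initial (possibly incorrect) response
    second_attempt: str      # model's response after the correction prompt
    first_correct: bool
    second_correct: bool

def build_trace(model, problem: str, is_correct) -> SelfCorrectionTrace:
    """Sample a two-turn trace from the model itself, with no teacher model."""
    first = model.generate(problem)                                      # turn 1
    second = model.generate(f"{problem}\n{first}\n{CORRECTION_PROMPT}")  # turn 2
    return SelfCorrectionTrace(
        problem=problem,
        first_attempt=first,
        second_attempt=second,
        first_correct=is_correct(problem, first),
        second_correct=is_correct(problem, second),
    )
```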

The importance of our two-stage recipe (based on careful initialization and reward shaping) in obtaining positive self-correction perhaps more generally hints that some kind of regularization is required to ensure that LLMs learn nuanced strategies that can generalize well to novel, unseen queries at test-time.

The DeepMind team developed SCoRe after studying the shortcomings of other methods. They note that "there is no major work" showing that prompt engineering alone can produce successful self-correction in off-the-shelf models. Improving the model with supervised fine-tuning (SFT) typically requires a human or a stronger LLM to provide corrections, and methods that apply SFT to self-generated corrections "often [amplify] the model's bias" toward not making corrections, or else "suffer from the curse of distributional shift."

SCoRe Training Stages. Image source: Google DeepMind Research Paper

SCoRe improves on previous methods by using a two-stage RL process. In the first stage, the model is trained to keep its initial response the same but generate a correct response on the second attempt. In the second stage, the model is rewarded for correct answers in both responses, with a bonus reward for an improved second response. The goal is to prevent the model from learning to "produce the best first-attempt response and only minorly edit it."
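
The reward shaping described above can be sketched roughly as follows. The binary correctness rewards and the progress bonus follow the article's description; the KL-penalty weight `beta` and bonus weight `alpha` are illustrative hyperparameters, not values from the paper.

```python
# Illustrative reward shaping for the two training stages (not the paper's exact objective).

def stage1_reward(second_correct: bool, kl_first_turn_to_base: float,
                  beta: float = 0.1) -> float:
    """Stage 1: optimize only the second attempt, while a KL penalty keeps the
    first-attempt distribution close to the base model's (i.e. the initial
    response stays roughly the same)."""
    return float(second_correct) - beta * kl_first_turn_to_base

def stage2_reward(first_correct: bool, second_correct: bool,
                  alpha: float = 1.0) -> float:
    """Stage 2: reward correctness of both attempts, plus a shaped bonus for
    turning an incorrect first attempt into a correct second one."""
    base = float(first_correct) + float(second_correct)
    progress_bonus = alpha * (float(second_correct) - float(first_correct))
    return base + progress_bonus
```

The progress bonus in the second stage is what discourages the collapse the authors describe: a trace that merely repeats a good first attempt earns no bonus, while a genuine correction does.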

In a discussion about SCoRe on Reddit, one user wrote:

Overall, it's interesting that it's taught how to make corrections. But I would have liked to see the 3rd, 4th, 5th turns of a few examples to see what improvements the test runs are producing. Informally, it reads like the 2nd turn can make a big difference, but the subsequent turns have diminishing returns.

Users in a Hacker News discussion compared SCoRe to OpenAI's method of fine-tuning their o1 models:

OpenAI stated that one of the breakthroughs needed for o1's train of thought to work was reinforcement learning to teach it to recover from faulty reasoning. [It's] incredibly similar to this paper, which discusses the difficulty in finding a training method that guides the model to learn a self-correcting technique (in which subsequent attempts learn from and improve on previous attempts), instead of just "collapsing" into a mode of trying to get the answer right with the very first try.

InfoQ covered OpenAI's release of their Omni model earlier this year. InfoQ also covered OpenAI's use of an LLM to generate training data to improve code generated by ChatGPT.
