University Researchers Publish Analysis of Chain-of-Thought Reasoning in LLMs

Researchers from Princeton University and Yale University published a case study of Chain-of-Thought (CoT) reasoning in LLMs that shows evidence of both memorization and true reasoning. They also found that CoT prompting can work even when the examples given in the prompt are incorrect.

The study was motivated by the persistent debate in the research community about whether LLMs can truly reason or whether their outputs are simply based on heuristics and memorization. The team used a simple task, decoding shift ciphers, as their case study. They found that an LLM's performance with CoT prompting depended on a mix of memorization and what the team described as "noisy" reasoning, as well as the overall probability of the correct output. According to the researchers:

[W]e find evidence that the effect of CoT fundamentally depends on generating sequences of words that increase the probability of the correct answer when conditioned upon; as long as this is the case, CoT can thus succeed even when the demonstrations in the prompt are invalid. In the ongoing debate about whether LLMs reason or memorize, our results thus support a reasonable middle-ground: LLM behavior displays aspects of both memorization and reasoning, and also reflects the probabilistic origins of these models.

The team chose the task of decoding shift ciphers because it has a "sharp dissociation" between its complexity and its frequency in the internet sources used to train LLMs. The task becomes more difficult as the shift value grows, yet the most difficult case, rot-13, is also the variant that appears most often on the internet. If LLMs were simply memorizing, they would perform best on rot-13, since that is the version they have seen most during training. By contrast, if they were truly reasoning, they would perform best on rot-1 and rot-25 and worst on rot-13.
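
To make the task concrete, here is a minimal Python sketch of shift-cipher decoding (an illustration, not the authors' implementation): each letter is shifted back by the shift value, and rot-13 is the special case of a 13-position shift.

def decode_shift_cipher(ciphertext: str, shift: int) -> str:
    # Shift each letter back by `shift` positions, wrapping around the alphabet.
    decoded = []
    for ch in ciphertext:
        if ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            decoded.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            decoded.append(ch)  # leave non-letters untouched
    return "".join(decoded)

print(decode_shift_cipher("freivpr", 13))  # rot-13 -> "service"
print(decode_shift_cipher("tfswjdf", 1))   # rot-1  -> "service"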

Expected and Actual Results. Image Source: Akshara Prabhakar

The team created a dataset of 7-letter words that GPT-4's tokenizer splits into exactly 2 tokens. They also calculated the probability GPT-2 assigns to each word as a completion of the sentence "The word is", which allowed them to control for how likely an LLM would be to output a word based on probability alone. They then produced shifted versions of these words and ran their experiments on GPT-4, Claude 3, and Llama-3.1-405B-Instruct.
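
The filtering and scoring steps can be sketched roughly as follows, assuming the tiktoken library for GPT-4's tokenizer and Hugging Face transformers for GPT-2; the model handles, prompt handling, and helper functions here are illustrative assumptions, not the authors' code.

import tiktoken
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

gpt4_enc = tiktoken.encoding_for_model("gpt-4")

def is_candidate(word: str) -> bool:
    # Keep 7-letter words that GPT-4's tokenizer splits into exactly 2 tokens.
    return len(word) == 7 and len(gpt4_enc.encode(word)) == 2

gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def word_logprob(word: str, prefix: str = "The word is") -> float:
    # Log-probability GPT-2 assigns to " <word>" as a completion of the prefix.
    prefix_ids = gpt2_tok(prefix, return_tensors="pt").input_ids
    word_ids = gpt2_tok(" " + word, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, word_ids], dim=1)
    with torch.no_grad():
        logprobs = gpt2(ids).logits.log_softmax(dim=-1)
    total = 0.0
    for i in range(word_ids.shape[1]):
        # Logits at position t predict the token at position t + 1.
        pos = prefix_ids.shape[1] + i - 1
        total += logprobs[0, pos, ids[0, pos + 1]].item()
    return total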

The team also performed an experiment where, instead of words, the models were asked to decode sequences of numbers using arithmetic. This task is "isomorphic" to the shift-cipher task but uses only numbers. The authors found that on this task GPT-4 performed "nearly perfectly," and concluded that it "has the core reasoning abilities" needed to perform the shift-cipher task accurately for all shift values. Since it did not, they concluded that CoT "is not pure symbolic reasoning." However, they note that CoT does improve performance over "standard" prompting, so CoT is not "simple memorization" either.
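
The isomorphism can be pictured with a small sketch (hypothetical, not the prompt format used in the study): letters map to indices 0-25, so decoding reduces to modular subtraction and requires no knowledge of letters at all.

def decode_number_sequence(nums, shift):
    # Subtract the shift from each number, wrapping around modulo 26.
    return [(n - shift) % 26 for n in nums]

# The indices of "freivpr" (a=0, ..., z=25), decoded with shift 13,
# yield the indices of "service": [18, 4, 17, 21, 8, 2, 4].
print(decode_number_sequence([5, 17, 4, 8, 21, 15, 17], 13))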

Research team member R. Thomas McCoy, a professor at Yale University, posted about the work on X. In response to a question from another user who wondered if different CoT prompts would give different results, he wrote:

Yes, I think there is lots to explore there! [co-author Akshara Prabhakar] did have some cool experiments that involved translating from letters to numbers within the CoT. That generally improved performance, but got a qualitatively similar chart. So that's one case that is similar. But there could well be others that give different trends!

The experimental code and data for the study are available on GitHub.
