University Researchers Publish Analysis of Chain-of-Thought Reasoning in LLMs

Researchers from Princeton University and Yale University published a case study of Chain-of-Thought (CoT) reasoning in LLMs that shows evidence of both memorization and true reasoning. They also found that CoT prompting can work even when the examples given in the prompt are incorrect.

The study was motivated by the persistent debate in the research community about whether LLMs can truly reason or whether their outputs are simply based on heuristics and memorization. The team used a simple task, decoding shift ciphers, as their case study. They found that an LLM's performance with CoT prompting depended on a mix of memorization and what the team described as "noisy" reasoning, as well as the overall probability of the correct output. According to the researchers:

[W]e find evidence that the effect of CoT fundamentally depends on generating sequences of words that increase the probability of the correct answer when conditioned upon; as long as this is the case, CoT can thus succeed even when the demonstrations in the prompt are invalid. In the ongoing debate about whether LLMs reason or memorize, our results thus support a reasonable middle-ground: LLM behavior displays aspects of both memorization and reasoning, and also reflects the probabilistic origins of these models.

The team chose the task of decoding shift ciphers because it has a "sharp dissociation" between its complexity and its frequency in the internet sources used to train LLMs. The task becomes more difficult as the shift value grows, yet the most difficult case, rot-13, is also the variant that appears most often on the internet. If LLMs were simply memorizing, they would perform best on rot-13, since that is the version they have seen most during training. By contrast, if they were truly reasoning, they would perform best on rot-1 and rot-25 and worst on rot-13.
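
To make the task concrete, here is a minimal Python sketch of shift-cipher decoding (an illustration, not the authors' implementation): each letter is shifted back by the shift value, and rot-13 is the special case of a 13-position shift.

def decode_shift_cipher(ciphertext: str, shift: int) -> str:
    # Shift each letter back by `shift` positions, wrapping around the alphabet.
    decoded = []
    for ch in ciphertext:
        if ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            decoded.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            decoded.append(ch)  # leave non-letters untouched
    return "".join(decoded)

print(decode_shift_cipher("freivpr", 13))  # rot-13 -> "service"
print(decode_shift_cipher("tfswjdf", 1))   # rot-1  -> "service"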

Expected and Actual Results. Image Source: Akshara Prabhakar

The team created a dataset of 7-letter words that GPT-4's tokenizer splits into exactly 2 tokens. They also calculated the probability GPT-2 assigns to each word as a completion of the sentence "The word is", which allowed them to control for how likely an LLM would be to output a word based on probability alone. They then produced shifted versions of these words and ran their experiments on GPT-4, Claude 3, and Llama-3.1-405B-Instruct.
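
The filtering and scoring steps can be sketched roughly as follows, assuming the tiktoken library for GPT-4's tokenizer and Hugging Face transformers for GPT-2; the model handles, prompt handling, and helper functions here are illustrative assumptions, not the authors' code.

import tiktoken
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

gpt4_enc = tiktoken.encoding_for_model("gpt-4")

def is_candidate(word: str) -> bool:
    # Keep 7-letter words that GPT-4's tokenizer splits into exactly 2 tokens.
    return len(word) == 7 and len(gpt4_enc.encode(word)) == 2

gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def word_logprob(word: str, prefix: str = "The word is") -> float:
    # Log-probability GPT-2 assigns to " <word>" as a completion of the prefix.
    prefix_ids = gpt2_tok(prefix, return_tensors="pt").input_ids
    word_ids = gpt2_tok(" " + word, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, word_ids], dim=1)
    with torch.no_grad():
        logprobs = gpt2(ids).logits.log_softmax(dim=-1)
    total = 0.0
    for i in range(word_ids.shape[1]):
        # Logits at position t predict the token at position t + 1.
        pos = prefix_ids.shape[1] + i - 1
        total += logprobs[0, pos, ids[0, pos + 1]].item()
    return total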

The team also performed an experiment where, instead of words, the models were asked to decode sequences of numbers using arithmetic. This task is "isomorphic" to the shift-cipher task but uses only numbers. The authors found that on this task GPT-4 performed "nearly perfectly," and concluded that it "has the core reasoning abilities" needed to perform the shift-cipher task accurately for all shift values. Since it did not, they concluded that CoT "is not pure symbolic reasoning." However, they note that CoT does improve performance over "standard" prompting, so CoT is not "simple memorization" either.
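
The isomorphism can be pictured with a small sketch (hypothetical, not the prompt format used in the study): letters map to indices 0-25, so decoding reduces to modular subtraction and requires no knowledge of letters at all.

def decode_number_sequence(nums, shift):
    # Subtract the shift from each number, wrapping around modulo 26.
    return [(n - shift) % 26 for n in nums]

# The indices of "freivpr" (a=0, ..., z=25), decoded with shift 13,
# yield the indices of "service": [18, 4, 17, 21, 8, 2, 4].
print(decode_number_sequence([5, 17, 4, 8, 21, 15, 17], 13))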

Research team member R. Thomas McCoy, a professor at Yale University, posted about the work on X. In response to a question from another user who wondered if different CoT prompts would give different results, he wrote:

Yes, I think there is lots to explore there! [co-author Akshara Prabhakar] did have some cool experiments that involved translating from letters to numbers within the CoT. That generally improved performance, but got a qualitatively similar chart. So that's one case that is similar. But there could well be others that give different trends!

The experimental code and data for the study are available on GitHub.
