Microsoft and Tsinghua University Present DIFF Transformer for LLMs

Researchers from Microsoft AI and Tsinghua University have introduced a new architecture called the Differential Transformer (DIFF Transformer), aimed at improving the performance of large language models. This model enhances attention mechanisms by refining how models handle context and minimizing distractions from irrelevant information.

The key feature of the DIFF Transformer is its differential attention mechanism. Instead of a single softmax attention map, it computes attention scores as the difference between two separate softmax attention maps, which cancels out shared noise and helps the model focus more effectively on the relevant parts of the input. This adjustment improves accuracy, particularly in tasks such as question answering and text summarization.
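
To make the idea concrete, the following is a minimal single-head sketch in PyTorch of the subtraction-of-two-attention-maps scheme described above. The function and parameter names are illustrative, not taken from the paper's code, and details such as causal masking, multi-head grouping, per-head normalization, and the paper's learnable reparameterization of the scalar lambda are omitted.

import torch
import torch.nn.functional as F

def differential_attention(x, w_q, w_k, w_v, lam=0.5):
    # Illustrative sketch of differential attention (names are hypothetical).
    # x:   (batch, seq_len, d_model) input embeddings
    # w_q: (d_model, 2 * d_head) projection producing two query groups
    # w_k: (d_model, 2 * d_head) projection producing two key groups
    # w_v: (d_model, d_head) value projection
    # lam: weight for the second map (a learned scalar in the paper)
    q1, q2 = (x @ w_q).chunk(2, dim=-1)   # two query projections
    k1, k2 = (x @ w_k).chunk(2, dim=-1)   # two key projections
    v = x @ w_v
    d_head = q1.shape[-1]

    # Two independent softmax attention maps over the same sequence.
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d_head ** 0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d_head ** 0.5, dim=-1)

    # Subtracting the second map cancels attention noise common to both,
    # sharpening focus on the relevant tokens.
    return (a1 - lam * a2) @ v

# Example usage with arbitrary shapes:
x = torch.randn(1, 16, 64)
out = differential_attention(x, torch.randn(64, 64), torch.randn(64, 64), torch.randn(64, 32))
print(out.shape)  # torch.Size([1, 16, 32])

Note that the two maps require two sets of query and key projections, which is the doubling of key heads mentioned in the community discussion below.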

The architecture also improves scalability, achieving similar performance to larger models with fewer training resources. This efficiency is beneficial for handling longer sequences of data, making it suitable for tasks that require processing large amounts of information at once.

Experiments show that the DIFF Transformer consistently surpasses traditional transformers in tasks like language modeling and information retrieval, offering improved performance and efficiency in large language models (LLMs). Its design enhances practical applications such as long-context modeling, key information retrieval, hallucination mitigation, and in-context learning, while also reducing activation outliers. These improvements lead to better accuracy across diverse datasets and greater robustness to changes in input order, making the DIFF Transformer more suitable for low-resource environments.

In the paper, the zero-shot performance of the DIFF Transformer is compared against several well-trained Transformer models, including OpenLLaMA-v2-3B, StableLM-base-alpha-3B-v2, and StableLM-3B-4E1T; the DIFF Transformer shows better or comparable results.

Enthusiasts and professionals have shown interest in its real-world applications, particularly in scenarios where prediction accuracy might justify increased computational resources.

Data scientist Kuldeep Singh shares on X:

While Google's Transformer might have introduced "Attention is all you need," Microsoft and Tsinghua_Uni are here with the DIFF Transformer, stating, "Sparse-Attention is all you need."

AI Researcher Manu Otel wrote:

But, the diff transformer comes with a small tradeoff, it has double the key heads.

Discussions around the DIFF Transformer highlight a trade-off between computational cost and prediction accuracy. The model's need to perform attention operations twice could slow down both training and inference, though there is speculation about whether this could lead to better results with fewer training iterations or less data.
