Microsoft and Tsinghua University Present DIFF Transformer for LLMs

Researchers from Microsoft AI and Tsinghua University have introduced a new architecture called the Differential Transformer (DIFF Transformer), aimed at improving the performance of large language models. The architecture refines the attention mechanism so that models concentrate on relevant context and are less distracted by irrelevant information.

The key feature of the DIFF Transformer is its differential attention mechanism. It computes attention scores as the difference between two separate softmax attention maps; the subtraction cancels attention noise common to both maps and helps the model focus more effectively on relevant parts of the input. This adjustment improves accuracy, particularly in tasks like question answering and text summarization.
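
As a rough illustration, here is a minimal single-head sketch in NumPy of the differential attention idea: two query/key projections produce two softmax attention maps, and one map, scaled by a factor lambda, is subtracted from the other before the values are applied. The projection names and the fixed lambda value are illustrative assumptions; the paper uses multi-head projections and a learned, reparameterized lambda.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.8):
    # x: (seq_len, d_model); projection matrices: (d_model, d_head)
    d = Wk1.shape[1]
    a1 = softmax((x @ Wq1) @ (x @ Wk1).T / np.sqrt(d))  # first attention map
    a2 = softmax((x @ Wq2) @ (x @ Wk2).T / np.sqrt(d))  # second attention map
    # Subtracting the two maps cancels attention noise common to both,
    # sharpening focus on the relevant tokens.
    return (a1 - lam * a2) @ (x @ Wv)

# Shape check with random weights (illustrative only).
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 6, 16, 8
x = rng.standard_normal((seq_len, d_model))
Wq1, Wk1, Wq2, Wk2, Wv = (0.1 * rng.standard_normal((d_model, d_head)) for _ in range(5))
print(diff_attention(x, Wq1, Wk1, Wq2, Wk2, Wv).shape)  # (6, 8)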

The architecture also improves scalability, achieving performance similar to that of larger models with fewer training resources. This efficiency is beneficial for handling longer sequences of data, making it suitable for tasks that require processing large amounts of information at once.

Experiments show that the DIFF Transformer consistently surpasses traditional transformers in tasks like language modeling and information retrieval, offering improved performance and efficiency in large language models (LLMs). Its design enhances practical applications such as long-context modeling, key information retrieval, hallucination mitigation, and in-context learning while also reducing activation outliers. These improvements lead to better accuracy across diverse datasets and greater robustness to changes in input order, making the DIFF Transformer more suitable for low-resource environments.

The paper compares the zero-shot performance of the DIFF Transformer with several well-trained Transformer models, including OpenLLaMA-v2-3B, StableLM-base-alpha-3B-v2, and StableLM-3B-4E1T; the DIFF Transformer shows better or comparable results.

Enthusiasts and professionals have shown interest in its real-world application, particularly in scenarios where prediction accuracy might justify increased computational resources.

Data scientist Kuldeep Singh shared on X:

While Google's Transformer might have introduced "Attention is all you need," Microsoft and Tsinghua_Uni are here with the DIFF Transformer, stating, "Sparse-Attention is all you need."

AI Researcher Manu Otel wrote:

But, the diff transformer comes with a small tradeoff, it has double the key heads.

Discussions around the DIFF Transformer highlight a trade-off between computational cost and prediction accuracy. The model's need to perform attention operations twice could slow down both training and inference, but there's speculation on whether this could lead to better results with fewer training iterations or less data.
