Researchers at Google have open-sourced EvoLved sIgn mOmeNtum (Lion), an optimization algorithm for training neural networks, which was discovered using an automated machine learning (AutoML) evolutionary algorithm. Models trained with Lion can achieve better accuracy on several benchmarks than models trained with other optimizers, while requiring fewer compute cycles to converge.
Google used an evolutionary search over symbolic programs to discover the algorithm. Lion uses less memory and fewer instructions than Adam, the de-facto standard optimization algorithm. One key difference between Lion and other optimizers is that Lion uses only the sign of the gradient, applying an update of the same magnitude to every weight. Using Lion, the team trained several models for various benchmark tasks. The one trained for image classification achieved new state-of-the-art accuracy on ImageNet, and a ViT trained with Lion gained up to 2% ImageNet accuracy while using up to 5x less training compute.
Deep learning models are trained by using a gradient descent algorithm to iteratively update the network's weights so as to minimize the network's loss function. Two commonly used algorithms are Adam and Adafactor, which have been used to train most state-of-the-art models. While these algorithms are much improved over earlier ones, the Google team wondered whether AutoML could improve on them further.
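As a minimal illustration of this training loop (a generic sketch with a toy quadratic loss, not code from the paper), one plain gradient descent step in Python looks like this:

```python
import numpy as np

def gradient_descent_step(w, grad_fn, lr=0.1):
    """One plain gradient-descent update: step against the gradient of the loss."""
    return w - lr * grad_fn(w)

# Toy example: minimize L(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0, 3.0])
for _ in range(100):
    w = gradient_descent_step(w, lambda w: 2.0 * w, lr=0.1)
print(w)  # converges toward the minimum at the zero vector
```

Optimizers such as Adam, Adafactor, and Lion differ in how they transform the raw gradient into the per-step weight update.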
To develop Lion, the team defined their search space as a set of functions written in an imperative language similar to Python; each function takes the model weights, gradient, and learning rate as inputs and outputs the weight updates. The search proceeds by mutating function code: adding, deleting, or modifying statements. Candidate functions are evaluated by using them to train small ML models; the trained models' accuracy serves as the candidate training function's fitness.
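To make the setup concrete, a candidate program in this search space is roughly a function of the weights, gradient, and learning rate that returns a weight update. The sketch below is hypothetical, not a program from the paper's search: its body happens to be ordinary momentum SGD, and the extra state variable `m` is added here purely for illustration, to show the kind of statements the evolutionary search adds, deletes, or modifies.

```python
def candidate_train_step(w, g, m, lr):
    """Hypothetical candidate program: inputs are the weights, gradient, extra
    state, and learning rate; outputs are the weight update and updated state.
    Evolutionary search mutates programs like this one statement at a time."""
    m = 0.9 * m + g       # a statement a mutation could modify or delete
    update = -lr * m      # a mutation could also insert new statements here
    return update, m
```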
After running several experiments, the team settled on a final function "due to its simplicity, memory efficiency, and strong performance..." Compared to Adam and Adafactor, Lion is a simpler algorithm with fewer hyperparameters. Because it updates every weight by the same fixed magnitude, scaled only by the learning rate, it requires a learning rate 3x to 10x smaller than Adam's, along with a decoupled weight decay 3x to 10x larger to keep the effective regularization strength (the product of learning rate and weight decay) roughly the same.
Lion pseudocode compared to Adam pseudocode. Image Source: https://arxiv.org/abs/2302.06675
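In code, the Lion update shown in the pseudocode amounts to the following minimal NumPy sketch (based on the paper's algorithm; β1 = 0.9 and β2 = 0.99 are the defaults reported in the paper, and the decoupled weight decay term is optional):

```python
import numpy as np

def lion_step(w, g, m, lr, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update. Only the sign of the interpolated gradient/momentum is
    used, so every weight moves by the same magnitude, scaled by the learning rate."""
    update = np.sign(beta1 * m + (1 - beta1) * g)   # elementwise +/-1 (0 if exactly zero)
    w = w - lr * (update + weight_decay * w)        # decoupled weight decay, as in AdamW
    m = beta2 * m + (1 - beta2) * g                 # momentum: EMA of past gradients
    return w, m
```

Because the update direction is a sign vector, Lion keeps only one moving average per weight, which is where its memory savings over Adam (which tracks both first and second moments) come from.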
The team used Lion to train many previously published models across a range of task domains, including vision, language, and multimodal tasks; they compared the accuracy and training compute requirements of models trained with Lion against those trained with Adam or Adafactor. Although the Lion-trained models outperformed the Adam-trained ones in many instances, there were cases where the performance was similar, especially on tasks where "the datasets are massive and of high quality," such as vision tasks and masked language modeling. The authors also note that Lion likely performs no better than Adam when the batch size is small.
Several members of the research team responded to user questions about the work on Twitter. One commenter asked why the team compared Lion to Adafactor instead of to Adam for some models. Co-lead author Chen Liang replied:
[W]e compared to AdamW for most of the language benchmarks (smaller scale LM, masked LM, finetuning, etc). For [language models with more than 1B parameters], we compared to Adafactor because it is the default in the baseline and performs similar to AdamW with less cost.
Python implementations of Lion for several popular deep learning frameworks, including PyTorch and TensorFlow, are available on GitHub.
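As a usage sketch, such an implementation would typically act as a drop-in replacement for a standard PyTorch optimizer. The `lion_pytorch` module and `Lion` constructor arguments below are assumptions based on common packaging conventions, so check the specific repository for the actual import path and API:

```python
import torch
from lion_pytorch import Lion  # assumed package/module name; verify against the repo you install

model = torch.nn.Linear(784, 10)

# Rule of thumb from the paper: relative to AdamW, use a 3x-10x smaller learning
# rate and a correspondingly larger decoupled weight decay.
optimizer = Lion(model.parameters(), lr=1e-4, weight_decay=1e-2)

# One training step on a random batch, exactly as with any other PyTorch optimizer.
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
optimizer.zero_grad()
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```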