Microsoft Research announced Phi-2, a 2.7 billion-parameter Transformer-based language model. Phi-2 was trained on 1.4T tokens of mixed web and synthetic data, with the synthetic portion generated by GPT-3.5, and it outperforms larger models on a variety of benchmarks.
Phi-2 is the latest iteration of Microsoft's Phi suite of models, which are trained on a mixture of web-crawled and synthetic "textbook-quality" datasets. The previous Phi models contained only 1.3B parameters but showed excellent performance on coding and reasoning tasks. Phi-2 is roughly twice the size of its predecessors and was trained for two weeks on a cluster of 96 A100 GPUs. Its performance is comparable to that of models up to 25x larger, and it outperforms the 70B-parameter Llama-2 model on reasoning, language understanding, and coding benchmarks. According to Microsoft:
With its compact size, Phi-2 is an ideal playground for researchers, including for exploration around mechanistic interpretability, safety improvements, or fine-tuning experimentation on a variety of tasks. We have made Phi-2 available in the Azure AI Studio model catalog to foster research and development on language models.
InfoQ recently covered several efforts to replicate the abilities of large language models (LLMs) in smaller models. Many of these use LLMs such as ChatGPT to generate synthetic training datasets for the smaller model. Google's Distilling Step-by-Step method prompts a teacher LLM to automatically generate a small fine-tuning dataset in which each example contains an input, an output label, and a "rationale" explaining why that label was chosen. Microsoft Research's Orca 2 uses a synthetic training dataset and a new technique called Prompt Erasure to achieve performance equal to or better than models that contain 10x the number of parameters.
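To make the rationale-based distillation idea concrete, the sketch below shows how a teacher model could be prompted to emit an input, a label, and a rationale for each training record. This is only an illustration under assumed details: the prompt wording, the JSON record format, and the use of the OpenAI chat API are not taken from the Distilling Step-by-Step paper.

```python
# Sketch of rationale-augmented data generation in the spirit of
# Distilling Step-by-Step. The prompt text and JSON record format are
# illustrative assumptions, not the paper's exact setup.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are generating fine-tuning data for a small student model.\n"
    "Question: {question}\n"
    'Reply with JSON containing "label" (the final answer) and '
    '"rationale" (a short step-by-step explanation).'
)

def distill_example(question: str) -> dict:
    """Return one {input, label, rationale} record from the teacher model."""
    response = client.chat.completions.create(
        model="gpt-4",  # any capable teacher model
        messages=[{"role": "user", "content": PROMPT.format(question=question)}],
    )
    record = {"input": question}
    # Assumes the teacher complies with the requested JSON format.
    record.update(json.loads(response.choices[0].message.content))
    return record

# The student model is then fine-tuned on both targets, typically with a
# multi-task loss over the label and the rationale strings.
```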
The key innovation with the Phi series of models is a synthetic dataset of "textbook-like" data. Although the researchers have not released the dataset or many details of how it was generated, previous tech reports on the Phi models include high-level descriptions. One goal for the datasets was to generate "diverse and non-repetitive" examples that cover a range of "concepts, skills, and scenarios" and vary in "level of difficulty, complexity, and style." For Phi-1.5, the team selected 20k different topics to seed generated examples of language-understanding problems.
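Microsoft has not published the generation pipeline, so the following is only a rough sketch of the topic-seeded approach described above: each request to a teacher model is conditioned on a sampled topic, difficulty, and style, which helps keep the resulting corpus diverse and non-repetitive. The topic list, prompt wording, and helper function are invented for illustration.

```python
# Illustrative sketch only: Microsoft has not released its data pipeline.
# Each generation prompt is seeded with a sampled topic, difficulty, and
# style so that the synthetic corpus stays diverse and non-repetitive.
import random

TOPICS = ["photosynthesis", "binary search", "supply and demand", "Newton's laws"]
DIFFICULTIES = ["introductory", "intermediate", "advanced"]
STYLES = ["lecture notes", "a worked example", "a dialogue between student and teacher"]

def textbook_prompt(rng: random.Random) -> str:
    """Compose one topic-conditioned generation prompt for a teacher LLM."""
    return (
        f"Write {rng.choice(DIFFICULTIES)} {rng.choice(STYLES)} "
        f"about {rng.choice(TOPICS)}, in clear textbook prose, "
        "including one exercise with its solution."
    )

rng = random.Random(0)
prompts = [textbook_prompt(rng) for _ in range(5)]
# Each prompt would be sent to the teacher model; deduplication and
# quality filtering of the responses would follow before training.
```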
Sebastien Bubeck, who leads the Machine Learning Foundations team at Microsoft Research, posted on X about some additional work fine-tuning Phi-2:
phi-2 is really a good base for further fine-tuning: we [fine-tune] on 1M math exercises (similar to phi-1 w. CodeExercises) & test on recent French nation-wide math exam (published after phi-2 finished training). The results are encouraging! Go try your own data...
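The exercise dataset Bubeck mentions is not public, but a continued fine-tuning run of this kind could look roughly like the sketch below, which uses the Hugging Face Transformers Trainer. The file name exercises.jsonl and the hyperparameters are placeholders, not the values used by the Phi team.

```python
# Minimal supervised fine-tuning sketch for Phi-2 with Hugging Face
# Transformers. "exercises.jsonl" (a file of {"text": ...} records) and
# the hyperparameters are placeholders, not the Phi team's settings.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
tokenizer.pad_token = tokenizer.eos_token  # Phi-2's tokenizer has no pad token
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# Tokenize the exercise texts for causal-language-modeling training.
dataset = load_dataset("json", data_files="exercises.jsonl", split="train")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="phi2-math-ft",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```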
Mark Tenenholtz, head of AI at Predelo, also posted about Phi-2, remarking that "knowledge distillation really does work." In a Hacker News discussion about Phi-2, one user estimated the compute cost of training the model at around 30k USD, or "cheaper than a car." Another pointed out:
Note the model is trained on data generated by GPT-4. It's probably orders of magnitude more expensive to generate the data at current API prices. The whole point of these papers is that training data quality is key. I would much prefer for these companies to release the training data than the weights.
The Phi-2 model weights are available on HuggingFace.
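For readers who want to try the released weights, loading them from the Hugging Face Hub takes only a few lines with the Transformers library; the sketch below is a minimal example, and the prompt is just an illustration.

```python
# Minimal example of loading the released Phi-2 weights and generating text.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
# device_map="auto" places the model on GPU if one is available
# (requires the accelerate package).
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", device_map="auto")

prompt = "Write a Python function that checks whether a number is prime."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```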