Apple released OpenELM, a Transformer-based language model. OpenELM uses a layer-wise scaling strategy to allocate parameters more efficiently across its Transformer layers, and it outperforms similarly-sized models while requiring fewer tokens for training.
Along with the model, Apple released its full framework, including data-preparation and training code. Since OpenELM was trained solely on publicly-available data, the model is fully reproducible by anyone. The researchers trained four sizes of the model: 270M, 450M, 1.1B, and 3B parameters; each is available in a base and an instruction-tuned variant. The research team's experiments show that the instruction-tuned variants achieve 1 to 2 percentage points better performance on benchmarks. According to Apple,
Diverging from prior practices that only provide model weights and inference code, and pre-train on private datasets, our release includes the complete framework for training and evaluation of the language model on publicly available datasets, including training logs, multiple checkpoints, and pre-training configurations. We also release code to convert models to MLX library for inference and fine-tuning on Apple devices. This comprehensive release aims to empower and strengthen the open research community, paving the way for future open research endeavors.
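As a brief illustration of the on-device path mentioned in the quote, the sketch below runs a converted checkpoint on an Apple-silicon Mac with the community mlx-lm package. The repository identifier and API usage are assumptions for illustration; Apple's release ships its own conversion code.

```python
# A minimal sketch of on-device inference with the mlx-lm package
# (pip install mlx-lm). The repo id below is an assumed community
# conversion of OpenELM to MLX format, used here only for illustration.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/OpenELM-270M-Instruct")  # assumed repo id
text = generate(model, tokenizer, prompt="Once upon a time", max_tokens=50)
print(text)
```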
One key feature of OpenELM is its layer-wise scaling. In contrast to most Transformer-based models, which use the same number of attention heads and feed-forward dimensions in every layer, OpenELM uses smaller values in the "lower" layers (closer to the input) and larger ones in the higher layers. This gives the model better accuracy for a given total number of parameters.
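To make the idea concrete, here is a minimal sketch of such a schedule, assuming a linear interpolation of an attention-head scaler and an FFN-width multiplier across depth. The constants below are illustrative, not OpenELM's published configuration.

```python
# Hypothetical layer-wise scaling: compute per-layer attention-head counts
# and FFN widths by linearly interpolating scaling factors across depth.
# All constants (d_model, head_dim, alpha/beta ranges) are illustrative.

def layer_wise_scaling(num_layers, d_model, head_dim,
                       alpha=(0.5, 1.0), beta=(0.5, 4.0)):
    configs = []
    for i in range(num_layers):
        t = i / (num_layers - 1)                    # 0.0 at the first layer, 1.0 at the last
        a = alpha[0] + (alpha[1] - alpha[0]) * t    # attention-head scaler
        b = beta[0] + (beta[1] - beta[0]) * t       # FFN-width multiplier
        n_heads = max(1, round(a * d_model / head_dim))
        ffn_dim = int(b * d_model)
        configs.append({"layer": i, "n_heads": n_heads, "ffn_dim": ffn_dim})
    return configs

# Lower layers get fewer heads and narrower FFNs; higher layers get more.
for cfg in layer_wise_scaling(num_layers=4, d_model=1024, head_dim=64):
    print(cfg)
```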
OpenELM is trained on a mix of publicly-available datasets, including The Pile and RedPajama; in total, the pre-training mix contains about 1.8T tokens. For instruction tuning, the team used UltraFeedback, a publicly-available dataset of 60k prompts. The tuning process used both rejection sampling and direct preference optimization.
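For readers unfamiliar with the latter, below is a minimal sketch of the direct-preference-optimization (DPO) loss on paired responses. This is an illustrative implementation, not Apple's tuning code.

```python
# A minimal sketch of the DPO objective. Inputs are summed log-probabilities
# of the chosen/rejected responses under the policy being tuned and under a
# frozen reference model; beta is the usual DPO temperature.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of policy vs. reference for preferred and dispreferred responses
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Push the margin between the two ratios up via a logistic loss
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```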
Apple's researchers evaluated OpenELM using LM Evaluation Harness, measuring its performance on a range of tasks, including common-sense reasoning and language-understanding tasks from the OpenLLM leaderboard. The team compared their model to several models with similar parameter counts, including MobiLlama and OLMo. OpenELM outperformed these baseline models by up to 2.36 percentage points, even though Apple used "2× less pre-training data." OpenELM's results have not been reported to the OpenLLM leaderboard, but data from Apple's experiments suggests it would rank near the current top 10 results.
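As an illustration, the following sketch scores a checkpoint with the harness's Python API (lm-eval >= 0.4). The model identifier, tokenizer, and task selection are assumptions; Apple's exact evaluation configurations ship with the release.

```python
# A sketch of evaluating a checkpoint with EleutherAI's LM Evaluation Harness.
# The pretrained/tokenizer identifiers and task list are illustrative only;
# the tokenizer repo referenced on OpenELM's model card is gated.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=apple/OpenELM-450M,"
        "trust_remote_code=True,"
        "tokenizer=meta-llama/Llama-2-7b-hf"
    ),
    tasks=["arc_easy", "hellaswag", "winogrande"],  # common-sense reasoning tasks
)
for task, metrics in results["results"].items():
    print(task, metrics)
```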
Andrew Ng's AI newsletter, The Batch, highlighted OpenELM. Ng noted that the model "fell short on MMLU", with a score only slightly better than random chance:
To be fair, the other models chosen for comparison didn’t do much better. It’s possible that publicly available data isn’t sufficient for learning to solve MMLU. By comparison, Microsoft’s Phi-3-mini (3.8 billion parameters trained on web data filtered according to "educational level" plus generated data) achieved 68.8 percent accuracy.
In a discussion about OpenELM on Reddit, one user pointed out:
The portability and isolation is where the value in this comes from. Now companies can train models without also essentially feeding data to 3rd parties.
The OpenELM code is available on GitHub, and the model weights are available on Hugging Face.
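For those who want to try the released weights, here is a minimal loading sketch with the transformers library. It assumes the identifiers given on the model card: the checkpoints use custom modeling code (hence trust_remote_code), and the card points to a gated Llama-2 tokenizer.

```python
# A minimal sketch of loading the released weights from Hugging Face.
# Identifiers are taken from the model card and should be treated as
# assumptions; the tokenizer repo is gated and requires access approval.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "apple/OpenELM-270M", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```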