DeepSeek open-sourced DeepSeek-V3, a Mixture-of-Experts (MoE) LLM containing 671B parameters. It was pre-trained on 14.8T tokens using 2.788M GPU hours and outperforms other open-source models on a range of LLM benchmarks, including MMLU, MMLU-Pro, and GPQA.
DeepSeek-V3 is based on the same MoE architecture as DeepSeek-V2 but features several improvements. V3 uses a new auxiliary-loss-free load balancing strategy and a Multi-Token Prediction (MTP) training objective (sketched below). The DeepSeek team also improved training efficiency by switching to mixed-precision training with the FP8 number format and by improving the parallelism and cross-node communication in their training framework. The team evaluated the model on several benchmarks and compared it to baseline LLMs including Qwen2.5, Llama 3.1, Claude 3.5 Sonnet, and GPT-4o; DeepSeek-V3 outperformed the other models on a majority of tests, including five coding benchmarks and three mathematics benchmarks. According to DeepSeek:
While acknowledging its strong performance and cost-effectiveness, we also recognize that DeepSeek-V3 has some limitations, especially on the deployment...[A]lthough our deployment strategy for DeepSeek-V3 has achieved an end-to-end generation speed of more than two times that of DeepSeek-V2, there still remains potential for further enhancement. Fortunately, these limitations are expected to be naturally addressed with the development of more advanced hardware.
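The MTP objective mentioned above trains the model to predict multiple future tokens at each position rather than only the next token. The snippet below is a minimal, generic sketch of such a loss, assuming a backbone that exposes final hidden states and a hypothetical list of per-depth prediction heads; DeepSeek-V3's actual MTP modules are sequential Transformer blocks that share the embedding layer and output head, which this sketch does not reproduce.

```python
import torch.nn.functional as F

def multi_token_prediction_loss(hidden, heads, tokens):
    """Illustrative multi-token prediction loss.

    hidden: [batch, seq, d_model] final hidden states from the backbone.
    heads:  list of nn.Linear projections to vocabulary logits, where
            heads[k] predicts the token (k + 1) positions ahead.
    tokens: [batch, seq] input token ids.
    """
    loss = 0.0
    for k, head in enumerate(heads, start=1):
        logits = head(hidden[:, :-k])   # predictions for positions t + k
        targets = tokens[:, k:]         # ground-truth tokens k steps ahead
        loss = loss + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
        )
    return loss / len(heads)
```

With a single head this reduces to the standard next-token cross-entropy; each additional head adds a prediction target further into the future.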
DeepSeek-V3 keeps the DeepSeekMoE architecture of V2, including the Multi-Head Latent Attention (MLA) scheme. Because it is an MoE model, only 37B of the total 671B parameters are activated for each token during inference. The new load balancer provides a "better trade-off between load balance and model performance" by introducing a bias term for each expert that is adjusted during training.
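In outline, each expert's bias is added to its token-affinity score only when deciding which experts a token is routed to, while the weights used to combine the selected experts' outputs ignore the bias; after each step the bias is nudged down for overloaded experts and up for underloaded ones. The sketch below illustrates the idea; the function names, the sign-based update rule, and the update speed are illustrative assumptions rather than DeepSeek's exact implementation.

```python
import torch

def biased_topk_routing(scores, bias, k):
    """Pick top-k experts per token using affinity + bias for selection,
    but weight the chosen experts by affinity alone, so the bias affects
    only the routing decision, never the model's output."""
    topk = torch.topk(scores + bias, k, dim=-1).indices
    gates = torch.gather(scores, -1, topk)
    gates = gates / gates.sum(dim=-1, keepdim=True)
    return topk, gates

def update_bias(bias, expert_load, update_speed=1e-3):
    """After a training step, push biases of overloaded experts down and
    of underloaded experts up so future tokens spread out more evenly."""
    mean_load = expert_load.float().mean()
    return bias - update_speed * torch.sign(expert_load.float() - mean_load)
```

Because balance is enforced by adjusting the bias rather than by an auxiliary loss term, the balancing pressure does not add a competing gradient to the language-modeling objective.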
The model was trained on a compute cluster of 2,048 NVIDIA H800 GPUs; each node in the cluster contained eight GPUs interconnected with NVLink and NVSwitch, and the nodes were connected by InfiniBand (IB). The team built their training framework, HAI-LLM, from the ground up. They developed a pipeline parallelism algorithm called DualPipe, which "has fewer pipeline bubbles," and optimized memory usage to enable training "without using costly Tensor Parallelism."
After pre-training, DeepSeek-V3 was instruction-tuned on a dataset of 1.5M examples drawn from several domains, including mathematics and coding. This process combined supervised fine-tuning with reinforcement learning; the latter used both rule-based and model-based rewards.
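For verifiable domains such as mathematics and code, rule-based rewards check an output deterministically, for example by comparing a final answer against a reference or running test cases, while model-based rewards score responses that lack a single ground truth. Below is a minimal, hypothetical example of a rule-based reward, assuming the model is prompted to place its final answer in a \boxed{} block; the exact rules DeepSeek used are not reproduced here.

```python
import re

def rule_based_math_reward(response: str, reference_answer: str) -> float:
    """Hypothetical rule-based reward: 1.0 if the model's boxed final
    answer matches the reference answer exactly, otherwise 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0
```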
DeepSeek-V3 Benchmark Results. Image Source: DeepSeek-V3 Technical Report
Open source software developer Aldo Cortesi ran his own benchmark of DeepSeek-V3 and posted the results on X:
Incredible - [tied for first with Sonnet] for our practical coding examples, while being twice as fast as Sonnet. Also note that DeepSeek v3 made ZERO prompt adherence errors - the only model I've ever tested to do this.
Django framework co-creator Simon Willison also wrote about DeepSeek-V3 on his blog:
This is by far the highest ranking openly licensed model. The really impressive thing about DeepSeek v3 is the training cost. The model was trained on 2,788,000 H800 GPU hours at an estimated cost of $5,576,000. Llama 3.1 405B trained 30,840,000 GPU hours—11x that used by DeepSeek v3, for a model that benchmarks slightly worse.
The DeepSeek-V3 code is available on GitHub and the model files can be downloaded from Hugging Face.