Meta recently open-sourced Code Llama, a code-generation LLM based on the Llama 2 foundation model that carries the same community license. Code Llama was fine-tuned on 500B tokens of code and is available in three model sizes of up to 34B parameters. In evaluations on code-generation benchmarks, the model outperformed other open-source models and performed comparably to ChatGPT.
Meta used three sizes of the Llama 2 foundation model (7B, 13B, and 34B parameters) as starting points for Code Llama. These were fine-tuned on a "near-deduplicated" dataset of code as well as natural language related to code, such as questions and discussions. In addition to the base version, Meta trained two variants at each model size: Code Llama - Python, which is further fine-tuned on Python code; and Code Llama - Instruct, which is fine-tuned on natural-language instructions. All nine model versions are licensed for commercial use. According to Meta,
Code Llama is designed to support software engineers in all sectors – including research, industry, open source projects, NGOs, and businesses. But there are still many more use cases to support than what our base and instruct models can serve...We hope that Code Llama will inspire others to leverage Llama 2 to create new innovative tools for research and commercial products.
InfoQ previously covered other code-generation AI models, including OpenAI's Codex, which is based on GPT-3 and powers GitHub's Copilot. Like the other models in the GPT series, Codex is only available via OpenAI's web service API. This has prompted the development of open models such as BigCode's StarCoder, which has the added advantage of being trained on "permissively-licensed" code, so use of its output is unlikely to result in license violations. While Llama 2 and its derivatives, including Code Llama, are licensed for commercial use, the Code Llama license notes that its output "may be subject to third party licenses."
In addition to fine-tuning the models on code, Meta performed long-context fine-tuning (LCFT), which increases the length of input the model can handle. While Llama 2 was trained on sequences of up to 4k tokens, the LCFT for Code Llama uses sequences of up to 16k tokens. Meta's goal was "unlocking repository-level reasoning for completion or synthesis," giving the model access to an entire project's code instead of only a single function or source file. Meta's experiments show that the model exhibits "stable behavior" on sequences of up to 100k tokens.
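To illustrate what a longer context window enables in practice, the following minimal sketch concatenates several files from a repository into a single prompt and passes it to a Code Llama checkpoint via the Hugging Face transformers library. The model id, file names, and generation settings are illustrative assumptions, not part of Meta's release.

```python
# Minimal sketch: repository-level prompting with a Code Llama checkpoint.
# The model id and file names below are assumptions for illustration only.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "codellama/CodeLlama-7b-hf"  # assumed Hugging Face checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Concatenate several source files into one prompt, relying on the
# 16k-token context window from long-context fine-tuning.
repo_files = ["utils.py", "models.py", "train.py"]  # hypothetical project files
prompt = "\n\n".join(open(path).read() for path in repo_files)
prompt += "\n\n# Complete this function using the helpers defined above:\ndef evaluate(model, dataset):"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```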
In a Twitter/X thread about the model, Furkan Gözükara, an assistant professor at Toros University, noted that GPT-4 still outperformed Code Llama on the HumanEval benchmark. Another user replied that GPT-4 is "not 34B," implying that GPT-4 is a far larger model. The makers of Phind, an AI assistant for programmers, released a fine-tuned version of the 34B-parameter Code Llama - Python which they claim achieves a 69.5% pass@1 score on HumanEval, beating GPT-4's published score of 67%. One of the developers joined a Hacker News discussion about the release and said:
This model is only the beginning -- it's an early experiment and we'll have improvements next week.
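For context on the pass@1 numbers quoted above: HumanEval results are reported with the pass@k metric, which estimates the probability that at least one of k generated samples passes a problem's unit tests. The sketch below reproduces the standard unbiased estimator from the original HumanEval paper; it is background for interpreting the scores, not code from the Code Llama or Phind releases.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total samples generated for a problem
    c: number of those samples that pass the unit tests
    k: number of samples "allowed" per problem
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# pass@1 reduces to c/n, the fraction of generated samples that are correct:
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```

The benchmark score is this value averaged over all problems in the suite, so a 69.5% pass@1 means the model's first attempt solved roughly seven out of ten HumanEval problems.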
The Code Llama source code is available on GitHub. The model files can be downloaded after applying for approval from Meta.