The BigCode Project recently released The Stack, a 6.4TB dataset of de-duplicated source code from permissively licensed GitHub repositories, which can be used to train code generation AI models. BigCode also released SantaCoder, a 1.1B parameter code generation model trained on The Stack. SantaCoder outperforms similar open-source code generation models.
BigCode is a collaborative organization sponsored by HuggingFace and ServiceNow Research, with the mission of developing responsible and open-source language models. In response to recent criticism of some code generation AI models for using copyrighted code in their training data, BigCode began investigating the performance of models trained only on source code with permissive licenses, such as Apache or MIT. BigCode also created web-based tools for developers to determine if their code is contained in The Stack and to request it be excluded. To test the performance of models trained on The Stack, BigCode trained SantaCoder, which outperforms previous open-source code generation models on the MultiPL-E benchmark. According to BigCode:
We release all permissively licensed files for 30 common programming languages, along with a near-deduplicated version. In future work, we would like to further improve the released dataset. We are open to releasing data of other programming languages, plan to work on methods for removing PII and malicious code, and start experimenting with giving developers the possibility to have their data removed from the dataset. We hope The Stack will be a useful resource for open and responsible research on Code LLMs.
AI models for generating code are currently an active research area. In 2021, InfoQ covered OpenAI's Codex and GitHub's Copilot, which are based on GPT-3 language models fine-tuned on code stored in public GitHub repositories. Although these models perform quite well at generating code, they have been criticized for copyright violations. In late 2022, InfoQ covered a lawsuit against Microsoft and OpenAI that alleges copyright violations, including lack of attribution required by the licenses of the included source code.
One goal of The Stack is to avoid these violations by only including source code with permissive licenses; that is, "those with minimal restrictions on how the software can be copied, modified, and redistributed." This includes MIT and Apache 2.0, but excludes "copyleft" licenses such as GPL, in part because copyleft advocates point out that models trained on GPL code could be considered "derivative works" that must themselves adopt the copyleft license.
Because excluding these repositories reduces the amount of training data, the BigCode team investigated whether this would reduce the performance of models trained on the dataset. They found that near-deduplicating the dataset, removing not only exact duplicates but also files that are very similar to one another, kept model quality competitive with Codex. When training the 1.1B parameter SantaCoder model, however, the team discovered that filtering the dataset to include only 5-star repositories reduced model quality "significantly."
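For readers curious what near-deduplication involves in practice, the following is a minimal sketch of a MinHash-based approach in Python; the library, similarity threshold, and tokenization shown here are illustrative assumptions rather than BigCode's exact pipeline.

```python
# Illustrative near-deduplication using MinHash signatures and
# locality-sensitive hashing (not BigCode's exact settings).
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from a file's unique whitespace-delimited tokens."""
    signature = MinHash(num_perm=num_perm)
    for token in set(text.split()):
        signature.update(token.encode("utf-8"))
    return signature

def near_deduplicate(files: dict[str, str], threshold: float = 0.85) -> list[str]:
    """Return the file names to keep, dropping any file whose estimated
    Jaccard similarity to an already-kept file exceeds the threshold."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for name, content in files.items():
        signature = minhash_of(content)
        if not lsh.query(signature):  # no sufficiently similar file kept so far
            lsh.insert(name, signature)
            kept.append(name)
    return kept
```

The advantage of this style of approach is that each file is compared against the corpus via a compact signature and an index lookup, rather than pairwise against every other file, which keeps near-deduplication tractable at multi-terabyte scale.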
Thomas Wolf, co-founder of HuggingFace, joined a Twitter discussion about SantaCoder. In response to a user's complaint about the quality of code generated by the model, Wolf replied:
It’s a completion model, not (yet) an instruct fine tuned model so you should formulate your task as a completion task. For instance by writing your prompt as a code comment or docstring.
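In practice, this means writing a prompt the model can continue, such as a function signature plus docstring. Below is a minimal sketch using the Hugging Face transformers library; the checkpoint name and generation settings are assumptions based on the public Hub release, not an official BigCode recipe.

```python
# Completion-style prompting of SantaCoder: phrase the task as code to be
# finished rather than as a natural-language instruction (checkpoint name assumed).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/santacoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

# The task is expressed as a signature and docstring; the model completes the body.
prompt = '''def fibonacci(n):
    """Return the n-th Fibonacci number."""
'''
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```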
Both The Stack dataset and the SantaCoder model are available on HuggingFace.
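For those who want to inspect the data without downloading the full multi-terabyte corpus, the sketch below streams a single-language subset using the datasets library; the dataset ID, the per-language data_dir layout, and the "content" field name are assumptions based on the published dataset card.

```python
# Stream a single-language slice of The Stack instead of downloading it all.
from datasets import load_dataset

ds = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",   # assumed per-language directory layout
    split="train",
    streaming=True,
)
for example in ds.take(3):
    print(example["content"][:200])  # "content" field assumed from the dataset card
```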