Hugging Face and ServiceNow have partnered to develop StarCoder, a new open-source language model for code. Created as part of the BigCode initiative, the model is an improved version of StarCoderBase, further trained on 35 billion Python tokens. StarCoder is a free AI code-generating alternative to GitHub’s Copilot, DeepMind’s AlphaCode, and Amazon’s CodeWhisperer.
StarCoder was trained on over 80 programming languages as well as text from GitHub repositories, including documentation and Jupyter notebooks. The model has 15.5 billion parameters, was trained on over 1 trillion tokens, and has a context window of 8,192 tokens. It has outperformed larger models like PaLM, LaMDA, and LLaMA on code benchmarks, and has proven to be on par with or even better than closed models like OpenAI’s code-cushman-001.
StarCoder will not ship with as many features as GitHub Copilot, but because it is open source, the community can help improve it along the way and integrate custom models, Leandro von Werra, one of the co-leads of StarCoder, told TechCrunch.
StarCoder was trained on code from GitHub rather than on instructions, so it may not respond well to requests such as asking it to write a function that computes the square root. Nevertheless, given a suitable prompt, the model can be a helpful technical assistant. The model's Fill-in-the-Middle method uses special tokens to mark the prefix, middle, and suffix of the input and output. The pretraining dataset used to train the model includes only permissively licensed content. Even so, the model can reproduce source code from that dataset word for word, so it is crucial to follow any attribution requirements and other guidelines set forth by the code's license.
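To make the Fill-in-the-Middle idea concrete, here is a minimal sketch of how such a prompt might be assembled. It assumes the sentinel strings `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` used by the StarCoder tokenizer; check them against the tokenizer you actually load. The helper names `build_fim_prompt` and `splice_completion` are hypothetical and are not part of any library. No model is called here; the sketch only shows the text layout.

```python
# Sentinel tokens assumed from the StarCoder tokenizer; verify them against
# the tokenizer you load before relying on these exact strings.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Lay out the code before and after the gap in prefix-suffix-middle order.

    The model is expected to continue after the middle sentinel with the
    text that belongs between prefix and suffix.
    """
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def splice_completion(prefix: str, middle: str, suffix: str) -> str:
    """Reassemble the full source once a middle section has been generated."""
    return prefix + middle + suffix


if __name__ == "__main__":
    # Hypothetical example: the body of the function is the gap to fill.
    prefix = "def square_root(x):\n    "
    suffix = "\n    return result\n"
    print(build_fim_prompt(prefix, suffix))
```

The resulting string would be passed to the model as an ordinary prompt, and whatever it generates after the final sentinel is spliced back between the prefix and the suffix.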
The new VS Code plugin complements conversing with StarCoder during software development. Users can check whether the current code was included in the pretraining dataset by pressing CTRL+ESC.
Like other LLMs, StarCoder has limitations that can result in the production of incorrect, discourteous, deceitful, ageist, sexist, or stereotypical content. The model is available under the OpenRAIL-M license, which sets legally binding restrictions on its use and modification. In addition, researchers evaluated StarCoder's coding capabilities and natural language understanding only on English-language benchmarks. To expand the applicability of these models, additional research into the effectiveness and limitations of Code LLMs in other natural languages is necessary.
AI-powered coding tools can significantly reduce development expenses while freeing up developers to work on more imaginative projects. According to University of Cambridge research, engineers spend at least half of their time debugging rather than actively programming, which is estimated to cost the software industry $312 billion annually.