
Facilitating the Spread of Knowledge and Innovation in Professional Software Development


GitHub's Learnings from Building Copilot, an Enterprise LLM Application

GitHub has published an article describing the lessons it learned in building and scaling GitHub Copilot, an enterprise application powered by a Large Language Model (LLM).

In a post on GitHub's blog, AI product leader Shuyin Zhao describes how, over three years, the team broke the project down into three stages: "find it", "nail it", and "scale it", and successfully launched GitHub Copilot.

In the "find it" stage, GitHub focused on identifying a specific problem that AI could solve effectively: one focused enough to be brought to market quickly, yet big enough to make an impact.

This meant becoming clear about exactly who the problem was intended to help: the team prioritized helping developers write code faster and with less context switching. Further, the team chose to focus on just one part of the SDLC, coding functions within the IDE, whilst staying realistic about the capabilities of the LLMs of the time. This allowed the team to concentrate on having the tool make code suggestions rather than generate entire commits. The team also committed to ensuring the tool enhanced existing tools rather than requiring developers to change their workflows.

"We have to design apps not only for models whose outputs need evaluation by humans, but also for humans who are learning how to interact with AI."

- Idan Gazit, senior director of research for GitHub Next

The "nail it" stage involved iterative product development, emphasising real feedback from developers gathered through A/B testing. This enabled quick iterations, allowing teams to fail and learn quickly. After a brief experiment with a web interface to Copilot for working with foundation models, the team refocused on the IDE to reduce task switching between the editor and web browser, with the AI capability working in the background. A further iteration made GitHub Copilot work across multiple files simultaneously, based on observing that developers reference multiple open IDE tabs when coding.
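
The A/B testing described above can be sketched with a simple deterministic bucketing function. This is an illustrative helper, not GitHub's actual implementation: hashing the user and experiment name together gives a stable, roughly uniform split without storing assignments.

```python
import hashlib

def ab_bucket(user_id: str, experiment: str,
              variants: tuple = ("control", "treatment")) -> str:
    """Deterministically assign a user to an A/B variant.

    The same user always lands in the same bucket for a given
    experiment, so feature behaviour stays consistent between sessions.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

Deterministic hashing avoids a server-side assignment store while still producing an approximately even split across users.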

With the generative AI field advancing rapidly, the team was careful to revisit past decisions, as improvements in both the technology and users' familiarity with it sometimes rendered those decisions obsolete. This meant reinvigorating ideas such as providing interactive chat, and reversing course despite the sunk cost fallacy, for example abandoning a plan to build an AI model for each programming language when it became apparent that improvements in the LLM allowed one model to handle many languages.

Finally, in the "scale it" stage, the team optimized the application for general availability (GA) by ensuring consistent results from the AI model, managing user feedback, and defining key performance metrics. They also prioritized security and responsible AI use, implementing filters to avoid suggesting insecure or offensive code.

Work to optimize quality and reliability included mitigating LLMs' probabilistic nature, where answers can be unpredictable and vary from one query to the next. Tactics to tackle this included changing the parameters sent to the LLM to reduce the randomness of the responses, and caching frequent responses to reduce variance and also improve performance.
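
As a sketch of these two tactics, the stub below wraps a low-randomness request in a cache. The `complete` function and its `temperature` parameter are placeholders for whatever LLM client an application actually uses; the article does not name GitHub's specific parameters.

```python
import functools

def complete(prompt: str, temperature: float = 0.0) -> str:
    """Hypothetical LLM call; a canned response stands in for a real client.

    A temperature near zero narrows the model's token sampling toward
    its most likely output, reducing run-to-run variation.
    """
    return f"suggestion for: {prompt}"

@functools.lru_cache(maxsize=4096)
def cached_complete(prompt: str) -> str:
    # Caching frequent prompts makes repeated queries both consistent
    # (the same answer every time) and faster (no model round trip).
    return complete(prompt, temperature=0.0)
```

Identical prompts then return identical responses without re-invoking the model, which addresses both the variance and the latency concerns mentioned above.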

Using a waitlist, GitHub managed the influx of early users onto the technical preview. This meant the team could manage the comments and questions from a small, varied group of early adopters. In-depth analysis of real user feedback allowed the GitHub team to identify problematic updates and to evolve the product's key performance metrics, such as how much code generated by Copilot is kept by developers.
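
A metric like "how much suggested code is kept" can be expressed as a simple ratio. The function below is an illustrative sketch, not GitHub's actual formula:

```python
def acceptance_rate(suggested_chars: int, retained_chars: int) -> float:
    """Fraction of suggested code characters that developers kept.

    Returns 0.0 when nothing was suggested, avoiding division by zero.
    """
    if suggested_chars == 0:
        return 0.0
    return retained_chars / suggested_chars
```

Tracked over time and per release, a drop in this ratio can flag a problematic update before qualitative feedback surfaces it.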

Finally, the team prioritized its duty to ensure that generated code is secure, developing filters to reject suggestions that may introduce security problems such as SQL injection. Community outreach also raised issues such as Copilot suggestions matching publicly available code, which may have licensing or other implications. The team implemented a code reference tool to allow developers to make informed decisions.
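
A suggestion filter of this kind can be sketched as pattern matching over candidate completions. The regexes below are deliberately simplistic examples (flagging SQL strings built by concatenation or `%`-formatting), far short of a production security filter:

```python
import re

# Illustrative patterns for obviously injection-prone SQL construction.
RISKY_PATTERNS = [
    re.compile(r'execute\(.*["\']\s*\+'),   # "..." + user_input concatenation
    re.compile(r'execute\(.*["\']\s*%\s'),  # "..." % user_input formatting
]

def looks_insecure(suggestion: str) -> bool:
    """Return True if a suggestion matches a known risky pattern."""
    return any(p.search(suggestion) for p in RISKY_PATTERNS)

def filter_suggestions(suggestions: list) -> list:
    """Drop suggestions that match risky patterns before showing them."""
    return [s for s in suggestions if not looks_insecure(s)]
```

Parameterized queries (e.g. `execute("... WHERE id = %s", (uid,))`) pass through, while string-built SQL is rejected before it ever reaches the developer.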

For the go-to-market strategy, the team presented the technical preview to several influential community members and also targeted individual users rather than businesses. This helped to garner a broad range of support for the tool on launch, from which enterprise adoption would follow.

The key takeaways include focusing on a specific problem, integrating experimentation and user feedback, and prioritizing user needs as the application scales.

With the adoption of generative AI still in its early days, GitHub is keeping a close watch on the demand and need for tools using generative AI. Read the full article on GitHub's blog.
