A class-action lawsuit has been filed in a US federal court challenging the legality of GitHub Copilot and the related OpenAI Codex. The suit against GitHub, Microsoft, and OpenAI claims violation of open-source licenses and could have a wide impact in the world of artificial intelligence.
GitHub previewed Copilot, an OpenAI-powered coding assistant, in the summer of 2021 and announced its general availability last July. Powered by the artificial intelligence model OpenAI Codex, the service is a cloud-based tool to assist developers in writing new code by analyzing existing code and comments on GitHub.
The litigation was filed by Matthew Butterick, a programmer and lawyer, together with the Joseph Saveri Law Firm, which specializes in antitrust and class-action litigation. According to the plaintiffs, by training their AI systems on public repositories, the defendants violated the rights of many developers who posted code under open-source licenses that require attribution, including the MIT, GPL, and Apache licenses.
In a previous article, Butterick questions how the service was trained with machine learning on billions of lines of code written by human programmers and argues that the solution should not be a new open-source license:
Some have suggested creating an open-source license that forbids AI training. But this kind of usage-based restriction has never been part of the open-source ethos. (...) By the same token, it does not make sense to hold AI systems to a different standard than we would hold human users. Widespread open-source license violations should not be shrugged off as an unavoidable cost.
Alex Champandard, artificial intelligence expert and co-founder of creative.ai, assesses the case:
Reading through the GitHub CoPilot litigation submitted; although it was pulled off quickly — it's a solid piece of work! The defendants (...) are in a very bad position. The documents show how Codex and CoPilot act like databases; they have three different examples of JS code that is recited verbatim — with mistakes — from licensed sources. (...) The documents then proceed to cast doubt on the claim of FairUse, that even if it was applicable here, it wouldn't help circumvent (a) the breach of contract, (b) the privacy issues, and (c) the DMCA.
In a Twitter thread, Giuseppe Bertone, developer advocate at Swirlds Labs, disagrees:
Developers are liable for what they use: their brain, copy from Stack Overflow, AI tools, pen & paper, etc. GitHub Copilot is just a tool - a toy, currently - like many others. Sue developers that use copyrighted code incorrectly, regardless of why and how they did it.
The litigation is considered the first class-action case challenging the training and output of AI systems, and its impact might extend well beyond Copilot. Microsoft and GitHub are not the only companies working on ML-powered coding assistants, with AWS unveiling the preview of Amazon CodeWhisperer earlier this year.
According to the Authors Alliance, the lawsuit raises important questions about how researchers can use AI to train and produce outputs using datasets based on copyrighted materials. Jeremy Daly, author of the weekly serverless newsletter Off-by-none, comments:
Who would have thought that AI-generated code that learned from private repositories would result in a lawsuit alleging "software piracy on an unprecedented scale"?
Butterick created a separate website with some background information about the case. GitHub, Microsoft, and OpenAI have not yet commented on the lawsuit.