A full snapshot of more than 2.8 million open source projects hosted on GitHub is now available in Google’s BigQuery, Google and GitHub announced. This will make it possible to query almost 2 billion source files hosted on GitHub using SQL.
GitHub’s BigQuery dataset is based on the GitHub Archive Project, which aims to take snapshots of GitHub at specific points in time and to store them and make them accessible for further analysis. Thanks to GitHub’s BigQuery dataset, the content of the GitHub Archive Project is now readily available through arbitrary SQL-like queries.
According to Arfon Smith, program manager for open source data at GitHub, the new BigQuery dataset could be used, for example, to find out which Go packages are most commonly used or which US schools have the most open source contributors. He also says that it can be useful to researchers studying open source communities or the latest trends in development.
Google developer advocate Felipe Hoffa adds a few more examples of possible uses, such as finding every project that uses a given open source library, or analyzing how that library is being used in order to inform decisions about its future development.
In a post on Medium, Hoffa lists a few queries that have been created by Google engineers and others to analyze Go programs and to find the most used Java imports, the top Angular directives, and the top Emacs packages.
GitHub’s BigQuery dataset contains about 1.5 TB of data and is automatically updated every hour. To get started with it:
- Log in to the Google Developer Console
- Create a project
- Activate the BigQuery API
- Open GitHub’s public dataset and execute a query.
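The last step can also be done programmatically. The following is a minimal Python sketch, not an official recipe: it assumes the `google-cloud-bigquery` client library is installed and credentials are configured, that the extract is addressed in standard SQL form (dots and backticks instead of the colon notation), and that the table exposes a text column named `content`. The function names are hypothetical and chosen only for illustration.

```python
# Sample extract of the GitHub dataset, written in standard SQL form.
SAMPLE_CONTENTS = "bigquery-public-data.github_repos.sample_contents"


def build_count_query(table: str = SAMPLE_CONTENTS) -> str:
    """Return standard SQL that counts files whose text contains @needle."""
    # Assumption: the table has a text column called `content`.
    return (
        "SELECT COUNT(*) AS matching_files\n"
        f"FROM `{table}`\n"
        "WHERE STRPOS(content, @needle) > 0"
    )


def count_files_containing(needle: str) -> int:
    """Run the query; requires google-cloud-bigquery and valid credentials."""
    # Imported lazily so the query builder works without the library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            # Passing the search string as a parameter avoids manual quoting.
            bigquery.ScalarQueryParameter("needle", "STRING", needle)
        ]
    )
    rows = client.query(build_count_query(), job_config=job_config).result()
    return next(iter(rows))["matching_files"]
```

Calling `count_files_containing("import React")`, for instance, would return the number of sampled files that mention that string. Each run counts against the project's query quota.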
Google provides 1 TB of data processed per month free of charge, but, as Hoffa warns, a single query against the main dataset (bigquery-public-data:github_repos.contents) will consume the free terabyte. Instead, he suggests using the 23 GB official extract (bigquery-public-data:github_repos.sample_contents) or any of the language-focused extracts that Google is making available for popular languages such as Go, Ruby, JavaScript, PHP, Python, and Java. BigQuery can also be used to create custom datasets, but in this case the user will be charged for their storage.
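To make the quota arithmetic concrete, here is a rough Python sketch. It assumes 1 TB means 10^12 bytes and that a query scans the whole table, which is a worst case. The helper names are hypothetical; `estimate_bytes` relies on the dry-run mode of the `google-cloud-bigquery` client, which reports the bytes a query would process without running it.

```python
FREE_QUOTA_BYTES = 10**12  # assumption: the free 1 TB is counted as 10^12 bytes


def quota_fraction(bytes_processed: int, quota_bytes: int = FREE_QUOTA_BYTES) -> float:
    """Share of the monthly free quota that one query would consume."""
    return bytes_processed / quota_bytes


def full_scans_within_quota(table_bytes: int, quota_bytes: int = FREE_QUOTA_BYTES) -> int:
    """How many complete scans of a table fit into the free quota."""
    return quota_bytes // table_bytes


def estimate_bytes(sql: str) -> int:
    """Ask BigQuery how many bytes a query would process, without running it."""
    # Imported lazily so the arithmetic helpers work without the library.
    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)  # a dry run does not execute the query
    return job.total_bytes_processed
```

Under these assumptions, `full_scans_within_quota(23 * 10**9)` gives 43 full scans of the 23 GB extract per month, while the same call with the 1.5 TB main dataset gives 0, which matches Hoffa's warning.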
Google BigQuery Public Datasets is a collection of datasets that Google makes available through BigQuery under a special plan where users are charged only for the queries they perform, not for dataset storage. Other datasets available among Google BigQuery Public Datasets include USA names, Hacker News stories and comments since 2006, global climatology data between 1929 and 2016, and more.