By applying machine learning techniques to its rule-based security code scanning capabilities, GitHub hopes to extend them to less common vulnerability patterns, automatically inferring new rules from the existing ones.
GitHub Code Scanning uses carefully defined CodeQL analysis rules to identify potential security vulnerabilities lurking in source code.
To detect situations in which unsafe user data ends up in a dangerous place, CodeQL queries encapsulate knowledge of a large number of potential sources of user data (for example, web frameworks), as well as potentially risky sinks (such as libraries for executing SQL queries).
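As an illustration of this source-to-sink pattern, the following snippet (written in Python for brevity, although the feature itself targets JavaScript and TypeScript) lets untrusted user data flow unsanitized into a SQL query:

```python
# Illustrative source-to-sink flow of the kind CodeQL queries detect.
import sqlite3
from flask import Flask, request

app = Flask(__name__)

@app.route("/user")
def get_user():
    user_id = request.args["id"]  # source: untrusted user input from a web framework
    conn = sqlite3.connect("app.db")
    # tainted data is interpolated directly into the query string
    query = f"SELECT * FROM users WHERE id = {user_id}"
    return str(conn.execute(query).fetchall())  # sink: SQL execution (CWE-89)
```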
Manually creating those rules requires security experts to analyze existing libraries as well as private code to identify vulnerability patterns. Given the sheer number of existing libraries, this is clearly a daunting task. Machine learning could help here, says GitHub, by making it possible to train a model to recognize vulnerable code based on a large number of samples.
Based on its experiments, GitHub found that supervised learning, where experts label each code snippet as either vulnerable or safe, works better than unsupervised learning. Instead of asking experts to label millions of snippets, though, GitHub leverages existing CodeQL rules as a ground-truth oracle that can tell whether a given code snippet is secure or not. According to GitHub, this makes it possible to automatically label tens of millions of code snippets from over a hundred thousand public repositories with little effort. The resulting data is then used as a training set to build a predictive model.
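A minimal sketch of what such oracle-based labeling could look like, assuming a harness that runs the existing CodeQL queries over a repository and returns the set of flagged locations (the Snippet type and the set of flags are hypothetical, not a published GitHub API):

```python
from dataclasses import dataclass

@dataclass
class Snippet:
    location: str  # e.g. "src/db.js:42" (hypothetical format)
    text: str

def label_snippets(snippets: list[Snippet],
                   flagged_locations: set[str]) -> list[tuple[Snippet, int]]:
    """Label snippets using CodeQL results as the ground-truth oracle.

    `flagged_locations` is assumed to come from running the existing
    CodeQL queries over the repository: 1 = vulnerable, 0 = safe.
    """
    return [(s, 1 if s.location in flagged_locations else 0) for s in snippets]
```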
To evaluate whether this model is actually able to predict new vulnerabilities, and not just those already captured by the CodeQL rules used as oracles, GitHub has devised a clever solution. It consists of training the model on labels generated using an older set of CodeQL rules and then testing it against vulnerabilities detected by a newer set of CodeQL rules. Assuming that the newer rule set detects more vulnerabilities than the old one, this approach can show whether the predictive model has effectively learned to detect vulnerabilities that were not already covered by the old set of rules.
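Continuing the sketch above, that evaluation could be framed roughly as follows, with old_flags and new_flags standing for the locations flagged by the two rule sets, and model assumed to expose a predict method returning a probability (all names hypothetical):

```python
def newly_detected(old_flags: set[str], new_flags: set[str]) -> set[str]:
    # vulnerabilities the newer rules find but the older ones miss
    return new_flags - old_flags

def evaluate(model, snippets: list[Snippet],
             old_flags: set[str], new_flags: set[str]) -> float:
    targets = newly_detected(old_flags, new_flags)
    # count how many of those genuinely new vulnerabilities the model flags
    hits = [s for s in snippets
            if s.location in targets and model.predict(s) >= 0.5]
    # recall on vulnerabilities the training oracle could not have labeled
    return len(hits) / max(len(targets), 1)
```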
A significant aspect of the whole process lies in how GitHub identifies features in a code snippet. Here again, GitHub leverages the power of CodeQL rules, which encapsulate rich information. Instead of using NLP techniques to treat code as plain text, GitHub can identify features such as the access path, the API name, the enclosing function body, and so on. Furthermore, besides clearly meaningful features, GitHub can explore potentially interesting features whose usefulness is less evident to the human eye, such as the argument index of a function call.
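The extracted features for a single snippet might look roughly like the following record; the field names mirror the examples above, while the values are invented for illustration:

```python
# Hypothetical shape of the features extracted for one code snippet.
snippet_features = {
    "api_name": "mysql.query",              # name of the API being called
    "access_path": "req.query.userId",      # how the tainted data is reached
    "enclosing_function": "function getUser(req, res) { ... }",
    "arg_index": 0,                         # position of the tainted argument
}
```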
In GitHub's own words: "We generate a vocabulary from the training data and feed lists of indices into the vocabulary into a fairly simple deep learning classifier, with a few layers of feature-by-feature processing followed by concatenation across features and a few layers of combined processing."
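A minimal Keras sketch of an architecture matching that description, with per-feature embedding and dense layers followed by concatenation and shared layers; all sizes and feature names are assumptions, as GitHub has not published the exact model:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 10_000  # assumed vocabulary size
EMBED_DIM = 32       # assumed embedding width
FEATURES = ["api_name", "access_path", "enclosing_function", "arg_index"]

inputs, branches = [], []
for name in FEATURES:
    # each feature arrives as a variable-length list of vocabulary indices
    inp = layers.Input(shape=(None,), dtype="int32", name=name)
    x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inp)
    x = layers.GlobalAveragePooling1D()(x)      # pool the tokens of one feature
    x = layers.Dense(32, activation="relu")(x)  # feature-by-feature processing
    inputs.append(inp)
    branches.append(x)

x = layers.Concatenate()(branches)              # concatenation across features
x = layers.Dense(64, activation="relu")(x)      # combined processing
x = layers.Dense(64, activation="relu")(x)
output = layers.Dense(1, activation="sigmoid")(x)  # probability of "vulnerable"

model = Model(inputs=inputs, outputs=output)
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
)
```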
At prediction time, a code snippet is transformed into a set of features using CodeQL, which are then fed to the ML model to obtain the probability that the snippet represents a vulnerability.
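Continuing the sketch above, prediction could then look like this, with a toy vocabulary standing in for the real one:

```python
import numpy as np

vocab = {"mysql.query": 1, "req.query.userId": 2}  # toy vocabulary (assumed)

def encode(feature_value: str) -> np.ndarray:
    # map one feature value to a (batch, sequence) array of vocabulary indices
    return np.array([[vocab.get(feature_value, 0)]], dtype="int32")

probability = model.predict([
    encode("mysql.query"),        # api_name
    encode("req.query.userId"),   # access_path
    encode("function getUser"),   # enclosing_function
    encode("0"),                  # arg_index
])[0, 0]
print(f"P(vulnerable) = {probability:.2f}")
```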
The current implementation of ML-based code scanning focuses on some of the most common vulnerabilities in JavaScript and TypeScript, including cross-site scripting (CWE-79), path injection (CWE-22, CWE-23, CWE-36, CWE-73, CWE-99), NoSQL injection (CWE-943), and SQL injection (CWE-89). Based on current data, GitHub says metrics vary by query, with a recall of approximately 80% at a precision of approximately 60%.
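To put those figures in perspective, a quick back-of-the-envelope calculation shows what 80% recall at 60% precision would imply for a hypothetical codebase containing 100 real vulnerabilities:

```python
# Illustrative interpretation of the reported metrics (numbers are made up).
true_vulns = 100
recall, precision = 0.80, 0.60
found = true_vulns * recall             # 80 real vulnerabilities flagged
total_alerts = found / precision        # ~133 alerts raised in total
false_positives = total_alerts - found  # ~53 alerts are false alarms
print(found, round(total_alerts), round(false_positives))
```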
This new feature is available experimentally and is integrated with GitHub Actions.