A team of researchers recently presented their paper on KiloGram, a new algorithm for extracting large n-grams from files to improve machine-learning detection of malware. The new algorithm is 60x faster than previous methods and can handle n-grams for n=1024 or higher. These large values of n also have applications in interpretable malware analysis and signature generation.
In the paper presented at the KDD 2019 Workshop on Learning and Mining for Cybersecurity, researchers from the University of Maryland and cybersecurity company Endgame described their algorithm for finding the most frequent n-grams in a large dataset of files. Previous methods encounter "exponential cost" in memory and runtime when increasing the size of n and are constrained to values of n less than eight when analyzing datasets with more than a few hundred thousand files. By contrast, the KiloGram algorithm is able to extract n-grams from 5TB of data in millions of files, while using only 9GB of RAM, and the "runtime does not increase with n." This allows the algorithm to extract n-grams for large values of n, to test whether those n-grams provide better accuracy for machine-learning algorithms.
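The article does not reproduce the algorithm itself, but the general idea behind keeping memory constant as n grows, hashing n-grams into a fixed number of buckets and only tracking the hottest buckets exactly, can be sketched as follows. This is a simplified illustration of hash-bucketed top-k counting, not the authors' published implementation; all names and parameters here are made up for the example.

```python
from collections import Counter

# Simplified illustration of memory-bounded frequent n-gram counting:
# hash every n-gram into a fixed number of buckets (pass 1), then count
# exactly only the n-grams that land in the most frequently hit buckets
# (pass 2). Bucket count is fixed, so memory does not grow with n.
# This is NOT the KiloGram code, just a sketch of the hashing idea.

NUM_BUCKETS = 1 << 16  # fixed memory budget, independent of n

def bucket(ngram: bytes) -> int:
    return hash(ngram) % NUM_BUCKETS

def frequent_ngrams(files, n, top_buckets=1024):
    # Pass 1: count hits per bucket, never storing raw n-grams.
    bucket_counts = Counter()
    for data in files:
        for i in range(len(data) - n + 1):
            bucket_counts[bucket(data[i:i + n])] += 1
    hot = {b for b, _ in bucket_counts.most_common(top_buckets)}
    # Pass 2: exact counts, restricted to n-grams in hot buckets.
    counts = Counter()
    for data in files:
        for i in range(len(data) - n + 1):
            g = data[i:i + n]
            if bucket(g) in hot:
                counts[g] += 1
    return counts
```

Because pass 1 stores at most `NUM_BUCKETS` integers and pass 2 keeps only candidates from the hot buckets, peak memory is governed by the bucket and candidate limits rather than by n or the total number of distinct n-grams.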
An n-gram is a unique sequence of n items, and the idea is used in many machine-learning tasks, particularly natural-language processing (NLP). In the case of malware detection, the n-grams are sequences of bytes from a file that is to be classified as either malware or benign. Early work in malware detection suggested that larger n-grams, say n=15 or 20, would be ideal for training detection systems, but the size of modern datasets makes the use of values of n larger than six too costly. Since the KiloGram algorithm can handle those larger values, the research team was able to test the idea that larger values are better.
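For byte-level n-grams, extraction amounts to sliding a length-n window over a file's raw bytes. A minimal sketch (the sliding window is illustrative, not part of the KiloGram algorithm):

```python
# Minimal sketch of byte n-gram extraction: slide a length-n window over
# the file contents, stepping one byte at a time, so windows overlap.

def byte_ngrams(data: bytes, n: int):
    """Yield every length-n byte sequence (n-gram) in data."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]

# Example on the first bytes of a (truncated, made-up) PE header:
sample = b"MZ\x90\x00\x03\x00"
grams = list(byte_ngrams(sample, 4))  # 3 overlapping 4-grams
```

A file of L bytes yields L - n + 1 overlapping n-grams, which is why the number of distinct n-grams, and with naive methods the memory cost, explodes as n grows across millions of files.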
Using several datasets of executable files and Adobe PDF documents, the team trained an Elastic-Net regularized logistic regression classifier to detect malware; the input features were the n-grams extracted using the KiloGram algorithm. Contrary to suggestions in the literature, the researchers found that "predictive accuracy does not increase beyond n = 8." Larger n-grams produce models with decreased accuracy; however, they have the advantage of interpretability. While the smaller n-grams produce "black-box" models, the larger n-gram feature set contains byte sequences that can have meaning to an analyst; for example, they may represent snippets of code or strings of text.
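The modeling setup can be sketched end to end: represent each file by binary presence features over a chosen set of frequent n-grams, then fit logistic regression with an Elastic-Net (combined L1 + L2) penalty. The tiny gradient-descent trainer below is only a self-contained illustration of that setup; the vocabulary, hyperparameters, and training loop are assumptions for the example, not details from the paper.

```python
import math

# Sketch: binary n-gram presence features + logistic regression with an
# Elastic-Net penalty, trained by plain gradient descent. Illustrative
# only; vocabulary and hyperparameters are made up, not from the paper.

def featurize(data: bytes, vocab):
    """1.0 if the n-gram occurs anywhere in the file, else 0.0."""
    return [1.0 if g in data else 0.0 for g in vocab]

def train(X, y, l1=0.01, l2=0.01, lr=0.5, epochs=200):
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj
        for j in range(len(w)):
            # Elastic-Net: L2 gradient plus L1 subgradient.
            sign = 1 if w[j] > 0 else -1 if w[j] < 0 else 0
            w[j] -= lr * (grad[j] / len(X) + l2 * w[j] + l1 * sign)
    return w

# Toy usage: two hypothetical n-grams, four tiny "files" with labels.
vocab = [b"ab", b"cd"]
X = [featurize(d, vocab) for d in [b"ab", b"abx", b"cd", b"cdy"]]
y = [1, 1, 0, 0]  # 1 = malware, 0 = benign (made-up labels)
w = train(X, y)
```

The L1 part of the penalty drives irrelevant n-gram weights toward zero, which matters when the feature set contains thousands of candidate n-grams.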
According to the researchers, larger n-grams are not as accurate when used in regression models because they are more specific to a particular malware attack; in effect, they cause overfitting. However, when used in signature models, such as Yara, they have the advantage of low false-positive rates; that is, although a Yara rule may miss some malware, incorrectly marking more files as benign, when it does indicate a file is malware, it is seldom wrong. This makes the KiloGram algorithm useful in building a layered system that combines both machine-learning and signature-based models.
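To make the signature-generation use concrete, a large byte n-gram can be dropped directly into the hex-string body of a Yara rule. The helper below is a hypothetical illustration (the rule name and example bytes are made up), showing only standard Yara hex-string syntax:

```python
# Hypothetical helper (not from the paper): render a byte n-gram as a
# Yara rule using standard hex-string syntax, so the n-gram can act as
# a low-false-positive signature. Rule name and bytes are made up.

def yara_rule_from_ngram(name: str, ngram: bytes) -> str:
    hex_body = " ".join(f"{b:02X}" for b in ngram)
    return (
        f"rule {name}\n"
        "{\n"
        "    strings:\n"
        f"        $gram = {{ {hex_body} }}\n"
        "    condition:\n"
        "        $gram\n"
        "}\n"
    )

print(yara_rule_from_ngram("suspicious_loader", b"\x4d\x5a\x90\x00"))
```

A 1024-byte n-gram rendered this way matches only files containing that exact sequence, which is why such signatures rarely fire on benign files.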
In a discussion on Hacker News, lead author Edward Raff answered several questions about the algorithm. He noted that their code is not open-sourced at this time, as "the government owns the code." He also noted that KiloGrams might be useful in other problem domains:
Bioinformatics is one of the application areas we thought this might be useful! I've chatted with a few, and it appears the applications right _now_ might be more limited due to the need for exact matches. It's something we are thinking about for the future, and think it could be useful. There may also be some applications for this in NLP. I'm chatting with a professor who thought this could help speed up some old algorithms that weren't scalable before.