As announced in VentureBeat, Yahoo has open-sourced DataSketches, a Java library of stochastic streaming algorithms. DataSketches can perform traditionally expensive operations, like counting the distinct occurrences of a variable within a stream, using a fraction of the time and memory and with a predictable error margin.
As indicated in their technical blog, Yahoo has been using DataSketches internally to improve the performance of a number of their products, including Flurry. DataSketches is based on the concept of a sketch, a "summary" of a stream where each update is handled in the same way regardless of the history of previous updates. This concept is at the core of the performance of DataSketches, since traditional stream processing needs to keep a history that grows over time. For instance, when counting unique occurrences, each newly seen value has to be saved, increasing the time it takes to check later values for uniqueness; the cost of an update therefore grows as the stream progresses. A sketch, on the other hand, is constructed so that only a fixed amount of information needs to be kept, which means that all updates are performed in exactly the same way.
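To give a feel for the library, here is a minimal sketch of how a theta sketch can be used to count distinct visitors, assuming the API published under the `com.yahoo.sketches` package at the time of the open-source release (the project's package names may differ in later versions); the visitor ids are made up for illustration:

```java
import com.yahoo.sketches.theta.UpdateSketch;

public class SketchExample {
    public static void main(String[] args) {
        // Build a theta sketch; its memory footprint is bounded
        // regardless of how long the stream runs.
        UpdateSketch sketch = UpdateSketch.builder().build();

        // Feed the stream: every update is handled identically,
        // with no lookup against a growing history.
        for (int i = 0; i < 1_000_000; i++) {
            sketch.update("visitor-" + (i % 250_000)); // 250,000 distinct visitors
        }

        // Estimate of the distinct count, with a predictable error margin.
        System.out.println("Estimate: " + sketch.getEstimate());
        System.out.println("Lower bound (2 std dev): " + sketch.getLowerBound(2));
        System.out.println("Upper bound (2 std dev): " + sketch.getUpperBound(2));
    }
}
```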
If we look into the science behind DataSketches, we find that it is based on the Theta-Sketch Framework, a generalisation of the KMV and Adaptive Sampling algorithms. The associated paper offers a formal description and proofs of its properties, but here we will offer a simplified, more intuitive description.
Let's try to understand this problem by putting it in the context of calculating unique visitors to a website in real time. The main problem with counting the distinct occurrences of a variable within a stream of data is that a copy of each known distinct value of the variable needs to be kept. In addition, each new instance of the variable (e.g., each new visit to the website) needs to be checked against the list of known distinct values to determine whether it constitutes a new or an existing visitor. This means that, assuming the number of unique visitors is N, the system will have a memory requirement of O(N), and each hit to the website will take O(log N) to check whether it comes from a new visitor.
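A hypothetical exact counter makes these costs concrete; the class and method names below are illustrative only:

```java
import java.util.TreeSet;

public class ExactDistinctCount {
    // Exact counting: every distinct visitor id must be retained,
    // so memory grows as O(N) with the number of unique visitors.
    private final TreeSet<String> seen = new TreeSet<>();

    public void hit(String visitorId) {
        // Balanced-tree lookup and insert: O(log N) per page hit.
        seen.add(visitorId);
    }

    public long uniqueVisitors() {
        return seen.size();
    }
}
```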
The strategy followed by KMV (K-th Minimum Value) is based on storing a small, fixed number of values, k, from which N can be estimated with a fixed margin of error. The values to store are calculated using a hash function that maps the variable being measured (the visitor's identity in our example) to a number in the range 0-1; it doesn't really matter what this hash function is, as long as its results are uniformly distributed over that range. For every new instance of the variable being measured, we calculate its hash and store it if it isn't stored already. The key difference is that, at any moment, only the k lowest hashes are kept: if a new value enters the group, the (k+1)-th lowest is evicted, ensuring that the memory cost stays at O(k) and the time cost at O(log k). With this in place, the number of distinct occurrences is estimated as (k-1)/KMV, with KMV being the k-th minimum value, i.e. the highest hash that has survived within the stored group.
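The following is an illustrative Java sketch of the KMV idea just described, not the library's actual implementation; in particular, the hash function is a simple stand-in for any hash with a uniform distribution over 0-1:

```java
import java.util.TreeSet;

public class KmvEstimator {
    private final int k;
    // Sorted set holding at most the k smallest hashes seen so far.
    private final TreeSet<Double> smallest = new TreeSet<>();

    public KmvEstimator(int k) {
        this.k = k;
    }

    // Map the raw value to a hash roughly uniform in (0, 1).
    // Illustrative only; any well-distributed hash function works.
    private static double hash01(String value) {
        long h = value.hashCode() * 0x9E3779B97F4A7C15L;
        return (double) (h >>> 11) / (double) (1L << 53);
    }

    public void update(String value) {
        double h = hash01(value);
        // Skip hashes that cannot enter the k smallest.
        if (smallest.size() == k && h >= smallest.last()) {
            return;
        }
        if (smallest.add(h) && smallest.size() > k) {
            smallest.pollLast(); // evict the (k+1)-th lowest: memory stays O(k)
        }
    }

    public double estimate() {
        if (smallest.size() < k) {
            return smallest.size(); // fewer than k distinct hashes: count is exact
        }
        return (k - 1) / smallest.last(); // (k-1) / k-th minimum value
    }
}
```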
By inspecting the resulting expression, it is easy to deduce what happens if we compare two streams of data, one with more distinct occurrences than the other: the stream with more distinct occurrences will produce more hashed values, and therefore its k-th lowest hash will be lower than the other stream's. A lower k-th hash, for the same k, results in a higher value of the expression above, which shows that the estimate grows with the actual number of distinct occurrences.
A formal demonstration that the expression above is a good estimator is available in several research papers, but anecdotal evidence can be obtained with a simple experiment. Let's assume a stream of data with 199 unique occurrences, and that we are using k=20 in our algorithm. Assuming a hash function that distributes the results evenly over the range 0-1, those 199 unique occurrences will map approximately to 0.005, 0.01, 0.015, and so on up to 0.995. If we keep the 20 lowest values, the 20th one will be 0.1, which, applied to the expression above, yields (20-1)/0.1 = 190, reasonably close to the actual 199.
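This back-of-envelope experiment can be reproduced with a few lines of code; the idealised, evenly spaced hashes below stand in for the output of a uniform hash function:

```java
public class KmvExperiment {
    public static void main(String[] args) {
        int k = 20;
        int distinct = 199;
        // 199 hashes spread evenly over (0, 1): 0.005, 0.010, ..., 0.995.
        double[] hashes = new double[distinct];
        for (int i = 0; i < distinct; i++) {
            hashes[i] = (i + 1) / 200.0;
        }
        // Keeping the k smallest values, the k-th minimum is 20/200 = 0.1.
        double kthMin = hashes[k - 1];
        double estimate = (k - 1) / kthMin;
        System.out.println("k-th minimum: " + kthMin); // 0.1
        System.out.println("estimate: " + estimate);   // 190.0 vs actual 199
    }
}
```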
Besides its performance, DataSketches has other useful properties, like the ability to combine sketches that have been calculated separately and obtain a combined result without having to go through the underlying data again. This allows users to compute sketches for individual groups or partitions of data, and then combine them on demand. DataSketches is available as a library in Maven Central, and as adaptors for Hadoop Pig, Hadoop Hive and Druid.
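A minimal sketch of this merging capability, again assuming the theta sketch API published under `com.yahoo.sketches` at the time of release (later versions of the library may name these operations differently), with per-day partitions invented for the example:

```java
import com.yahoo.sketches.theta.SetOperation;
import com.yahoo.sketches.theta.Sketch;
import com.yahoo.sketches.theta.Union;
import com.yahoo.sketches.theta.UpdateSketch;

public class MergeExample {
    public static void main(String[] args) {
        // One sketch per partition, e.g. per day of traffic.
        UpdateSketch day1 = UpdateSketch.builder().build();
        UpdateSketch day2 = UpdateSketch.builder().build();
        for (int i = 0; i < 100_000; i++) day1.update("user-" + i);
        for (int i = 50_000; i < 150_000; i++) day2.update("user-" + i);

        // Merge the summaries directly; the raw streams are never revisited.
        Union union = SetOperation.builder().buildUnion();
        union.update(day1);
        union.update(day2);

        Sketch merged = union.getResult();
        // True answer is 150,000; the estimate lands close to it,
        // within the sketch's stated error bounds.
        System.out.println("Distinct users across both days: " + merged.getEstimate());
    }
}
```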