Introduction
As with any Internet company with tens of millions of users, at MyHeritage we amass an enormous amount of data: anything from user engagement metrics to subscriptions and revenue.
Years ago, when we had several dozen charts, it was easy to manually monitor each chart at least daily, but as the number of charts grew to several thousand, dealing with all the data became increasingly challenging.
We don’t use this data only for analyzing how the business is progressing. We also use it as an important tool for detecting bugs and catching production problems at the earliest possible stage.
We had to come up with a way to quickly go over all the charts and spot anomalies as soon as they happened, and not just once a week (at best).
This article describes the general outline of the Stats Anomalies Detector we developed at MyHeritage and provides a detailed explanation of how to adapt the code (which will soon be available on the MyHeritage GitHub) to meet your company’s needs.
Figure 1: Sample of charts from MyHeritage dashboard
A background to anomaly detection
Anomaly detection is the identification of an item, or series of items, that does not conform to the other items in a series of data. In some cases a data anomaly might indicate welcome behavior (e.g. the number of visitors to a web site has increased due to a successful campaign), but in others it might indicate unwanted behavior such as hacking or a fraud attempt against a production web site or database.
Challenges in anomaly detection
Anomaly detection in practice is not as trivial as it appears, and it raises a few challenges:
1. Differentiation between welcome and unwelcome behavior
In some cases an “anomaly” in the dataset is welcome. Our development teams are highly satisfied when the home page load time decreases dramatically, though when the numbers are very low it becomes suspicious and we do want to treat that as an anomaly. On the other hand, an increase in the home page load time, even a slight one, should be reported (and later investigated by the development teams).
2. Relationships between datasets
There are cases where datasets are connected. For instance, there is a correlation between the number of items processed in a batch job and the number of failures during that processing. Suppose that on a single day the number of processed items increases; the number of failures is then also expected to increase, and we should not treat this increase as an anomaly, as long as the growth in the number of failures is linear in the growth of the number of processed items.
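In other words, what should be monitored is the failure rate rather than the absolute failure count. A hypothetical check in PHP might look like this (the function and parameter names, and the 10% tolerance, are assumptions for illustration, not MyHeritage code):

```php
<?php
// Hypothetical illustration: flag a failure anomaly only when the failure
// *rate* changes, not when failures grow in step with processing volume.
function isFailureAnomaly($processedToday, $failedToday, $historicalFailureRate, $tolerance = 0.1)
{
    $todayRate = ($processedToday > 0) ? $failedToday / $processedToday : 0;
    return abs($todayRate - $historicalFailureRate) > $tolerance;
}
```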
3. Choosing the best algorithm is not an easy task
There are several well-known algorithms for anomaly detection; some of them are highly sensitive to “noise” in the data, while others are “too tolerant” of abnormality. Selecting the right algorithm sometimes requires analyzing large datasets, human intervention to manually configure the algorithm for a specific dataset, and more.
Figure 2: An example of a chart that is challenging to analyze automatically
Techniques in anomaly detection
Anomaly detection techniques can be grouped into five categories. In this section we’ll briefly refer to each group (see the references for more detailed information).
Kernel methods:
Techniques in this category make use of the pairwise distance between time series. Such methods compute the inner products between the images of all pairs of data points in the feature space, which is often cheaper than the explicit computation of the coordinates.
Kernel methods can be thought of as instance-based learners: rather than learning some fixed set of parameters corresponding to the features of their inputs, they instead remember their training examples and learn weights for these.
Strengths and weaknesses:
The drawback of similarity-based techniques is that while they can determine whether an entire time series is anomalous, they cannot exactly locate the anomalous subsequence. To localize the exact region(s) within the time series that contain the anomaly, one needs to post-process the time series.
In detecting different types of anomalies, similarity-based techniques might fail when a single observation is anomalous in the time series, as its effect might not be prominent when the entire time series is considered at once.
The advantage of these techniques is that they can detect anomalies such as anomalous subsequences in a time series, or an anomalous time series as a whole.
Window methods:
Techniques in this category extract fixed length windows from a test time series, and assign an anomaly score to each window. The per-window scores are then aggregated to obtain the anomaly score for the test time series.
Strengths and weaknesses:
The drawback of window-based techniques is that the window size has to be chosen carefully so that it can explicitly capture the anomaly. The optimal size of the window depends on the length of the anomalous region in the anomalous time series.
Another drawback of window-based techniques is that they are computationally expensive and polynomial in the length of the dataset and training series.
The advantage is that window-based techniques can capture all the different kinds of anomalies mentioned earlier, since the dataset is broken into smaller sequences.
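As a toy illustration of the windowing idea (the window size and the distance-to-global-mean score are assumptions, standing in for a real per-window anomaly score):

```php
<?php
// Toy illustration of window-based scoring: slide a fixed-size window over
// the series, score each window, then aggregate the per-window scores.
function windowAnomalyScore(array $series, $windowSize = 7)
{
    if (count($series) < $windowSize) {
        return 0; // series too short to extract a single window
    }
    $globalMean = array_sum($series) / count($series);
    $scores = array();
    for ($i = 0; $i + $windowSize <= count($series); $i++) {
        $window = array_slice($series, $i, $windowSize);
        $windowMean = array_sum($window) / $windowSize;
        $scores[] = abs($windowMean - $globalMean); // per-window score
    }
    return max($scores); // aggregate: the worst window determines the score
}
```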
Predictive methods:
Predictive techniques use historical data to predict the future, using statistical models. When observed values do not match the predicted values, an anomaly is assumed.
Strengths and weaknesses:
As with window-based methods, the length of the history chosen is critical for exactly locating the anomaly: it affects both performance, which might be poor if the chosen segment is incorrect, and computation time, which increases significantly if the segment is too large.
All the prediction-based techniques use fixed-length histories. For the same time series, sometimes a shorter history is sufficient for prediction, but other times we might need a longer one. Thus, one can use a dynamic-length history wherein, if an observation cannot be predicted with high confidence given an m-length history, the history length is increased or decreased until the observation can be predicted with higher confidence.
Prediction-based techniques calculate an anomaly score for each of the observations in the time series. Hence they can capture all kinds of anomalies - an anomalous observation in a time series, an anomalous subsequence in a time series, or an anomalous time series as a whole.
Segmentation methods:
The basic approach for segmentation-based anomaly detection techniques is to learn a series of discrete states using a given time series and then define a Finite State Automaton on the states.
Strengths and weaknesses:
The assumption of segmentation-based techniques is that all the training time series can be partitioned into a group of homogeneous segments. In cases where this cannot be done, segmentation-based approaches might fail.
Segmentation-based techniques help capture all kinds of anomalies - an anomalous observation in a time series, an anomalous subsequence in a time series, or an anomalous time series as a whole.
HMM (Hidden Markov Model) methods:
The HMMs used to detect anomalies are designed and trained using Genetic Algorithms (GAs). The use of GAs helps automate the use of HMMs by liberating users from needing statistical knowledge. Instead, the HMM method takes sets of valid inputs as a training series, and uses these as the basis for subsequent decisions.
Strengths and weaknesses:
The issue with HMM-based techniques is their assumption that a hidden Markovian process exists which generates the normal time series. In the absence of an underlying Markovian process, these techniques might fail to capture the anomalies.
HMM-based techniques build a Markov model for the time series. Hence they estimate the anomalous behavior for each of the observations in the time series, which helps capture all kinds of anomalies (as long as the assumption holds): an anomalous observation in a time series, an anomalous subsequence in a time series, or an anomalous time series as a whole.
Anomaly detection at MyHeritage
Since there are many possible algorithms that can be used to detect anomalies in a time series (a couple of well-known examples are standard deviation calculations and the Holt-Winters method), this was a classic case for the strategy pattern. Instead of doing complex research, and under the basic guideline of getting something working quickly and then creating a more complex solution further down the road, we started with a simple algorithm (sketched in code after the list):
- Fetch an array, say of size N, of data for a single chart.
- Declare its most recent, say m, data points to be the “new” data.
- Compute the mean of the “old” data.
- Check if any of the “new” data points deviate from the mean by more than some set default value.
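A minimal PHP sketch of this algorithm might look like the following (the function name, the default of 3 “new” points, and the 25% threshold are assumptions for illustration, not the actual MyHeritage implementation):

```php
<?php
// Illustrative sketch of the simple mean-deviation algorithm described above.
function findSimpleAnomalies(array $dataPoints, $newPointsCount = 3, $maxDeviation = 0.25)
{
    // Split the series into "old" reference data and the most recent "new" points.
    $oldData = array_slice($dataPoints, 0, count($dataPoints) - $newPointsCount);
    $newData = array_slice($dataPoints, -$newPointsCount);
    if (count($oldData) === 0) {
        return array(); // not enough history to compare against
    }

    // Compute the mean of the "old" data.
    $mean = array_sum($oldData) / count($oldData);

    // Flag every "new" point deviating from the mean by more than the allowed fraction.
    $anomalies = array();
    foreach ($newData as $offset => $value) {
        if ($mean != 0 && abs($value - $mean) / abs($mean) > $maxDeviation) {
            $anomalies[$offset] = $value;
        }
    }
    return $anomalies;
}
```

For example, `findSimpleAnomalies(array(100, 98, 102, 101, 99, 150), 1)` flags the final value of 150, which deviates by 50% from the mean (100) of the preceding points.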
You can obviously choose to use a different algorithm by creating a new derived class, as we did in a later phase of development with a standard deviation implementation and with a comparison of data against reference dates (e.g. trying to determine whether a value for a given day is abnormal by comparing it to the same day in the previous week, the same day two weeks ago, and so forth).
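To illustrate the strategy pattern, here is a hypothetical derived detector applying a standard deviation rule. Only the `analyzeData()` signature comes from this article; the interface body, the three-sigma threshold, and the `addAnomaly()` helper on the result object are assumptions:

```php
<?php
interface DataAnomalyDetector
{
    public function analyzeData($dataArray, $specialAttributes = null);
}

// Hypothetical alternative strategy: a standard deviation rule.
class StandardDeviationAnomalyDetector implements DataAnomalyDetector
{
    public function analyzeData($dataArray, $specialAttributes = null)
    {
        $result = new DataAnomalyDetectorResult();
        $mean = array_sum($dataArray) / count($dataArray);
        $variance = 0;
        foreach ($dataArray as $value) {
            $variance += ($value - $mean) * ($value - $mean);
        }
        $stdDev = sqrt($variance / count($dataArray));

        // Treat the most recent point as anomalous if it lies more than
        // three standard deviations from the mean (an assumed threshold).
        $latest = end($dataArray);
        if ($stdDev > 0 && abs($latest - $mean) / $stdDev > 3) {
            $result->addAnomaly(count($dataArray) - 1, $latest); // assumed helper
        }
        return $result;
    }
}
```

Because every strategy shares the same `analyzeData` entry point, swapping algorithms requires no change to the surrounding process.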
Design and implementation
The entire process
The way the entire process runs is quite simple:
- Fetch a list of charts to analyze.
- Open an output stream (for example, building an email to be sent to the team).
- For each chart:
- a. Decide which data anomaly detector we should use, based on the chart data.
- b. Call the data anomaly detector to:
- i. Fetch the relevant configuration – for instance, overriding the defaults for treating a value as an anomaly (e.g. for a specific chart, allowing a larger deviation from the average, or allowing a deviation of 50% if the value is above the average but only 10% if the value is below the average).
- ii. Find the anomalies based on the data and the configuration.
- c. Pass the result of the anomaly detector through a “report generator” that creates readable output both for the anomalies and for the actual “raw data” that was used for the calculations.
- d. Hand the report to the output stream.
- Close the output stream.
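A simplified sketch of this loop, using the interfaces described in the sections below, might look like this (the wiring and the `getDetectorFor()` decider method are assumptions, not the production code):

```php
<?php
// Simplified sketch of the overall process.
$chartIds = $configDao->getChartsInGroup('daily'); // 'daily' is an example group name

$outputGenerator->openOutput();
foreach ($chartIds as $chartId) {
    $data = $dataDao->getData($chartId);
    $attributes = $configDao->getChartSpecialAttributes($chartId);

    // Pick the algorithm best suited to this chart's data.
    $detector = $detectorDecider->getDetectorFor($data); // assumed decider method

    // Run the analysis and turn the raw result into readable strings.
    $result = $detector->analyzeData($data, $attributes);
    $reportStrings = $reportGenerator->generateReportStrings($chartId, $result);

    $outputGenerator->updateOutput($chartId, $data, $reportStrings);
}
$outputGenerator->closeOutput();
```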
Design issues
By keeping the major classes and interfaces decoupled, it is quite easy to inject different behaviors. It should be very easy to use different algorithms for anomaly detection, different output generators (email, HTML pages, text/CSV files), and different report generators for rendering the reports.
Flow and a few more details
The flow chart of the entire process described above looks like this:
1. The process starts with getting a list of charts that we want to analyze using a ConfigDao. Any implementation of the ConfigDao interface must include the methods
a. getChartsInGroup($groupName)
b. getChartSpecialAttributes($identifier)
The first method is in charge of getting a list of chart identifiers (say chart IDs) to be analyzed. You might want to analyze only a small group of charts every 3 hours, a different group of charts every time you distribute new code, and analyze all charts once a week.
The second method should return a simple container object (ChartSpecialAttributes) that might contain things like flags for white lists, special anomaly thresholds, and whatever else you might think of.
An implementation of this interface will probably have more methods for managing the groups of charts and dealing with the special attributes of charts, but these details are left for the specific cases you will run into.
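Put together, a minimal version of the interface might look like this (the fields on the ChartSpecialAttributes container are illustrative assumptions):

```php
<?php
interface ConfigDao
{
    /** Returns a list of chart identifiers belonging to the given group. */
    public function getChartsInGroup($groupName);

    /** Returns the ChartSpecialAttributes for the given chart. */
    public function getChartSpecialAttributes($identifier);
}

// Simple container object; these particular fields are assumptions.
class ChartSpecialAttributes
{
    public $isWhitelisted = false;        // skip anomaly reporting for this chart
    public $anomalyThresholdAbove = null; // e.g. 0.5 to allow 50% above the average
    public $anomalyThresholdBelow = null; // e.g. 0.1 to allow 10% below the average
}
```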
2. Inside the loop for each chart we start with fetching the chart’s data points, using a DataDao. Any implementation of the DataDao interface must include the method
getData($identifier) (getSingleChartData in the flow chart)
Given a chart identifier, this method should get the actual data points to analyze.
We have two different DAO interfaces (ConfigDao and DataDao) to differentiate between the actual data and the configuration of the chart groups, which logically belong to different data sources. Also, the ConfigDao should be very simple, whereas the DataDao might encapsulate more complicated data fetchers for various sources; for example, you might have charts based on data that you get from Google AdWords.
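A corresponding minimal DataDao could be as small as the following; the database-backed implementation, its query, and its schema are hypothetical:

```php
<?php
interface DataDao
{
    /** Returns the array of data points for the given chart identifier. */
    public function getData($identifier);
}

// Hypothetical implementation backed by an internal stats database.
class DatabaseDataDao implements DataDao
{
    private $pdo;

    public function __construct(PDO $pdo)
    {
        $this->pdo = $pdo;
    }

    public function getData($identifier)
    {
        $statement = $this->pdo->prepare(
            'SELECT value FROM chart_data WHERE chart_id = ? ORDER BY day'
        );
        $statement->execute(array($identifier));
        return $statement->fetchAll(PDO::FETCH_COLUMN);
    }
}
```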
3. Now that we have the data points, we use a data anomaly detector decider (DataAnomalyDetectorDecider) to analyze which algorithm is best for the data we have in hand (based on the variance of the data with respect to the mean).
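A sketch of such a variance-based decider follows; the cutoff of 0.5 on the coefficient of variation and the detector class names are assumed examples:

```php
<?php
// Sketch of a decider that picks an algorithm from the variance of the data.
class DataAnomalyDetectorDecider
{
    public function getDetectorFor(array $dataArray)
    {
        $mean = array_sum($dataArray) / count($dataArray);
        $variance = 0;
        foreach ($dataArray as $value) {
            $variance += ($value - $mean) * ($value - $mean);
        }
        $variance /= count($dataArray);

        // Noisy charts get the more tolerant detector; stable charts get
        // the simple mean-deviation detector.
        $coefficientOfVariation = ($mean != 0) ? sqrt($variance) / abs($mean) : 0;
        return ($coefficientOfVariation > 0.5)
            ? new StandardDeviationAnomalyDetector()
            : new SimpleMeanAnomalyDetector();
    }
}
```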
4. After we decide on an algorithm we use the appropriate data anomaly detector to analyze the data. Any implementation of the DataAnomalyDetector interface must include the method
analyzeData($dataArray, $specialAttributes = null)
where $dataArray is the array of data points to be analyzed, and $specialAttributes is a ChartSpecialAttributes object (as mentioned above).
analyzeData should return an object of type DataAnomalyDetectorResult, which is another simple container object that holds the results of the analysis. Note that the data may not be valid, and it should be checked before proceeding with the analysis. For this purpose we have a class DataAnomalyDetectorInitialChecks, which has a method runInitialChecks that makes sure the data is valid and in turn returns an object of type DataAnomalyDetectorResult.
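For example, a detector might run the initial checks at the start of analyzeData, along these lines (the SimpleMeanAnomalyDetector name and the isValid() accessor on the result object are assumptions):

```php
<?php
// Sketch of a detector that validates its input before analyzing it.
class SimpleMeanAnomalyDetector implements DataAnomalyDetector
{
    public function analyzeData($dataArray, $specialAttributes = null)
    {
        // Validate the data first.
        $initialChecks = new DataAnomalyDetectorInitialChecks();
        $result = $initialChecks->runInitialChecks($dataArray);
        if (!$result->isValid()) { // assumed accessor
            return $result; // e.g. too few data points, or non-numeric values
        }

        // ... the actual anomaly detection would go here ...
        return $result;
    }
}
```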
5. After using the anomaly detector, we get a result. The result, however, is probably not readable: it might have only a code for the type of anomaly that was detected. So, based on your specific requirements, you should use a report generator that interprets the result. Any implementation of the abstract class ReportGenerator must include the method
generateReportStrings($identifier, $dataAnomalyDetectorResult)
This method should get the result from the anomaly detector in step 4 and produce readable strings to be passed on to the output generator in step 6.
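A hypothetical plain-text report generator might look like this (the getAnomalies() accessor on the result object is an assumed name):

```php
<?php
// Hypothetical report generator producing one readable line per anomaly.
class PlainTextReportGenerator extends ReportGenerator
{
    public function generateReportStrings($identifier, $dataAnomalyDetectorResult)
    {
        $strings = array();
        foreach ($dataAnomalyDetectorResult->getAnomalies() as $index => $value) {
            $strings[] = sprintf(
                'Chart %s: data point #%d has an anomalous value of %s',
                $identifier,
                $index,
                $value
            );
        }
        return $strings;
    }
}
```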
6. Getting the report strings from the report generator, the output generator will update the output. Any implementation of the OutputGenerator interface must include the methods
openOutput()
updateOutput($dataIdentifier, $data, $reportStrings)
closeOutput()
It should be clear at this point what each of these methods should do.
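For instance, a CSV-file implementation might be sketched as follows (the class, the file name, and the row format are hypothetical):

```php
<?php
interface OutputGenerator
{
    public function openOutput();
    public function updateOutput($dataIdentifier, $data, $reportStrings);
    public function closeOutput();
}

// Hypothetical CSV-file implementation of the interface described above.
class CsvFileOutputGenerator implements OutputGenerator
{
    private $handle;

    public function openOutput()
    {
        $this->handle = fopen('anomalies_report.csv', 'w');
    }

    public function updateOutput($dataIdentifier, $data, $reportStrings)
    {
        foreach ($reportStrings as $line) {
            fputcsv($this->handle, array($dataIdentifier, $line));
        }
    }

    public function closeOutput()
    {
        fclose($this->handle);
    }
}
```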
7. EmailNotification is of course optional and behaves much like the OutputGenerator.
Current status and moving forward
The Stats Anomalies Detector has already proved itself for us in production, detecting situations such as overuse of one of our external APIs, a multiple-photo upload component that stopped working after browser security policies changed, and many more.
Having realized that investing in such a tool has a very positive ROI, we have decided to put more effort into further improvements. We are striving to minimize false positives and to add further capabilities.
Thanks to Yoav Bachrach for his involvement in the project.
Yonatan & Ran
References
[1] Deepthi Cheboli. Anomaly Detection of Time Series, 2010.
[2] Varun Chandola, Deepthi Cheboli, and Vipin Kumar. Detecting Anomalies in a Time Series Database, 2009.
[3] Juan J. Flores, Anastacio Antolino, and Juan M. Garcia. Evolving Hidden Markov Models For Network Anomaly Detection, 2009.
About the Authors
Ran Levy has been the backend team leader at MyHeritage for the past three years. He has fifteen years of experience in the industry as a developer and architect of complex, large-scale systems, and he is the initiator of the “Stats Anomalies Detector” project.
Yonatan Harel has been working at MyHeritage for three years as an online marketing analyst and as a back-end developer. He holds a Master's degree in Mathematics from UC Berkeley, and prior to his work at MyHeritage he taught math at the college level for four years.