Leveraging Data Science to Improve Monitoring

At the recent devopsdays Amsterdam 2015, Patrick Roelke, ops engineer at Stylight, an online fashion retailer, contended that monitoring still has many issues, such as too many false-positive alerts and non-actionable dashboards. Roelke believes that data science can help by eliminating static thresholds and coalescing information from various data sources into a single metric. The talk included a quick overview of monitoring tools that leverage data science: Etsy's Kale, Stack Exchange's Bosun, and Twitter's AnomalyDetection.

Mainstream monitoring tools and approaches still suffer from various ills. The signal-to-noise ratio of alerts is too low. Too many false positives lead engineers to create e-mail rules that archive alerts automatically. Dashboards depicting unprocessed data fail to provide actionable insights, such as correlations between metrics. These failures are compounded by ever-increasing volumes of data and metrics. This state of affairs led to the famous #monitoringsucks hashtag and a GitHub repository that collects information on the subject.

Data science addresses those issues in several ways. It helps to find relationships and patterns in data (think of Google Now). Predictive analysis uses statistical techniques and historical data to make predictions about the future. Big data needs effective anomaly detection, that is, the detection of deviations from what past history would lead one to expect, so that engineers can focus on real issues.

Within the context of monitoring, Roelke thinks data science brings three main benefits. It helps to pinpoint problems before they hit static, more or less arbitrary, thresholds (e.g., CPU consumption over 80%). It makes it possible to group data from a variety of sources, reduce it to a single metric, and create alerts on that metric. And it lets engineers create more meaningful dashboards that help them understand correlations between different metrics.
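To make the first two benefits concrete, here is a minimal Python sketch. It is an illustration only and was not shown in the talk: the function names, the sample values, and the choice of a three-standard-deviation rule are all assumptions. It contrasts a static 80% threshold with a check against the metric's own recent history, and then folds several metrics into one score.

```python
from statistics import mean, stdev


def static_alert(value, threshold=80.0):
    """Classic static rule: fire only once the value crosses a fixed limit."""
    return value > threshold


def zscore(history, value):
    """Distance of `value` from the historical mean, in standard deviations."""
    sigma = stdev(history)
    return 0.0 if sigma == 0 else (value - mean(history)) / sigma


def baseline_alert(history, value, k=3.0):
    """Fire when the value deviates more than k standard deviations
    from what recent history suggests (k=3 is an assumed default)."""
    return abs(zscore(history, value)) > k


def combined_score(samples):
    """Reduce several metrics to one number: the largest absolute z-score.
    `samples` maps a metric name to a (history, current_value) pair."""
    return max(abs(zscore(h, v)) for h, v in samples.values())


# Hypothetical CPU readings in percent: normally around 20%.
history = [20.0, 22.0, 21.0, 19.0, 20.0, 21.0, 20.0, 22.0]
current = 45.0

print(static_alert(current))             # static rule stays silent at 45%
print(baseline_alert(history, current))  # baseline rule notices the jump
```

With these sample values the static rule stays silent because 45% is well below 80%, while the baseline rule fires because 45% is far outside the metric's usual range. A single alert could then be defined on `combined_score` rather than on each metric separately.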

Roelke gave a quick overview of three tools that leverage data science principles: Etsy's Kale, Stack Exchange's Bosun, and Twitter's AnomalyDetection.

Kale has two subsystems, Skyline and Oculus. Skyline focuses on anomaly detection, showing only the metrics that are found to be anomalous. It uses a consensus-based approach to decide whether a metric is anomalous: it runs the metric through a variety of algorithms and, if the majority consider it anomalous, marks the metric as anomalous. Oculus handles anomaly correlation by looking for metrics that resemble a chosen anomalous metric and graphing them side by side.

Skyline, showing anomalous metrics.
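The consensus idea can be sketched in a few lines of Python. This is not Skyline's code: the three detectors below are simplified stand-ins chosen for illustration, and their names and cut-off values are assumptions. Each detector votes on whether the latest part of a series looks anomalous, and the series is flagged only if a strict majority agrees.

```python
from statistics import mean, median, stdev


def stddev_from_mean(series):
    """Vote: last point is more than 3 standard deviations from the mean."""
    *history, last = series
    sigma = stdev(history)
    return sigma > 0 and abs(last - mean(history)) > 3 * sigma


def median_absolute_deviation(series):
    """Vote: last point is more than 6 MADs away from the median."""
    *history, last = series
    med = median(history)
    mad = median([abs(x - med) for x in history])
    return mad > 0 and abs(last - med) > 6 * mad


def tail_average_shift(series):
    """Vote: the average of the last 3 points has shifted by more than
    3 standard deviations relative to the earlier points."""
    history, tail = series[:-3], series[-3:]
    sigma = stdev(history)
    return sigma > 0 and abs(mean(tail) - mean(history)) > 3 * sigma


DETECTORS = [stddev_from_mean, median_absolute_deviation, tail_average_shift]


def is_anomalous(series, detectors=DETECTORS):
    """Flag the series only if a strict majority of detectors vote yes."""
    votes = sum(1 for detect in detectors if detect(series))
    return votes > len(detectors) / 2


# Hypothetical metric values; the second series ends in a spike.
steady = [10, 12, 11, 9, 10, 13, 11, 8, 10, 12, 11, 10]
spike = steady[:-1] + [40]

print(is_anomalous(steady))  # False
print(is_anomalous(spike))   # True
```

Requiring a majority means that one overly sensitive algorithm cannot raise an alert on its own, which is how a voting scheme of this kind reduces false positives.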

Bosun is a monitoring and alerting system whose main differentiator is the workflow it promotes for creating alerts, which is a bit like working in an IDE. You start by graphing a metric to generate an expression in Bosun's powerful expression language. You then reduce that expression to a single value using one of several reducing functions, e.g., median, average, or percentiles. That single number triggers an alert when it evaluates to a non-zero value, so it is now possible to build a rule and a notification around it. A distinctive Bosun feature is the ability to test the rule against past data in order to fine-tune it and thus avoid alert fatigue.

Bosun's rule definition screen.
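The reduce-then-test workflow can be imitated in plain Python. The sketch below does not use Bosun's expression language or API: `evaluate_rule`, `backtest`, the reducer table, and the CPU samples are hypothetical names and data invented for illustration. By this sketch's own convention, a rule returns 1 when it should alert and 0 otherwise.

```python
from statistics import mean, median

# Reducing functions that collapse a window of samples into one number.
REDUCERS = {"avg": mean, "median": median, "max": max}


def evaluate_rule(window, reducer="avg", threshold=80.0):
    """Reduce the window to a single value and compare it to the threshold.
    Returns 1 (alert) or 0 (no alert)."""
    value = REDUCERS[reducer](window)
    return int(value > threshold)


def backtest(series, window_size, **rule):
    """Replay the rule over historical data with a sliding window and
    return the index of the last sample of every window that would alert."""
    fired = []
    for end in range(window_size, len(series) + 1):
        if evaluate_rule(series[end - window_size:end], **rule):
            fired.append(end - 1)
    return fired


# Hypothetical CPU history in percent: one isolated spike (index 2)
# followed by a sustained period of high load (indexes 5 to 7).
cpu = [40, 45, 95, 50, 42, 85, 90, 92, 60, 41]

print(backtest(cpu, 3, reducer="max"))     # [2, 3, 4, 5, 6, 7, 8, 9]
print(backtest(cpu, 3, reducer="avg"))     # [7, 8]
print(backtest(cpu, 3, reducer="median"))  # [6, 7, 8]
```

Replaying the three candidate rules over the same history shows why backtesting matters: the `max` reducer would have alerted on almost every window, while `avg` and `median` ignore the isolated spike and fire only during the sustained load.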

AnomalyDetection is a package for R (a programming language for statistical computing) that provides anomaly detection in the context of big data. Twitter faces specific challenges as a social network. On the one hand, it is global, so there are global patterns to watch out for. On the other, different countries and regions have distinct local patterns (e.g., the Super Bowl in the USA, Christmas in Christian countries). All this data makes it hard to find anomalies. AnomalyDetection uses sophisticated statistical algorithms to increase the signal-to-noise ratio and thus make it easier to spot and fix anomalies.

AnomalyDetection graph showing both local and global anomalies.
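The difference between global and local anomalies can be shown with a deliberately simplified Python sketch. It is not the algorithm the AnomalyDetection package implements and it does not call the package. It assumes a fixed, known seasonal period, removes the typical value at each position in the cycle, and applies a robust outlier test to what remains. All names, the cut-off `k`, and the data are hypothetical.

```python
from statistics import median


def seasonal_residuals(series, period):
    """Subtract the median value observed at each position of the cycle."""
    phase_median = [median(series[p::period]) for p in range(period)]
    return [x - phase_median[i % period] for i, x in enumerate(series)]


def robust_outliers(values, k=5.0):
    """Indexes of values more than k MADs away from the median."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return [i for i, v in enumerate(values) if v != med]
    return [i for i, v in enumerate(values) if abs(v - med) > k * mad]


# Hypothetical traffic with a cycle of 4 samples (trough, rise, peak, fall).
series = [
    10, 21, 30, 19,
    11, 20, 31, 20,
    10, 19, 29, 21,
    30, 20, 30, 20,  # index 12: peak-level traffic during a trough
    9, 20, 60, 19,   # index 18: traffic far above anything seen before
]

print(robust_outliers(series))                         # [18]
print(robust_outliers(seasonal_residuals(series, 4)))  # [12, 18]
```

Testing the raw values finds only the global anomaly at index 18. The value at index 12 is unremarkable for the series as a whole and becomes visible only once the seasonal pattern has been removed, which is what makes it a local anomaly.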
