Grafana has released outlier detection as part of its Grafana Machine Learning toolkit. Outlier detection can be used to monitor a group of similar things and alert when some of them start to behave differently from the rest of the group.
The launch includes two detection algorithms: DBSCAN and MAD. Density-based spatial clustering of applications with noise (DBSCAN) is best suited for series that move together over time or have strong trends in their data. DBSCAN clusters data points based on their density and the distances between them. It flags a series when the series has data points outside the most significant cluster. Because it works with a rolling window, the band of normal behavior moves with the data.
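To make the clustering idea concrete, the following is a minimal sketch of density-based outlier flagging at a single timestamp. It is not Grafana's implementation: the function names, the `eps` and `min_samples` parameters, and the choice to treat the largest cluster as the "most significant" one are all assumptions made for illustration, and the rolling window is left out.

```python
from collections import Counter


def dbscan_1d(values, eps, min_samples):
    """Label each value with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(values)
    labels = [None] * n  # None means "not visited yet"
    cluster = -1

    def neighbors(i):
        # All points (including i itself) within eps of point i.
        return [j for j in range(n) if abs(values[i] - values[j]) <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


def outliers_at_timestamp(values_by_series, eps, min_samples=3):
    """Return the names of series that fall outside the largest cluster."""
    names = list(values_by_series)
    labels = dbscan_1d([values_by_series[name] for name in names], eps, min_samples)
    clustered = [label for label in labels if label != -1]
    if not clustered:
        # Assumption: with no cluster there is no baseline, so nothing is flagged.
        return set()
    largest = Counter(clustered).most_common(1)[0][0]
    return {name for name, label in zip(names, labels) if label != largest}


# Hypothetical CPU usage for five pods at one timestamp.
cpu = {"pod-a": 0.50, "pod-b": 0.52, "pod-c": 0.49, "pod-d": 0.51, "pod-e": 0.95}
flagged = outliers_at_timestamp(cpu, eps=0.05)
```

In this hypothetical data, four pods sit within 0.05 of one another and form the baseline cluster, so only `pod-e` is flagged. Re-running the same check over a sliding window of recent points is what would let the baseline follow a trend.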
Median absolute deviation (MAD) works best when all the group members move within a stable band of normal behavior. It is less affected by out-of-sync events such as instances restarting at different times. It compares the distance of each data point at each timestamp to a rolling 24-hour median. It flags a series when the series has data points outside the chosen sensitivity threshold.
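The MAD check can be illustrated with a short sketch. The function name, the pooled window of recent values, and the threshold of 3.0 are hypothetical; how Grafana maps its sensitivity setting to a threshold is not described in the announcement.

```python
from statistics import median


def mad_outliers(window_values, current_by_series, threshold=3.0):
    """Flag series whose current value is far from the window median.

    window_values: recent values pooled across the group (for example, the
    last 24 hours). Distance is measured in units of the median absolute
    deviation, and `threshold` is an assumed cut-off.
    """
    center = median(window_values)
    mad = median(abs(v - center) for v in window_values)
    if mad == 0:
        # Degenerate case: the window is flat, so any deviation stands out.
        return {name for name, v in current_by_series.items() if v != center}
    return {
        name
        for name, v in current_by_series.items()
        if abs(v - center) / mad > threshold
    }


# Hypothetical request latencies (ms) pooled from the rolling window.
window = [10, 11, 9, 10, 12, 10, 11, 9, 10, 10]
current = {"pod-a": 10, "pod-b": 11, "pod-c": 20}
flagged = mad_outliers(window, current)
```

Here the window median is 10 and the MAD is 0.5, so `pod-c` is 20 MADs away from the median and is the only series flagged. Because the median and the MAD are both robust statistics, a single instance restarting barely shifts the band.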
The Outlier Detector query determines the series to compare and the baseline group. It supports any metric query with three or more series. The query should be filtered so that the baseline group presents a similar profile. Once the query is created, the user selects the algorithm and sets the sensitivity. A higher sensitivity flags more outliers and can produce more alerts.
Outlier Detector supports sending alerts via Grafana Alerting. By default, the alert will notify if any one pod is misbehaving. It is possible to adjust the alert to be an aggregated outlier-based rule. For example, the default alert rule for an Outlier Detector query named web_api_cluster_cpu_usage is web_api_cluster_cpu_usage:outliers. To change the rule so that it only fires if more than 10% of the group is behaving differently:
(sum(web_api_cluster_cpu_usage:outliers) / count(web_api_cluster_cpu_usage:outliers)) > 0.10
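The arithmetic behind this aggregated rule can be sketched as follows. The sketch assumes that the outliers series carries a 1 for each member currently flagged and a 0 otherwise, which is what the sum-over-count form of the expression implies; the function and variable names are hypothetical.

```python
def fraction_rule_fires(outlier_flags, max_fraction=0.10):
    """Return True when the share of flagged members exceeds max_fraction.

    outlier_flags maps each series name to 1 (outlier) or 0 (normal),
    an assumed encoding of the ':outliers' series.
    """
    flags = list(outlier_flags.values())
    if not flags:
        return False  # an empty group cannot exceed the threshold
    return sum(flags) / len(flags) > max_fraction


# Hypothetical group of 20 pods.
two_of_twenty = {f"pod-{i}": 1 if i < 2 else 0 for i in range(20)}
three_of_twenty = {f"pod-{i}": 1 if i < 3 else 0 for i in range(20)}

fires_at_10_percent = fraction_rule_fires(two_of_twenty)
fires_at_15_percent = fraction_rule_fires(three_of_twenty)
```

With 2 of 20 pods flagged the ratio is exactly 0.10, which does not exceed the threshold, so the rule stays quiet. With 3 of 20 flagged the ratio is 0.15 and the rule fires.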
Grafana Machine Learning currently supports the following data sources: Prometheus, Graphite, Loki (for metric queries only), Postgres, InfluxDB, BigQuery, Snowflake, Splunk, and Datadog.
The default usage limits are 1000 series per outlier detector and 10 outlier detectors per instance. Grafana has stated that these limits can be increased by request. Outlier detection is available for no additional charge for Grafana Cloud customers with Pro, Advanced, or Custom plans. Questions can be brought to the #machine-learning channel of the Grafana Labs Slack workspace.