Service monitoring is traditionally based on comparing measurable values, known as Key Performance Indicators (KPIs), against a set of threshold values. In theory, the operations team determines what the thresholds for warnings and alerts should be and sets them. In practice, the operations team often has no idea what these values should be.
For example, the definition of “normal response time” usually varies based on the time of day. In the middle of the night when the server load is minimal, response times should also be minimal. But as the workday starts and server loads increase, the thresholds should be somewhat more lenient.
So the first improvement in Splunk ITSI is the ability to set time-dependent thresholds. This allows operations to more closely match the alerts to the expected workload on an hour-by-hour basis. However, this still assumes that operations knows what the thresholds should be. Determining them requires a lot of research, and the values need to be regularly updated to reflect how the user workload changes over time.
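As a rough illustration of time-dependent thresholds, the sketch below picks a response-time limit based on the hour of the day. It is not Splunk ITSI configuration; the table `HOURLY_THRESHOLDS_MS`, the functions `threshold_for` and `is_breach`, and all of the numbers are hypothetical values chosen for the example.

```python
from datetime import datetime

# Hypothetical warning thresholds for response time, in milliseconds.
# Each entry is (start_hour inclusive, end_hour exclusive, threshold_ms).
HOURLY_THRESHOLDS_MS = [
    (0, 7, 200),    # overnight: light load, so a strict threshold
    (7, 19, 500),   # workday: heavier load, so a more lenient threshold
    (19, 24, 300),  # evening: load tapering off
]


def threshold_for(ts: datetime) -> int:
    """Return the threshold that applies at the given time of day."""
    for start, end, limit in HOURLY_THRESHOLDS_MS:
        if start <= ts.hour < end:
            return limit
    raise ValueError(f"no threshold configured for hour {ts.hour}")


def is_breach(ts: datetime, response_ms: float) -> bool:
    """True if the measured response time exceeds the threshold for that hour."""
    return response_ms > threshold_for(ts)
```

With these made-up values, a 350 ms response raises a warning at 3 a.m. but not at 10 a.m., which is the behavior a single fixed threshold cannot express.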
Adaptive Thresholds
The machine learning technique known as “adaptive thresholds” helps to deal with this issue. Adaptive thresholds work by analyzing historical data to determine what should be considered normal. In Splunk, this training data can span the last 7, 14, 30, or 60 days. Since the shape of the data can vary dramatically, Splunk supports standard deviation, quantile, and range-based thresholds. The adaptive thresholds are automatically recalculated on a nightly basis so that slow changes in behavior don’t trigger false alerts.
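To make the three threshold styles concrete, here is a minimal sketch of how each could be derived from a window of historical KPI values. It does not reproduce Splunk's actual calculations. The function names, the default multipliers, quantiles, and fractions, and the use of the nearest-rank quantile method are all assumptions made for illustration. The `history` argument stands in for whatever training window (for example, the last 7 days) has already been selected.

```python
import math
import statistics


def stdev_thresholds(history, multipliers=(2.0, 3.0)):
    """Thresholds at mean + k * standard deviation, one per multiplier k."""
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history)  # population standard deviation
    return [mean + k * spread for k in multipliers]


def quantile_thresholds(history, quantiles=(0.95, 0.99)):
    """Thresholds at the given quantiles, using the nearest-rank method."""
    ordered = sorted(history)
    ranks = [max(1, math.ceil(q * len(ordered))) for q in quantiles]
    return [ordered[rank - 1] for rank in ranks]


def range_thresholds(history, fractions=(0.8, 0.9)):
    """Thresholds at fixed fractions of the span between the min and max."""
    low, high = min(history), max(history)
    return [low + f * (high - low) for f in fractions]
```

Rerunning functions like these each night over a sliding window is one way to picture the recalculation step: as the workload drifts slowly, the window drifts with it, and the thresholds follow.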
Anomaly Detection
Anomaly detection looks for unusually large spikes in the data, specifically the kind of spikes that are so brief that normal threshold monitoring wouldn’t catch them.
Spike detection itself is easy; the challenge is figuring out whether the spike is an anomaly or just part of normal operating behavior. Machine learning plays a part in this by looking at the training data for past examples of spikes. If few or no spikes in the history match the spike of interest, the spike is flagged as severe or minor. On the other hand, if similar spikes occur often, the anomaly detector will ignore it.
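The idea of comparing a new spike against spikes seen in the training data can be sketched as follows. This is a deliberately simplified stand-in, not Splunk's algorithm. The function `classify_spike`, the `tolerance` and `few` parameters, and the mapping of "no matches" to severe and "a few matches" to minor are assumptions made for the example.

```python
def classify_spike(spike, past_spikes, tolerance=0.2, few=2):
    """Classify a spike by how often comparably large spikes appear in history.

    A past spike counts as "similar" if it reaches at least
    (1 - tolerance) of the new spike's magnitude.
    """
    similar = sum(1 for past in past_spikes if past >= spike * (1 - tolerance))
    if similar == 0:
        return "severe"   # assumption: nothing like it in the training data
    if similar <= few:
        return "minor"    # assumption: seen before, but only rarely
    return "normal"       # similar spikes are common, so ignore it
```

For example, a spike of 1000 against a history of spikes near 300 would be flagged as severe, while the same spike against a history that regularly reaches 900 or more would be ignored.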