
Slack Conquers Deployment Fears with Z-score Monitoring

An engineer at team communication platform Slack has written about confronting their fear of deployments and successfully implementing a bot to monitor them instead.

Sean McIlroy, a senior software engineer at Slack, documents how the company moved from a rota of developers supervising deployments of its web app to a bot that deploys around 150 changes a day. In a detailed blog post, McIlroy explains the reasoning and logic behind giving ReleaseBot this pivotal role, describing a seemingly scary delegation of responsibility that ultimately boils down to mathematically detecting a spike on a graph.

Deploying changes to a large-scale platform like Slack presents a unique set of challenges. Most of the service runs from a monolith named "The Webapp," subject to hundreds of changes per week. Slack's deployment philosophy revolves around continuous delivery, aiming to deliver developers' work to customers quickly and to iterate rapidly based on feedback. However, managing a constant flow of changes, averaging around 150 per day, requires a careful balance to avoid overwhelming the system and to minimize the risk of errors.

Traditionally, Slack relied on Deploy Commanders (DCs), individuals tasked with executing deployment steps during scheduled shifts. However, because DCs rotated and the system kept growing in complexity, it was hard for them to build confidence and expertise. The Release Engineering team therefore set out to give DCs clearer decision-making guidelines.

This led to the development of ReleaseBot, an automated deployment system equipped with anomaly detection and monitoring capabilities. The transition from manual to automated deployments was gradual: ReleaseBot initially operated alongside DCs and, over time, proved that it could catch issues faster and more consistently than its human counterparts. While the prospect of automated deployments initially caused apprehension because of the perceived risks, ReleaseBot's performance surpassed expectations, instilling confidence in its ability to handle deployments autonomously.

ReleaseBot's effectiveness lies in its anomaly detection mechanisms, particularly its use of z-scores. A z-score expresses how far a data point lies from the mean, measured in standard deviations, which makes it possible to identify statistical outliers that may indicate a problem. In effect, it is a mathematical technique for detecting a spike on a graph. The underlying principle is that if an application behaves differently after a deployment than it did before, ReleaseBot raises a "high confidence" signal and notifies engineers that intervention may be needed. High confidence signals, triggered by significant deviations from historical data, prompt immediate attention, while low confidence signals, typically governed by static thresholds, serve as supplementary alerts.
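As a rough illustration of the z-score idea described above, the following Python sketch flags a post-deployment metric value that sits far from its recent history. It is not ReleaseBot's code: the function names, the sample data, and the cut-off of 3 standard deviations are all assumptions made for the example.

```python
import math
from statistics import mean, stdev


def z_score(value, history):
    """Return how many sample standard deviations `value` lies from the mean of `history`."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least two points
    if sigma == 0:
        # A perfectly flat history: any change at all is infinitely unusual.
        return 0.0 if value == mu else math.copysign(math.inf, value - mu)
    return (value - mu) / sigma


def is_high_confidence(value, history, threshold=3.0):
    """Hypothetical rule: treat |z| >= threshold as a 'high confidence' signal.

    The default threshold of 3.0 is an assumption for illustration only.
    """
    return abs(z_score(value, history)) >= threshold


# Hypothetical error counts per minute before a deployment.
history = [10, 12, 11, 9, 10, 11, 12, 10]

print(is_high_confidence(40, history))  # a spike well outside the usual range
print(is_high_confidence(11, history))  # an ordinary fluctuation
```

Because the score is relative to the metric's own spread, the same rule can be applied to metrics with very different scales without tuning a separate absolute limit for each one.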

Example of using z-score

The frequency and range of these high-confidence signals govern the severity of the Slack notifications sent to the team, with a colour scale of white, blue and red showing how urgently a signal should be looked at. Slack also keeps static-threshold notifications as low-confidence alarms, and feeds them into ReleaseBot to calculate dynamic thresholds that account for the normal load on, and performance of, a component at the time of deployment. The dynamic thresholds draw on historical data to distinguish abnormal spikes from the fluctuations expected during deployments, which allows Slack to filter out routine variations while flagging genuine anomalies that warrant intervention.
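One way to picture a dynamic threshold is as a static limit that is raised when historical data shows the metric normally runs higher at that time. The Python sketch below is only an assumption about how such a rule could look; the article does not give ReleaseBot's formula, and `dynamic_threshold`, `breaches`, and the `headroom` multiplier are hypothetical names and values.

```python
from statistics import mean


def dynamic_threshold(static_threshold, historical_values, headroom=1.5):
    """Return the alerting limit for a metric at deployment time.

    Assumed rule: take the larger of the static limit and the historical
    average for comparable time windows, scaled by a headroom factor.
    """
    if not historical_values:
        # No history available: fall back to the static limit.
        return static_threshold
    return max(static_threshold, mean(historical_values) * headroom)


def breaches(value, static_threshold, historical_values, headroom=1.5):
    """True if `value` exceeds the dynamic threshold."""
    return value > dynamic_threshold(static_threshold, historical_values, headroom)


# Hypothetical CPU-style metric with a static limit of 100.
busy_history = [200, 220, 180]   # this component is normally busy at this hour
quiet_history = [50, 60]         # this component is normally quiet

print(breaches(250, 100, busy_history))   # within the usual busy-hour range
print(breaches(250, 100, quiet_history))  # well above what is normal here
```

Under this assumed rule, a value that would trip the static alarm during a routinely busy period is tolerated, while the same value on a normally quiet component is still flagged.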

McIlroy closes by highlighting that deployment monitoring differs from normal monitoring, that Slack has exploited this knowledge to build a tool that makes deployments less scary, and that the team now has more confidence in the tool managing deployments than in developers staring at dashboards.
