Slack’s engineering team has described the load-testing strategy that has become a critical part of its continuous delivery pipeline, and says it promotes greater ownership by engineers. Although the Slack engineers said they had minimal load-testing experience, they built their tooling from scratch in Go, following a methodical approach that offers a roadmap for engineers facing similar challenges.
The enterprise key management (EKM) service was designed to let users supply their own encryption and decryption keys for data in Slack, including messaging data, so there was a real concern that it would affect performance. Load testing addressed this risk, and building it into a service empowered engineers across the organisation to adopt it as part of their standard tooling. The team’s continued focus on simplicity, scalability and usability while testing the EKM service also unearthed issues unrelated to the intended change. They identified six key stages in their strategy, summarised here:
- Identify critical test paths
- Gather baseline data
- Create consistent test environments
- Consolidate testing tools
- Automate machine provisioning
- Simplify metric tracking and analysis
The team created and used a single command-line interface tool that gave them fine-grained control over the rate and concurrency of API calls, as well as safety valves. They used a combination of goroutines (lightweight threads managed by the Go runtime) and channels (mechanisms for communication between goroutines) to generate concurrent load. This allowed them to fine-tune and simulate different load shapes against any specified API. Having built the load-testing service for EKM, they then uncovered a number of unrelated code paths that were generating excessive load on other systems or performing unnecessarily expensive operations.
One key benefit came from gathering all their load-testing tools into a service so they were generating and executing API calls in a single place. Additionally, they invested in toolchain automation to provision a bootstrapped machine with relevant dependencies from a single command line interface (CLI) command in less than five minutes. By automating the provision of new test machines, they were quickly able to spin up and populate both a test and control environment so they could compare results against a baseline. Here is their overview of that service:
The Slack loadtest --bootstrap script executed the following steps:
- Creates a development server
- Installs the latest version of the Go toolset
- Configures PATH and git
- Clones and installs the loadtest repository and its dependencies
- Performs a simple test to confirm the tooling is installed correctly
The load-test tool executed calls to the APIs, which in turn exercised their entire networking stack, from the content delivery network to the network load balancer, to HAProxy and onwards until the request reached their web host. They optimised further by building basic remediation steps into the script so that engineers could spend less time manually debugging common errors.
Cloud service providers like AWS, GCP and Azure enable a wide variety of competing open- and closed-source load-testing tools. According to Alexander Podelko, Apache JMeter was the "most popular load-testing tool [of 2018]," having overtaken Micro Focus LoadRunner. The array of open-source tools continues to grow, however; for instance, the AWS Lambda-powered Goad was chosen by some for its user experience after being showcased at the 2016 Gopher Gala. According to the Slack team, creating a load-testing service that pulled all of Slack’s tooling together made it one of their “force multipliers of developer productivity” and resulted in unrelated service improvements.