Slack has been working on making load testing a core concern for all engineers, not only those focusing on performance, and moving from a reactive approach to performance to a more integrated effort, say Slack engineers Shreya Ramesh and Melissa Khuat.
Load testing is a key practice to ensure no performance regressions sneak into a release. Unfortunately, it can be quite time-consuming, especially because it requires setting up a complex test environment. This can discourage engineers from running load tests at all, say Ramesh and Khuat.
To tackle this problem, Slack took a quite radical stance. Instead of relying on ad-hoc load testing, they made it a standard feature of their development process. This allowed them to identify performance issues early and incrementally by continuously running load tests.
"Having clients booted and running all the time removes the extra effort required by engineers to boot up clients and set up their custom environment. When we deploy builds to production, we can immediately run tests against a large [test] organization with a high number of active users to ensure we didn’t introduce any performance regressions."
Continuously running load tests is not only a technical challenge. It also requires caution so that tests do not impair normal platform behavior, so that the tests themselves operate efficiently, and so that the whole organization understands what is running. For this approach to work effectively, the Slack team had to work on three key aspects: safety, resilience, and communication.
To improve safety, Slack engineers provided two mechanisms: an Automatic Shutdown service, triggered when the API success rate remains below 95% for five minutes, and an Emergency Stop service, which engineers can trigger manually.
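The two safety mechanisms can be illustrated with a minimal Python sketch. This is not Slack's implementation: the `ShutdownGuard` class, its method names, and the idea of feeding it periodic success-rate samples are all hypothetical. Only the 95% threshold and the five-minute duration come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShutdownGuard:
    """Hypothetical guard that decides when continuous load tests must stop."""

    threshold: float = 0.95        # from the article: 95% API success rate
    window_seconds: float = 300.0  # from the article: five minutes
    below_since: Optional[float] = None  # time the rate first dropped below threshold
    stopped: bool = False
    reason: Optional[str] = None

    def observe(self, success_rate: float, now: float) -> bool:
        """Record one sample taken at time `now` (seconds).

        Returns True if load tests should be stopped.
        """
        if self.stopped:
            return True
        if success_rate >= self.threshold:
            # A healthy sample resets the timer: the rate must stay
            # below the threshold for the whole window to trigger.
            self.below_since = None
            return False
        if self.below_since is None:
            self.below_since = now
        if now - self.below_since >= self.window_seconds:
            self.stopped = True
            self.reason = "automatic shutdown"
        return self.stopped

    def emergency_stop(self) -> None:
        """Manual override, analogous to the Emergency Stop service."""
        self.stopped = True
        self.reason = "emergency stop"


if __name__ == "__main__":
    guard = ShutdownGuard()
    # Assumed sampling: one success-rate reading per minute.
    for minute, rate in enumerate([0.99, 0.93, 0.92, 0.94, 0.90, 0.91, 0.93]):
        if guard.observe(rate, now=minute * 60.0):
            print(f"stopping load tests at minute {minute}: {guard.reason}")
            break
```

In this sketch a single healthy sample resets the timer, so only a sustained drop stops the tests. Whether Slack's service resets on recovery or evaluates a rolling average is not stated in the source.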
Resilience was achieved by making the test services save their state to a database. This enabled their automatic restart when, for example, a new release is deployed or the services themselves must receive a security patch. Another important aspect of resilience is the automation of all steps required for testing, e.g. token generation, to reduce engineering intervention.
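The idea of persisting state so that a restarted service can resume where it left off can be sketched as follows. The source does not say which database Slack uses or what the state contains, so SQLite, the `StateStore` class, the table layout, and the example fields (`active_clients`, `phase`) are assumptions made purely for illustration.

```python
import json
import sqlite3
from typing import Any, Dict


class StateStore:
    """Hypothetical key-value store for load test service state."""

    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS test_state ("
            "service TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )
        self.conn.commit()

    def save(self, service: str, state: Dict[str, Any]) -> None:
        # Overwrite the previous checkpoint for this service.
        self.conn.execute(
            "INSERT OR REPLACE INTO test_state (service, state) VALUES (?, ?)",
            (service, json.dumps(state)),
        )
        self.conn.commit()

    def load(self, service: str, default: Dict[str, Any]) -> Dict[str, Any]:
        # Return the last checkpoint, or a copy of the default on first start.
        row = self.conn.execute(
            "SELECT state FROM test_state WHERE service = ?", (service,)
        ).fetchone()
        return json.loads(row[0]) if row else dict(default)

    def close(self) -> None:
        self.conn.close()


if __name__ == "__main__":
    store = StateStore("load_test_state.db")
    # On startup, resume from the last checkpoint if one exists.
    state = store.load("load-generator", {"active_clients": 0, "phase": "idle"})
    state["active_clients"] += 1000
    state["phase"] = "running"
    store.save("load-generator", state)
    store.close()
```

Running the script twice shows the effect: the second run starts from the client count saved by the first, which is the behavior a service needs after a deploy or a security patch restarts it.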
Finally, communication was key to ensure the organization kept its trust in the testing tools. This included a gradual ramp-up process to minimize surprises for any parties involved. Similarly, a careful rotation strategy was required to ensure the team kept high availability to respond to any incident that could occur.
According to Ramesh and Khuat, continuous load testing brought a number of benefits, including a better understanding of the performance expectations of their largest customers; the ability to react quickly to a production incident by attempting to reproduce the issue; and the detection of performance regressions before they make it to production.
There is much more to the strategy Slack followed to integrate continuous load testing into their pipeline than what can be covered here. So, make sure you do not miss the original article to get the full details.