Datadog recently published a set of best practices for monitoring dark launches. The blog post provides a detailed description of dark launches and covers the different types of metrics, dashboards, and practices used to monitor them.
Paul Gottschling, a technical content writer at Datadog, authored the blog post in which he explains how to test dark launches to determine success and prevent them from causing infrastructure issues. He uses a hypothetical deployment of a SaaS service for booking office meetings to illustrate the different aspects of monitoring dark launches.
Gottschling recommends establishing a baseline to determine whether the dark launch is successful. This is typically done by setting service level objectives (SLOs) and defining service level indicators (SLIs) to track the performance of the service.
Gottschling mentions that in this situation, all monitoring data should be tagged by version. That way, it’s easy to visualize the health and performance of the dark launch alongside the latest release.
For example, the SLIs for the booking service are the uptime of the service, the percentage of requests that result in an internal server error (HTTP 500), and the 95th-percentile response latency, which should not exceed 500 ms.
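To make this concrete, the metrics backing such SLIs could be emitted with Datadog's Python DogStatsD client and tagged by version; the metric names, tag value, and request handler in this sketch are hypothetical rather than taken from the blog post:

```python
import time

from datadog import initialize, statsd

# Point the client at the local DogStatsD agent (default port).
initialize(statsd_host="localhost", statsd_port=8125)

VERSION_TAG = "version:dark-launch"  # hypothetical tag value


def handle_booking_request(handler):
    """Call a request handler and emit SLI metrics tagged by version."""
    start = time.monotonic()
    status = handler()  # hypothetical handler returning an HTTP status code
    elapsed_ms = (time.monotonic() - start) * 1000

    # Request and error counts feed the error-rate SLI ...
    statsd.increment("booking.requests", tags=[VERSION_TAG])
    if status >= 500:
        statsd.increment("booking.errors", tags=[VERSION_TAG])
    # ... while the latency histogram feeds the p95 latency SLI.
    statsd.histogram("booking.request.latency", elapsed_ms, tags=[VERSION_TAG])
    return status
```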
He also recommends scoping alerts and dashboards by version tag, making it possible to alert the appropriate team when the dark launch is not performing well and to easily compare the performance of the two releases.
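A version-scoped alert could then be created with Datadog's Python API client, as in the following sketch; the monitor query, metric names, and 1% threshold are illustrative assumptions, not the post's exact configuration:

```python
from datadog import api, initialize

initialize(api_key="<API_KEY>", app_key="<APP_KEY>")

# Alert when the dark launch's error rate over the last five minutes
# exceeds 1%; the metric names and version tag are hypothetical.
api.Monitor.create(
    type="metric alert",
    query=(
        "sum(last_5m):sum:booking.errors{version:dark-launch}.as_count() / "
        "sum:booking.requests{version:dark-launch}.as_count() > 0.01"
    ),
    name="Dark launch error rate above 1%",
    message="Error rate for the dark launch exceeded 1%. Notify @booking-team",
    tags=["version:dark-launch"],
)
```

Scoping the query to the version tag ensures the released version's traffic never trips the dark launch's alert, and vice versa.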
Gottschling notes that an application can appear successful in terms of SLIs but still run into bugs while processing data. Therefore, it’s paramount to watch for unexpected responses to client requests, with the goal of identifying and fixing issues in a dark launch before they contribute to a poor end-user experience.
This is done by analyzing logs and running automated tests. As he points out, investigating bugs in the responses of a dark launch requires not only designing the service to emit structured log messages (e.g., JSON) that include key information from the payload, but also tagging logs by version. That way, it’s possible to compare responses from the dark launch with responses from the released version of the service.
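A minimal sketch of such logging in Python, assuming a hypothetical booking logger and version tag, could attach the version and key payload fields to every JSON log line:

```python
import json
import logging
import sys

SERVICE_VERSION = "dark-launch"  # hypothetical version tag


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line tagged by version."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "version": SERVICE_VERSION,
            # Key payload fields attached by the caller, if any.
            **getattr(record, "payload", {}),
        })


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("booking")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Example: log the key fields of a booking response.
logger.info(
    "booking created",
    extra={"payload": {"room": "4A", "start": "2020-04-01T10:00:00Z"}},
)
```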
Gottschling indicates that determining whether a dark launch is ready for prime time requires automated tests in addition to SLI-based alerts and dashboards.
Automated tests check whether a dark launch returns the expected results for a predefined set of user interactions. The tests should be labeled by version so that failures in the dark launch can be distinguished from failures in the released version of the service. They can also be easily integrated into a CI/CD pipeline to query the service and evaluate the responses.
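As a sketch, such tests could be written with pytest and parametrized by version so each failure is labeled with the version it came from; the endpoints and response shape below are hypothetical:

```python
import pytest
import requests

# Hypothetical endpoints for the two versions of the booking service.
ENDPOINTS = {
    "release": "http://booking.example.com/api/bookings",
    "dark-launch": "http://booking-dark.example.com/api/bookings",
}


@pytest.mark.parametrize("version", ENDPOINTS)
def test_create_booking(version):
    """The same predefined interaction must succeed on both versions.

    Parametrizing by version labels each failure with the version it
    came from, e.g. test_create_booking[dark-launch].
    """
    response = requests.post(
        ENDPOINTS[version],
        json={"room": "4A", "start": "2020-04-01T10:00:00Z"},
        timeout=5,
    )
    assert response.status_code == 201
    assert response.json()["room"] == "4A"
```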
Gottschling discusses the importance of scaling up production infrastructure for the dark launch, and notes that it is equally important to monitor that infrastructure to ensure the dark launch is not negatively impacting it.
This is mainly for two reasons: first, to make sure the production infrastructure has enough capacity to handle the dark launch; second, to discover any unintended interactions between the dark launch and other parts of the production infrastructure.
Dashboards and automated alerts are created to monitor key resource metrics, which reveal when the infrastructure is stressed beyond its resource utilization thresholds. Automated alerts on these thresholds should be tagged by version to provide context.
Also, the infrastructure resources used to manage the deployment should be monitored. For example, if a reverse proxy is used to mirror requests between the released version and the dark launch, it must have enough CPU and memory to handle the volume of requests.
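For context, request mirroring might look like the following minimal Python sketch, in which a proxy returns the released version's response to the client while sending a copy of each request to the dark launch and discarding its response; the upstream addresses are hypothetical, and a production setup would more likely rely on a dedicated proxy such as NGINX's mirror module:

```python
import requests
from flask import Flask, Response, request

app = Flask(__name__)

PROD_URL = "http://booking-release:8080"  # hypothetical upstreams
DARK_URL = "http://booking-dark:8080"


@app.route("/", defaults={"path": ""}, methods=["GET", "POST", "PUT", "DELETE"])
@app.route("/<path:path>", methods=["GET", "POST", "PUT", "DELETE"])
def mirror(path):
    body = request.get_data()
    # The released version serves the actual response ...
    prod = requests.request(
        request.method, f"{PROD_URL}/{path}",
        params=request.args, data=body, timeout=5,
    )
    # ... while the dark launch gets a copy whose response is discarded
    # and whose failures are swallowed, so they never reach the end user.
    try:
        requests.request(
            request.method, f"{DARK_URL}/{path}",
            params=request.args, data=body, timeout=1,
        )
    except requests.RequestException:
        pass
    return Response(prod.content, status=prod.status_code)
```

Because every incoming request is effectively handled twice, the proxy's CPU and memory usage grows with the mirrored traffic, which is why its resources need monitoring.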
In addition, Gottschling points out that it’s important to protect user data in production and to look for unintended interactions between the dark launch and the persistence layer. There are two main ways to protect user data: either configure the dark launch with read-only access to the persistence layer, or run separate instances of the persistence layer for the sole purpose of interacting with the dark launch.
To identify unexpected interactions, the application code should be instrumented for tracing, and the traces visualized to display a map of requests between services. The dark launch’s persistence layer should also be monitored to spot any unusually heavy reads or writes coming from either the released service or the dark launch.
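As an illustration, instrumenting a Python service with Datadog's ddtrace library might look like the following sketch; the operation name, version tag, and database helper are hypothetical:

```python
from ddtrace import tracer


def fetch_from_database(booking_id):
    """Stand-in for the real persistence-layer call."""
    return {"id": booking_id, "room": "4A"}


def lookup_booking(booking_id):
    # Each traced call becomes a span in Datadog APM; tagging the span
    # by version lets the service map separate dark-launch traffic.
    with tracer.trace("booking.lookup", service="booking") as span:
        span.set_tag("version", "dark-launch")  # hypothetical tag value
        return fetch_from_database(booking_id)
```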
Dark launching is a deployment strategy for assessing the additional load and performance impact on a system before publicly announcing a new capability or feature. End-users usually don’t notice any difference.
The full post is available on Datadog’s blog, which contains a variety of posts describing best practices and approaches for monitoring cloud platforms, as well as posts on non-technical topics.