AWS Publishes Best Practices Guide for Operational Dashboards

AWS recently added to the Amazon Builders' Library their best practices for building dashboards for operational visibility. The document includes a detailed description of the different types of dashboards that exist at Amazon, as well as a discussion of the best design practices used to create dashboards.

John O'Shea, principal engineer at AWS, authored the addition to the Builders' Library. He indicates that AWS uses dashboards as one mechanism to stay informed as to the state of their services. Dashboards provide a human-facing view into how systems are operating. However, O'Shea clarifies that "we have found that any operational process that requires a manual review of dashboards will fail due to human error, no matter how frequently the dashboards are reviewed." To address this, they focus on creating automated alarms to evaluate the most important data that is emitted by systems. In some cases, these alarms trigger automatic remediation workflows.
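The automated evaluation O'Shea describes can be illustrated with a small threshold check. This is a minimal sketch, not Amazon's implementation: the function name `alarm_state`, its parameters, and the "M out of N datapoints" rule are assumptions chosen for illustration.

```python
from typing import Sequence


def alarm_state(
    datapoints: Sequence[float],
    threshold: float,
    datapoints_to_alarm: int,
    evaluation_periods: int,
) -> str:
    """Evaluate the most recent datapoints against a static threshold.

    Hypothetical rule: alarm when at least `datapoints_to_alarm` of the
    last `evaluation_periods` datapoints are above `threshold`.
    """
    if evaluation_periods <= 0:
        raise ValueError("evaluation_periods must be positive")
    if not 0 < datapoints_to_alarm <= evaluation_periods:
        raise ValueError("datapoints_to_alarm must be in 1..evaluation_periods")

    # Only the most recent periods are considered.
    window = list(datapoints)[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"

    breaches = sum(1 for value in window if value > threshold)
    return "ALARM" if breaches >= datapoints_to_alarm else "OK"


# Example: p99 latency in milliseconds, one datapoint per period.
latency_p99_ms = [120.0, 480.0, 510.0, 530.0]
state = alarm_state(latency_p99_ms, threshold=500.0,
                    datapoints_to_alarm=2, evaluation_periods=3)
```

Requiring several breaching datapoints rather than one is a common way to avoid alarming on a single noisy sample; the state returned here could then feed a paging or remediation workflow.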

Amazon does use dashboards during on-call incidents, where operators rely on them to troubleshoot and isolate issues. The other primary use that O'Shea mentions is the weekly operations review meeting, which is attended by senior leaders, managers, and engineers. Using a tool they call the wheel of fortune, a team is selected at random to display and discuss customer experience and service-level objectives via their system dashboards.

To assist in designing consistent and useful dashboards, Amazon has created a common set of design principles to follow. To improve and evolve these principles, they have found ways to measure their effect. One such measurement is how quickly a new operator can get up to speed on understanding and using the dashboards. This metrics-driven approach aligns well with the techniques and strategies that Camille Fournier discussed in her recent InfoQ interview on how internal platform teams can deliver more effective products.

One principle is to work backwards from the expected end user to ensure that the dashboard meets their needs. As O'Shea notes, "It’s easy to build a dashboard that makes total sense to its creator. However, this dashboard might not provide value to users." Because they have found that users tend to interpret the graphs that render first as the most important, the convention is to place the most important graphs at the top. For web services, these tend to be aggregate or summary availability graphs and end-to-end latency percentile graphs.

Some of the other design principles include:

  • Ensure a consistent time zone for display (and display it on the dashboard)
  • Lay out graphs for the expected minimum display resolution
  • Enable the ability to adjust the time interval and metric period
  • Annotate the graphs with alarm thresholds and goals
  • Use alarm status, simple numbers, or time series graph widgets where appropriate
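Several of the principles above (most important graphs first, an explicit time zone, annotated thresholds and goals) can be captured in a dashboard definition. The following is a minimal sketch under stated assumptions: the `Graph` and `Dashboard` classes, their fields, and the example service name are hypothetical and do not correspond to any AWS API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Graph:
    title: str
    priority: int  # hypothetical convention: lower number = more important
    alarm_threshold: Optional[float] = None
    goal: Optional[float] = None


@dataclass
class Dashboard:
    name: str
    time_zone: str  # shown in the header so every viewer reads the same clock
    graphs: List[Graph] = field(default_factory=list)

    def layout(self) -> List[str]:
        """Return display rows: a header, then graphs ordered by priority."""
        rows = [f"{self.name} (all times in {self.time_zone})"]
        for graph in sorted(self.graphs, key=lambda g: g.priority):
            notes = []
            if graph.alarm_threshold is not None:
                notes.append(f"alarm at {graph.alarm_threshold}")
            if graph.goal is not None:
                notes.append(f"goal {graph.goal}")
            suffix = f" [{', '.join(notes)}]" if notes else ""
            rows.append(f"{graph.title}{suffix}")
        return rows


dashboard = Dashboard(
    name="orders-api",  # hypothetical service name
    time_zone="UTC",
    graphs=[
        Graph("Host CPU utilization", priority=3),
        Graph("End-to-end latency p99 (ms)", priority=2,
              alarm_threshold=500.0, goal=300.0),
        Graph("Availability (%)", priority=1, alarm_threshold=99.9),
    ],
)
rows = dashboard.layout()
```

Sorting by an explicit priority keeps availability and latency graphs at the top regardless of the order in which they were added to the definition.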

O'Shea discusses the different types of dashboards in use at Amazon. The most important and widely used type is the customer experience dashboard, which is designed to be used by a wide range of stakeholders, from service operators to management. This style of dashboard presents metrics on overall service health and current progress against goals. The data is presented to answer questions like "How many customers are impacted?" and "Which customers are most impacted?"

How the various dashboards provide views into different system layers (credit: Amazon)

Dashboards are also created at the system level, at the service level, and to audit a service across all regions. These allow for different views on how a system or service is operating. The system dashboard should contain enough information to see how the system and any endpoints are behaving, whereas the service-level dashboard should drill into a single service instance, providing a narrow view that allows for deeper troubleshooting.

The guide finishes with a discussion on dashboard maintenance. According to O'Shea:

Maintaining and updating dashboards is ingrained in our development process. Before completing changes, and during code reviews, our developers ask, "Do I need to update any dashboards?" They are empowered to make changes to dashboards before the underlying changes are deployed.

This approach aims to make creating and maintaining dashboards a part of the culture. As Tyler Treat shared in a recent InfoQ interview, "As with most things, it starts with culture. You have to promote a culture of observability. If teams aren’t treating instrumentation as a first-class concern in their systems, no amount of tooling will help."

In addition, during post-mortem discussions, teams are encouraged to investigate whether improvements to dashboards, and subsequently to the automated alarms, could have preempted the issue or allowed for faster identification. These dashboard changes are deployed using the same tooling as their services, with version control and infrastructure as code being core practices.
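One way to picture dashboards managed under version control is to store each definition as canonical JSON and compare fingerprints between the reviewed and deployed versions. This is a minimal sketch, assuming a dashboard is represented as a plain dictionary; the function names and the example fields are hypothetical, and the article does not describe Amazon's actual tooling.

```python
import hashlib
import json


def canonical_json(definition: dict) -> str:
    """Serialize with sorted keys so equal definitions produce equal text."""
    return json.dumps(definition, sort_keys=True, indent=2) + "\n"


def fingerprint(definition: dict) -> str:
    """Return a SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_json(definition).encode("utf-8")).hexdigest()


def has_drifted(reviewed: dict, deployed: dict) -> bool:
    """True when the deployed dashboard differs from the reviewed one."""
    return fingerprint(reviewed) != fingerprint(deployed)


# Hypothetical definitions: same content, different key order.
reviewed = {"name": "orders-api", "time_zone": "UTC", "latency_alarm_ms": 500}
deployed = {"time_zone": "UTC", "latency_alarm_ms": 500, "name": "orders-api"}
edited_by_hand = {"name": "orders-api", "time_zone": "UTC", "latency_alarm_ms": 800}

same = has_drifted(reviewed, deployed)
changed = has_drifted(reviewed, edited_by_hand)
```

Because the serialization is deterministic, a change to a threshold shows up as a small, reviewable diff in a code review, and a mismatch in fingerprints flags a dashboard that was edited outside the deployment pipeline.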

The full article is a part of the Amazon Builders' Library. The library contains a variety of documents describing and exploring how Amazon builds, maintains, and operates their software.
