Amazon has updated its AWS Well-Architected Framework (PDF) based on customer feedback, adding a new pillar, Operational Excellence.
The AWS Well-Architected Framework contains a set of best practices for building and operating secure, efficient, and cost-effective systems in the cloud. Amazon put the architectural guidelines together for AWS customers, but they are generally useful on any cloud platform.
The framework was first published a year ago and has now been updated to incorporate customer feedback and lessons learned from using it. For those not familiar with the framework, we recommend reading the initial InfoQ article, because this post mentions only some of the notable changes introduced in this year’s version.
Besides the four original pillars (Security, Reliability, Performance Efficiency, and Cost Optimization), the AWS team of architects has introduced a fifth: Operational Excellence, which represents the “ability to run and monitor systems to deliver business value and to continually improve supporting processes and procedures.” The best practices recommended to ensure operational excellence of production workloads are:
- Perform operations with code: automate operations as much as possible.
- Align operations processes to business objectives: collect only those metrics that support business needs, and respond appropriately to operational events.
- Make regular, small, incremental changes: workloads should consist of components that are updated regularly in small steps without taking down services, and operations should be able to roll back those changes if necessary.
- Test for responses to unexpected events: inject failures in the system to see how it reacts to unexpected operational events. Develop clear procedures to react to such events.
- Learn from operational events and failures: monitor and analyze how a system behaves during various operational events in order to improve it.
- Keep operations procedures current: update procedures and guidelines to accurately reflect the current system as it evolves over time.
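As an illustration of the "regular, small, incremental changes" practice, here is a minimal Python sketch, not taken from the AWS paper. It assumes a configuration held as a dictionary and a caller-supplied health check; the names `apply_incrementally` and `is_healthy` are hypothetical. Each change is applied on its own, and the first change that fails the check is discarded so the system stays on the last known-good state.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, str]


def apply_incrementally(
    config: Config,
    changes: List[Config],
    is_healthy: Callable[[Config], bool],
) -> Tuple[Config, int]:
    """Apply small changes one at a time; stop and roll back on the first failure.

    Returns the last known-good configuration and the number of changes kept.
    Hypothetical helper for illustration only.
    """
    current = dict(config)  # work on a copy so the caller's config is untouched
    applied = 0
    for change in changes:
        candidate = {**current, **change}  # one small step
        if not is_healthy(candidate):
            # Roll back: drop the candidate and keep the previous good state.
            break
        current = candidate
        applied += 1
    return current, applied


if __name__ == "__main__":
    def healthy(cfg: Config) -> bool:
        # Assumed health rule for the demo: at least one worker must remain.
        return cfg.get("workers") != "0"

    steps = [{"version": "2"}, {"workers": "0"}, {"version": "3"}]
    print(apply_incrementally({"version": "1", "workers": "2"}, steps, healthy))
```

Stopping at the first unhealthy step, rather than skipping it and continuing, is a design choice made here so that later changes are never applied on top of a state nobody has validated.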
The Well-Architected Framework comes with a number of design principles meant to create good systems in the cloud:
- Stop guessing your capacity needs: use the cloud’s ability to scale rather than guessing capacity needs and risking inadequate capacity.
- Test systems at production scale: scale up the system to what it would be in production and test it to see how it works in the real environment. Decommission the extra resources once the test is over.
- Automate to make architectural experimentation easier: automate the entire process of creating a system so that it can be replicated easily. Automation also makes it simple to return to a previous setup.
- Allow for evolutionary architectures: automation enables architects to evolve systems as needed, easily testing and setting up new configurations.
- Data-driven architectures: collect the operational data needed to evaluate how architectural changes affect workloads. The data can also be used to tune the automation code.
- Improve through game days: inject failures to simulate operational events in production, understand how the system behaves when they take place, and correct it if necessary.
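The failure-injection idea behind game days can be sketched in a few lines of Python. This is an illustrative toy rather than an AWS tool or anything prescribed by the paper: `with_fault_injection`, `call_with_retries`, and `InjectedFailure` are hypothetical names, and the failure rate and random source are supplied by the caller so that an exercise can be repeated deterministically.

```python
import random
from typing import Any, Callable


class InjectedFailure(Exception):
    """Raised by the injector to simulate an unexpected operational event."""


def with_fault_injection(
    func: Callable[..., Any], failure_rate: float, rng: Any
) -> Callable[..., Any]:
    """Wrap func so that a fraction of calls fail with InjectedFailure.

    rng is any object with a random() method returning a float in [0, 1),
    for example random.Random(seed).
    """
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated outage")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func: Callable[[], Any], attempts: int) -> Any:
    """Call func up to `attempts` times, re-raising the last injected failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFailure as exc:
            last_error = exc  # remember the failure and try again
    raise last_error


if __name__ == "__main__":
    # Game-day style run: inject failures into 30% of calls, count survivors.
    flaky = with_fault_injection(lambda: "ok", 0.3, random.Random(42))
    survived = 0
    for _ in range(100):
        try:
            call_with_retries(flaky, attempts=3)
            survived += 1
        except InjectedFailure:
            pass
    print(f"{survived}/100 requests succeeded despite injected failures")
```

Running the demo with different failure rates and retry counts gives a rough feel for how much resilience a simple retry policy buys, which is the kind of observation a game day is meant to surface.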
The framework also includes questions and answers for all five pillars on which it is built, providing guidance on how to address practical issues such as protecting against unauthorized use of the AWS root account, planning network topology, responding to unplanned operational events, and many others. We recommend reading the paper (AWS Well-Architected Framework) for an in-depth view of what it takes to create a successful system in the cloud.