How to Achieve a Resilient Architecture

To manage systems at scale, you must be able to push them almost to the breaking point and still recover, and you must embrace failures, Adrian Hornsby writes in two blog posts in which he shares his experience from more than a decade of working with large-scale systems and the patterns he has found useful.

Hornsby, a technical evangelist at AWS, notes that for smaller systems (up to a few tens of instances), a fully operational mode is commonly the normal state, and failures are rare. Achieving this in large-scale systems is almost impossible; instead, partial failure is the norm. He notes that for most web-based applications this is not a huge problem, although it may affect revenue. To mitigate this, his recommendation is to find a good balance between the cost of being resilient and the possible loss of revenue.

Hornsby describes several patterns that he believes can help in building resilient architectures, but he emphasizes that resilience is not just about software. The infrastructure layer, network and application design are also important, together with people and culture.

Redundancy
One of the most important things for Hornsby when deploying an application in the cloud is redundancy – increasing the availability by deploying several instances, possibly in different zones or regions.
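
A minimal sketch of what this can look like with boto3 and EC2, assuming a prepared machine image; the region, zone names, image id and instance type are placeholders rather than values from Hornsby's posts:

```python
import boto3

# One EC2 client for the region the application runs in (placeholder region).
ec2 = boto3.client("ec2", region_name="eu-west-1")

# Deploy one instance per availability zone so that losing a single zone
# does not take the whole application down.
for az in ["eu-west-1a", "eu-west-1b", "eu-west-1c"]:
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder image id
        InstanceType="t3.micro",           # placeholder instance type
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": az},
    )
```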

Autoscaling
The next step for Hornsby is to enable automatic adjustment of an application's capacity according to demand, a mechanism commonly available today. Different autoscaling technologies work at different speeds, so it is important to choose the one that fits the needs of the application. He also points out that scaling is much faster today because of container platforms and functions.
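
As a rough illustration of the idea (not of any specific autoscaling product), a target-tracking rule can be reduced to a few lines: adjust capacity so that observed utilisation moves back towards a chosen target, within fixed bounds.

```python
import math

def desired_capacity(current, utilisation, target=0.6, minimum=2, maximum=20):
    """Toy target-tracking rule: scale so utilisation converges on the target."""
    if utilisation <= 0:
        return minimum
    wanted = math.ceil(current * utilisation / target)
    return max(minimum, min(maximum, wanted))

# Four instances running at 90% CPU against a 60% target -> scale out to six.
print(desired_capacity(4, 0.9))  # 6
```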

Infrastructure as code
Repeatability is one important benefit of infrastructure as code; Hornsby compares manually configuring a datacentre for multiple environments with writing a template once and executing it automatically as many times as needed.

If, or when, an environment is compromised in some way, or even deleted, you can restore all data from backups and use the template to rebuild everything. This will be much faster and less risky than doing all the work manually.
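
A deliberately small sketch of the underlying idea: the environment is described as data, and one reconcile function can be executed as many times as needed, whether for a new environment or for a rebuild after an incident. The provider object and its methods here are hypothetical, standing in for whatever tooling is actually used.

```python
# Desired state of the environment, declared as data (values are illustrative).
DESIRED_STATE = {
    "web":    {"count": 3, "image": "web:1.4"},
    "worker": {"count": 2, "image": "worker:2.0"},
}

def apply(template, provider):
    """Idempotently reconcile the running environment with the template."""
    for name, spec in template.items():
        running = provider.running_instances(name)   # hypothetical provider call
        for _ in range(max(0, spec["count"] - len(running))):
            provider.launch(name, spec["image"])      # hypothetical provider call

# Rebuilding a compromised or deleted environment is just another apply() run,
# followed by restoring data from backups.
```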

Hornsby also sees infrastructure as code as a form of knowledge sharing. The code can be treated the same way as other code, with teams working on it together and perhaps using pull requests to verify changes.

Immutable infrastructure
Immutable infrastructure means that all components are replaced on every deployment; no updates are done in place. Hornsby notes two rules based on the immutable server pattern, illustrated in the sketch after the list:

  • No updates should ever be done on live systems.
  • You must always start from a new instance of the resource being provisioned.
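
A brief sketch of what the two rules can look like in practice with boto3, assuming a freshly baked machine image for the new version; health checks and traffic draining are left out for brevity:

```python
import boto3

ec2 = boto3.client("ec2")

def deploy_immutable(new_image_id, old_instance_ids, instance_type="t3.micro"):
    """Replace running instances instead of updating them in place."""
    # Rule 2: always start from a new instance, launched from the new image.
    count = max(1, len(old_instance_ids))
    response = ec2.run_instances(
        ImageId=new_image_id,
        InstanceType=instance_type,
        MinCount=count,
        MaxCount=count,
    )
    new_ids = [i["InstanceId"] for i in response["Instances"]]

    # Rule 1: no updates on live systems; retire the old instances once the
    # new ones are healthy (health checking omitted in this sketch).
    if old_instance_ids:
        ec2.terminate_instances(InstanceIds=old_instance_ids)
    return new_ids
```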

When working with immutable infrastructure, Hornsby recommends using canary deployment to reduce the risk of failure when new versions of an application are deployed. Using this technique, you can test in a real production environment and implement a very fast rollback if needed.
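
A toy illustration of the routing side of a canary deployment; in practice the traffic weights usually live in a load balancer or service mesh rather than in application code, but the principle is the same: send a small, adjustable share of requests to the new version and keep rollback as cheap as setting the weight back to zero.

```python
import random

class CanaryRouter:
    """Route a small, adjustable share of requests to the canary version."""

    def __init__(self, canary_weight=0.05):
        self.canary_weight = canary_weight  # start with e.g. 5% of traffic

    def pick_backend(self):
        if random.random() < self.canary_weight:
            return "app-v2-canary"
        return "app-v1-stable"

    def rollback(self):
        self.canary_weight = 0.0   # very fast rollback: no traffic to the canary

    def promote(self):
        self.canary_weight = 1.0   # canary looks healthy: shift all traffic
```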

Stateless application
To be able to use autoscaling and immutable infrastructure, the application must be stateless. This means that all requests must be handled independently of earlier requests or sessions, and no information may be stored on local disks or in memory. Sharing state within an autoscaling group can only be done using an in-memory object caching system like Memcached or a similar product.
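
A minimal sketch of externalising session state, assuming the third-party pymemcache client and a reachable Memcached endpoint; the hostname, key format and TTL are illustrative:

```python
import json
from pymemcache.client.base import Client

# Shared cache endpoint reachable by every instance in the autoscaling group.
cache = Client(("memcached.internal", 11211))

def save_session(session_id, data, ttl=1800):
    # Keep session state out of local memory and disk so any instance
    # can serve the user's next request.
    cache.set(f"session:{session_id}", json.dumps(data), expire=ttl)

def load_session(session_id):
    raw = cache.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```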

Avoiding cascading failures
In Hornsby's experience, a common cause of outages is cascading failures, where a small, local failure propagates through different types of dependencies and takes down an entire system. A common example is overload: one cluster goes down and all of its traffic moves to another cluster. To avoid these kinds of failures he recommends a few patterns, several of which are combined in the sketch after the list, including:

  • Timeouts. A database slowing down but with the same amount of incoming traffic may quickly make a system fail. Timing out faster may mean the service is degraded instead of failing.
  • Idempotent operations. Due to transient errors, clients may send the same request multiple times and this can cause failures. To avoid this, Hornsby favours the use of idempotent requests – requests that can be handled repeatedly without problems.
  • Service degradation and fall-backs. One option when dealing with a high load is to offer a variant of a service that is less demanding. One example is to return generic lists of information instead of a personalised list.
  • Rejection. The final act of self-defence is to start dropping requests, preferably less important ones first.
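
A sketch combining three of these patterns (timeouts, fall-backs and idempotent requests), assuming the third-party requests library and an order service that honours an Idempotency-Key header; the URLs and header name are illustrative:

```python
import uuid
import requests

GENERIC_LIST = ["top-sellers", "new-arrivals"]   # degraded but useful fall-back

def personalised_list(user_id):
    try:
        # Timeout: fail fast rather than let a slow dependency back up the system.
        resp = requests.get(f"https://recs.internal/users/{user_id}", timeout=0.5)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # Service degradation: return a generic list instead of an error.
        return GENERIC_LIST

def place_order(order, retries=3):
    # Idempotent operation: the same key on every retry lets the server
    # deduplicate the request if it was already processed.
    key = str(uuid.uuid4())
    for _ in range(retries):
        try:
            resp = requests.post(
                "https://orders.internal/orders",
                json=order,
                headers={"Idempotency-Key": key},
                timeout=1.0,
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            continue
    raise RuntimeError("order service unavailable")
```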

Hornsby wraps up by noting that when dealing with large-scale distributed systems in the cloud, intermittent errors are the norm. Knowing how to react to these errors can be difficult, but he recommends collecting statistics and using them to create thresholds for when to deal with the errors. Lastly, he emphasizes the importance of automation: to get resilient and reliable applications with well-tested deployments and fast spin-up times, you must automate as much as possible.
