Zonal Autoshift on AWS: Optimizing Infrastructure Reliability

Zonal autoshift, a new capability of Amazon Route 53 Application Recovery Controller, automatically shifts traffic away from an Availability Zone (AZ) when a potential failure is identified by the cloud provider. The service redirects the traffic back once the AZ failure is resolved.

The new zonal autoshift builds upon the capability introduced last year, enabling users to manually or programmatically initiate a zonal shift. As part of Amazon Route 53 Application Recovery Controller, the zonal shift feature allows users to redirect traffic away from an impaired AZ. Both options help customers manage and mitigate the impact of AZ failures, including power and networking outages, on applications and services.
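The programmatic path can be sketched in Python. This is a minimal illustration, not AWS's reference code: it assumes the boto3 `arc-zonal-shift` client and its `start_zonal_shift` operation, the helper name is hypothetical, and the load balancer ARN and zone ID are placeholders. Parameter names should be checked against the current SDK reference before use.

```python
def build_zonal_shift_request(resource_arn, away_from_zone_id,
                              expires_in="2h", comment="manual zonal shift"):
    """Assemble keyword arguments for a manual zonal shift (hypothetical helper)."""
    if not away_from_zone_id:
        raise ValueError("an Availability Zone ID is required")
    return {
        "resourceIdentifier": resource_arn,  # ARN of the load balancer
        "awayFrom": away_from_zone_id,       # AZ ID such as "use1-az1"
        "expiresIn": expires_in,             # assumed format: number plus "m" or "h"
        "comment": comment,
    }


# Placeholder values for illustration only.
request = build_zonal_shift_request(
    "arn:aws:elasticloadbalancing:us-east-1:111122223333:"
    "loadbalancer/app/example/0123456789abcdef",
    "use1-az1",
    comment="resilience drill",
)

# With AWS credentials configured, the call would look roughly like:
#   import boto3
#   boto3.client("arc-zonal-shift").start_zonal_shift(**request)
```

Building the request separately from the call keeps the sketch runnable without credentials and makes the shift parameters easy to review before anything touches production traffic.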

Sébastien Stormacq, principal developer advocate at AWS, explains how the new feature ensures a more resilient and reliable infrastructure in the cloud but warns:

Shifting traffic away from an Availability Zone is a delicate operation that must be carefully prepared. We built a series of safeguards to ensure we don’t degrade your application availability by accident. First, we have internal controls to ensure we shift traffic away from no more than one Availability Zone at a time. Second, we practice the shift on your infrastructure for 30 minutes every week. (...) Third, you can define two Amazon CloudWatch alarms to act as a circuit breaker during the practice run.

Source: AWS Blog

Customers can enable zonal autoshift for Application Load Balancers and Network Load Balancers that have cross-zone load balancing disabled, using the CLI, the console, or the AWS SDKs.

Zonal autoshift also introduces practice runs, which proactively test whether an application keeps sufficient capacity in the remaining zones once traffic is moved away from one of them: customers configure practice runs that execute zonal shifts and verify the workload's capacity tolerance.
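The capacity question a practice run answers can be approximated with simple arithmetic. The sketch below is illustrative only: it assumes traffic is spread evenly across zones and that capacity scales linearly, and both function names are hypothetical rather than part of any AWS API.

```python
def load_factor_after_shift(zone_count):
    """Relative load on each remaining zone after one zone is shifted away.

    Assumes traffic was evenly spread across all zones beforehand.
    """
    if zone_count < 2:
        raise ValueError("need at least two zones to shift away from one")
    return zone_count / (zone_count - 1)


def has_headroom(zone_count, peak_utilization):
    """True if the remaining zones stay at or below 100% utilization."""
    return peak_utilization * load_factor_after_shift(zone_count) <= 1.0


# With three zones, each surviving zone carries 1.5x its usual load.
three_zone_factor = load_factor_after_shift(3)
ok_at_60_percent = has_headroom(3, 0.60)      # 0.60 * 1.5 is below 1.0
ok_at_70_percent = has_headroom(3, 0.70)      # 0.70 * 1.5 exceeds 1.0
```

Under these assumptions, a three-zone deployment running above roughly 66% utilization per zone would not absorb the loss of one zone without scaling, which is the kind of gap a practice run is meant to surface.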

Customers can define blocks of time when automatic practice runs are not allowed, for example during local business hours. They can also define one CloudWatch alarm that prevents a practice run from starting and another that monitors workload health during the run and rolls it back if issues appear. The documentation provides further details on resource capacity prescaling and on resource types and restrictions.
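A practice run configuration with blocked windows and the two alarms might be assembled as follows. This is a hedged sketch: the field names (`blockedWindows`, `blockingAlarms`, `outcomeAlarms`) and the `Day:HH:MM-Day:HH:MM` UTC window format reflect the boto3 `arc-zonal-shift` client as understood here and should be verified against the SDK reference. The helper names and ARNs are hypothetical placeholders.

```python
WEEKDAYS = ("Mon", "Tue", "Wed", "Thu", "Fri")


def weekday_windows(start="09:00", end="17:00"):
    """Blocked windows covering business hours, Monday to Friday (times in UTC)."""
    return [f"{day}:{start}-{day}:{end}" for day in WEEKDAYS]


def build_practice_run_config(resource_arn, outcome_alarm_arn,
                              blocking_alarm_arn=None, blocked_windows=()):
    """Assemble keyword arguments for a practice run configuration."""
    config = {
        "resourceIdentifier": resource_arn,
        # Alarm watched during the run; if it fires, the run is ended.
        "outcomeAlarms": [
            {"type": "CLOUDWATCH", "alarmIdentifier": outcome_alarm_arn}
        ],
        "blockedWindows": list(blocked_windows),
    }
    if blocking_alarm_arn:
        # Optional alarm that prevents a practice run from starting.
        config["blockingAlarms"] = [
            {"type": "CLOUDWATCH", "alarmIdentifier": blocking_alarm_arn}
        ]
    return config


windows = weekday_windows()
config = build_practice_run_config(
    "arn:aws:elasticloadbalancing:us-east-1:111122223333:"
    "loadbalancer/app/example/0123456789abcdef",
    outcome_alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:example-health",
    blocking_alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:example-block",
    blocked_windows=windows,
)

# With AWS credentials configured, the calls would look roughly like:
#   import boto3
#   client = boto3.client("arc-zonal-shift")
#   client.create_practice_run_configuration(**config)
#   client.update_zonal_autoshift_configuration(
#       resourceIdentifier=config["resourceIdentifier"],
#       zonalAutoshiftStatus="ENABLED",
#   )
```

Blocking business hours first and relaxing the windows later matches the gradual rollout the article describes.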

The cloud provider recommends initially testing the resilience of a deployment manually through zonal shift before activating the new zonal autoshift. Stormacq suggests:

We recommend applying the crawl, walk, run methodology. First, you get started with manual zonal shifts to acquire confidence in your application. Then, you turn on zonal autoshift configured with practice runs outside of your business hours. Finally, you modify the schedule to include practice zonal shifts during your business hours. You want to test your application response to an event when you least want it to occur.

The "Using zonal autoshift to automatically recover from an AZ Impairment" re:Invent session has recently been published on YouTube.

Zonal autoshift is available at no extra cost in all regions except China and GovCloud.
