AWS has released a new EC2 auto recovery feature in the US East (N. Virginia) region, designed to increase instance availability by automatically recovering supported instances when a system impairment is detected.
The Elastic Compute Cloud (EC2) auto recovery feature can be triggered when an AWS CloudWatch system status check detects a problem with the underlying physical host that affects a running EC2 instance. Problems that cause system status checks to fail include loss of network connectivity, loss of system power, and software or hardware issues on the physical host.
The auto recovery feature automatically attempts to recover an affected EC2 instance on new underlying hardware, which removes the need to manually migrate to a new instance. The instance is rebooted as part of the recovery, so any in-memory data is lost. The recovered instance will be identical to the original, including the same instance identifier, IP address(es) and configuration.
Auto recovery for an instance is enabled by creating an AWS CloudWatch alarm, choosing the “EC2 Status Check Failed (System)” metric and selecting the “Recover this instance” action. If an AWS Identity and Access Management (IAM) account is used to create or modify a CloudWatch alarm, the corresponding Amazon EC2 permissions must be assigned to the account for the auto recovery feature to recover the instance: ec2:DescribeInstanceStatus, ec2:DescribeInstances, ec2:DescribeInstanceRecoveryAttribute and ec2:RecoverInstances.
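The alarm and permissions described above can be sketched in Python as plain data structures. This is a minimal illustration, not the documented procedure: the function names, the alarm name, and the period and evaluation-period values are assumptions chosen for the example. `StatusCheckFailed_System` is the CloudWatch metric name behind the console's "EC2 Status Check Failed (System)" label, and the recover action is expressed as an `arn:aws:automate:<region>:ec2:recover` ARN.

```python
def build_recovery_alarm_params(instance_id, region="us-east-1"):
    """Build parameters for a CloudWatch alarm that recovers one instance.

    Hypothetical helper: the alarm name, period and evaluation periods
    are illustrative assumptions, not values from the AWS announcement.
    """
    return {
        "AlarmName": "auto-recover-" + instance_id,  # assumed naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 60,               # seconds per evaluation window (assumed)
        "EvaluationPeriods": 2,     # consecutive failing windows (assumed)
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # The "Recover this instance" action, expressed as an ARN.
        "AlarmActions": ["arn:aws:automate:%s:ec2:recover" % region],
    }


def build_recovery_iam_policy():
    """Build an IAM policy document granting the four EC2 permissions
    listed in the article. Resource "*" is an assumption for brevity."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ec2:DescribeInstanceStatus",
                    "ec2:DescribeInstances",
                    "ec2:DescribeInstanceRecoveryAttribute",
                    "ec2:RecoverInstances",
                ],
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    import json

    # "i-1234abcd" is a placeholder instance identifier.
    print(json.dumps(build_recovery_alarm_params("i-1234abcd"), indent=2))
    print(json.dumps(build_recovery_iam_policy(), indent=2))
```

With the boto3 library installed and credentials configured, the alarm parameters could be passed to a CloudWatch client as `put_metric_alarm(**params)`. Keeping the parameters as a plain dictionary makes them easy to inspect before anything is created.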
The AWS EC2 documentation states that currently the new EC2 auto recovery action is only supported on:
- C3, M3, R3, and T2 instance types.
- Instances in the US East (N. Virginia) region.
- Instances in a VPC.
- Instances with shared tenancy (where the tenancy attribute of the instance is set to default).
- Instances that use Amazon EBS storage exclusively.
The recovery action is not supported for EC2-Classic instances, dedicated tenancy instances, and instances that use any instance store volumes.
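The support conditions listed above lend themselves to a simple pre-flight check. The following is a rough sketch only: the function name, its parameters, and the way instance attributes are passed in are hypothetical, and the supported families and region are hard-coded from the list in this article rather than queried from AWS.

```python
# Values taken from the support list above; they reflect the feature at launch.
SUPPORTED_FAMILIES = {"c3", "m3", "r3", "t2"}
SUPPORTED_REGIONS = {"us-east-1"}  # US East (N. Virginia)


def recovery_blockers(instance_type, region, in_vpc, tenancy, has_instance_store):
    """Return a list of reasons why auto recovery would not be supported.

    Hypothetical helper: an empty list means every condition from the
    article's support list is met.
    """
    blockers = []
    family = instance_type.split(".")[0].lower()  # "m3.medium" -> "m3"
    if family not in SUPPORTED_FAMILIES:
        blockers.append("unsupported instance type: " + instance_type)
    if region not in SUPPORTED_REGIONS:
        blockers.append("unsupported region: " + region)
    if not in_vpc:
        blockers.append("instance is not in a VPC (EC2-Classic)")
    if tenancy != "default":
        blockers.append("tenancy is not shared (default): " + tenancy)
    if has_instance_store:
        blockers.append("instance uses instance store volumes")
    return blockers


if __name__ == "__main__":
    print(recovery_blockers("m3.medium", "us-east-1", True, "default", False))
    print(recovery_blockers("m1.small", "eu-west-1", False, "dedicated", True))
```

Returning the list of blockers, rather than a single boolean, makes it clear which of the conditions an instance fails to meet.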
The automatic recovery process will attempt to recover the affected instance for up to three separate failures per day. If the instance system status check failure persists, the AWS documentation recommends that the instance be manually stopped and started. The instance may subsequently be retired if automatic recovery fails and hardware degradation is determined to be the root cause of the original system status check failure.
The AWS auto recovery documentation states the following issues can cause automatic recovery of an instance to fail:
- Temporary, insufficient capacity of replacement hardware.
- The instance has attached instance store storage, which is an unsupported configuration for automatic instance recovery.
- There is an ongoing Service Health Dashboard event that prevented the recovery process from successfully executing. Please refer to http://status.aws.amazon.com for the latest service availability information.
- The instance has reached the maximum daily allowance of three recovery attempts.
When the auto recovery process is complete, an email notification that includes the status of the recovery attempt and any further instructions is sent.
Because the EC2 instance virtual machine (VM) is rebooted during the auto recovery process, this feature is not the same as VM live migration, as offered by Google Compute Engine (GCE) or KVM, in which guest VMs should not experience noticeable downtime.
More information on the AWS EC2 auto recovery feature, including full instructions on how to enable this feature, can be found in the AWS auto recovery documentation.