Last week Amazon Web Services and Rackspace notified customers that their servers would be rebooted as a part of Xen hypervisor patching.
According to an official AWS blog post, an undisclosed security flaw in the Xen hypervisor impacted 10% of the servers running in nine regions. This translates to a significant number of VMs running a variety of customer workloads. A day later, Rackspace announced that certain VM types needed to be rebooted as a part of scheduled maintenance.
This is the second time that AWS has scheduled reboots as a part of its cloud infrastructure maintenance. In December 2011, Amazon was criticized for forcing customers to reboot their EC2 instances to receive patch updates. The official Amazon EC2 maintenance help page states that two kinds of reboots can be required as part of Amazon EC2 scheduled maintenance: instance reboots and system reboots. Instance reboots are reboots of the virtual instance and are equivalent to an operating system reboot. System reboots require a reboot of the underlying physical server hosting an instance.
RightScale, the multi-cloud management platform, provided additional guidance on handling the reboots. According to a blog post, customers running workloads that can tolerate a short reboot need not do anything. Amazon will reboot the instances automatically, and the VMs will resume normal operation afterwards.
It is common for cloud providers to schedule downtime for patching and maintaining the physical hosts running the cloud infrastructure. Some recent developments enable IaaS providers to support transparent maintenance. Google Cloud Platform is one of the first to support transparent maintenance on Compute Engine, its IaaS offering. Technically called Live Migration, this feature relocates customer VMs from the hosts that are being patched to new physical hosts without shutting them down. This enables Google to perform patching, upgrades and maintenance of its data centers without scheduled downtime or reboots of customer VMs. VMware, one of the recent entrants, also supports live migration on vCloud Air, its public cloud. Having gained experience developing vMotion, a feature that moves live VMs across physical hosts running the vSphere platform, VMware has extended that capability to its cloud platform. Microsoft also supports live migration of VMs on its Hyper-V platform.
After the reboots, Taylor Rhodes, CEO and president of Rackspace, apologized to customers. He said, “This maintenance affected nearly a quarter of our 200,000-plus customers, and in the course of it, we dropped a few balls. Some of our reboots, for example, took much longer than they should. And some of our notifications were not as clear as they should have been. We are making changes to address those mistakes. And we welcome your feedback on how we can better serve you.”
AWS also published a follow-up post on its blog to highlight best practices for building resilient applications. Cloud providers, including AWS, recommend that customer workloads running on cloud platforms be designed to be resilient to reboots and restarts, which not only saves customers from manual intervention after a reboot but also makes the applications more scalable.
Jeff Barr, chief evangelist at Amazon, had the following suggestions:
- Run instances in two or more Availability Zones.
- Pay attention to your Inbox and to the alerts on the AWS Management Console. Make sure that you fill in the "Alternate Contacts" in the AWS Billing Console.
- Review the personalized assessment of your architecture in the Trusted Advisor, then open up AWS Premium Support Cases to get engineering assistance as you implement architectural best practices.
- Use Chaos Monkey to induce various kinds of failures in a controlled environment.
- Examine and consider expanding your use of Amazon Route 53 and Elastic Load Balancing checks to ensure that web traffic is routed to healthy instances.
- Use Auto Scaling to keep a defined number of healthy instances up and running.
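
To make the first and last suggestions concrete, here is a minimal Python sketch of the general idea: spread instances across zones, route traffic only to healthy instances, and replace unhealthy ones until a desired count is restored. It is a self-contained simulation, not AWS code. The `Instance` and `Fleet` classes, the zone names, and the instance ID format are all hypothetical and do not correspond to any AWS API.

```python
from dataclasses import dataclass
from itertools import count
from typing import List


@dataclass
class Instance:
    """Hypothetical stand-in for a VM; not an AWS object."""
    instance_id: str
    zone: str
    healthy: bool = True


class Fleet:
    """Toy model of a group that keeps `desired` healthy instances
    spread across several zones (illustrative only)."""

    def __init__(self, zones: List[str], desired: int) -> None:
        self.zones = list(zones)
        self.desired = desired
        self.instances: List[Instance] = []
        self._ids = count(1)
        self.reconcile()

    def _least_loaded_zone(self) -> str:
        # Pick the zone that currently holds the fewest instances.
        counts = {zone: 0 for zone in self.zones}
        for inst in self.instances:
            counts[inst.zone] += 1
        return min(self.zones, key=lambda zone: counts[zone])

    def routable(self) -> List[Instance]:
        # A load balancer health check would send traffic only here.
        return [inst for inst in self.instances if inst.healthy]

    def reconcile(self) -> None:
        # Drop unhealthy instances, then launch replacements until
        # the desired count is reached again.
        self.instances = self.routable()
        while len(self.instances) < self.desired:
            zone = self._least_loaded_zone()
            new_id = f"i-{next(self._ids):04d}"
            self.instances.append(Instance(new_id, zone))


# Start with four instances balanced over two made-up zones.
fleet = Fleet(["zone-a", "zone-b"], desired=4)

# Simulate a host reboot that takes out every instance in zone-a.
for inst in fleet.instances:
    if inst.zone == "zone-a":
        inst.healthy = False

# Traffic keeps flowing to the instances in the other zone.
survivors = fleet.routable()

# The group then replaces the failed instances.
fleet.reconcile()
```

In this simulation the application keeps serving from `zone-b` while `zone-a` is rebooted, and `reconcile()` restores the full count afterwards. A real deployment would rely on the provider's own health checks and scaling service rather than a loop like this one.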