The US-EAST Region of Amazon's Elastic Compute Cloud experienced heavy outages today. A lot of high-profile sites were down or at least affected, Reddit, Foursquare, Quora, Hootsuite, Heroku, Assembla and Codespaces among them. The reason for the outage is failing EBS (Elastic Block Store) volumes, which also power the Relational Database Service, in multiple Availability Zones of the US-EAST data center in Virginia. It is probable that the resilience and recovery mechanisms that kicked in after the network problems are what overloaded the EBS controllers.
8:54 AM PDT: A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. -- from Amazon AWS Dashboard
News sites like eWeek, InformationWeek and CNN picked up the story quickly. GigaOm discussed the situation of the equally vulnerable PaaS providers (Heroku, EngineYard and DotCloud) that leverage EC2.
Today, April 21, at 1:41 AM PDT, Amazon's AWS status page reported: "We are currently investigating latency and error rates with EBS volumes and connectivity issues reaching EC2 instances in the US-EAST-1 region." As of 1:48 PM PDT the issue has still not been fully resolved.
Besides the uncanny timing (in the Terminator movies, Skynet's attack was scheduled for April 21, 2011) and plenty of helpful hints offered to Amazon's engineers on Twitter, there have been some thorough responses to the unexpected outage.
@scottmcnealy: I said the Network is the Computer, I did not say it had 100% uptime.
@torrenegra: Today is Terminator's Judgment Day (4/21/2011). Skynet was supposed to kill us all. Fortunately for us Skynet runs on Amazon EC2.
@Nicolethebear: Dear Amazon EC2 - have you tried turning it on & off again?
Usually the different Availability Zones within one EC2 Region are not affected by each other, as they are physically separated data centers with optimized connections to ensure low latency. Architecting systems to span multiple AZs should therefore provide enough risk management to compensate for the outage of one or more of those zones. Since this outage hit multiple zones at once, the availability guarantees of those zones were questioned by several sources. PCWorld discusses this with Gartner analyst Drue Reeves and Reuven Cohen, founder and CTO of Enomaly. Competing cloud provider DotCloud, which also relies on Amazon EC2, reports their experience with the failure and points out technical issues with disaster recovery.
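To make the multi-AZ approach concrete, here is a minimal sketch using the Python boto library that spreads instances across all available zones of a region. The AMI ID, key pair and instance type are placeholders rather than details from any of the deployments mentioned above, and a real setup would also need load balancing and data replication across the zones.

    # Sketch: distribute EC2 instances across the Availability Zones of a region.
    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')

    # Only use zones that EC2 currently reports as available.
    zones = [z.name for z in conn.get_all_zones() if z.state == 'available']

    instances = []
    for zone in zones:
        reservation = conn.run_instances(
            'ami-12345678',            # placeholder AMI
            key_name='my-keypair',     # placeholder key pair
            instance_type='m1.small',
            placement=zone)            # pin this instance to the zone
        instances.extend(reservation.instances)

    for instance in instances:
        print('%s launched in %s' % (instance.id, instance.placement))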
Netflix engineers are quoted in a Hacker News thread as having had few issues with this problem because their deployment spans multiple Availability Zones ("Netflix is deployed in three zones, sized to lose one and keep going. Cheaper than cost of being down.").
Keith from backdrift.org gives some simple and effective advice on how to cope with such downtime: for instance, using configuration management systems such as Puppet for image setup and updates, synchronizing your cloud-based data, and securing your DNS configuration. A post by Clay Loveless details this further.
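One concrete form of the "synchronize your cloud-based data" advice is to take regular EBS snapshots, since snapshots are stored in S3 and are not tied to a single Availability Zone. The following boto sketch is only illustrative; the volume IDs are placeholders, and a real setup would run from cron and expire old snapshots.

    # Sketch: periodic EBS snapshots so data can be restored in another zone.
    import datetime
    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')

    # Placeholder IDs of the EBS volumes that hold application data.
    data_volumes = ['vol-11111111', 'vol-22222222']

    timestamp = datetime.datetime.utcnow().strftime('%Y-%m-%d-%H%M')
    for volume_id in data_volumes:
        snapshot = conn.create_snapshot(
            volume_id, description='backup %s %s' % (volume_id, timestamp))
        print('started snapshot %s for %s' % (snapshot.id, volume_id))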
For early status updates about AWS issues, following @ylastic was recommended by Eric Hammond (Alestic), who also describes how to get affected servers back online.
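Hammond's write-up is worth reading in full. Purely to illustrate the general idea, and not necessarily his exact steps, a server whose EBS volume is stuck in the troubled zone can often be rebuilt by creating a fresh volume from the latest snapshot in a healthy zone and attaching it to a replacement instance; all IDs below are placeholders.

    # Sketch: restore an affected EBS volume into an unaffected Availability Zone.
    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')

    AFFECTED_VOLUME = 'vol-11111111'     # placeholder: volume stuck in the troubled AZ
    REPLACEMENT_INSTANCE = 'i-22222222'  # placeholder: instance in a healthy AZ
    HEALTHY_ZONE = 'us-east-1d'          # placeholder: a zone not hit by the outage

    # Pick the most recent completed snapshot of the affected volume.
    snapshots = [s for s in conn.get_all_snapshots(owner='self')
                 if s.volume_id == AFFECTED_VOLUME and s.status == 'completed']
    latest = sorted(snapshots, key=lambda s: s.start_time)[-1]

    # Recreate the volume in the healthy zone and attach it to the new instance.
    # (In practice, wait until the volume reports the 'available' state first.)
    new_volume = conn.create_volume(latest.volume_size, HEALTHY_ZONE, snapshot=latest.id)
    conn.attach_volume(new_volume.id, REPLACEMENT_INSTANCE, '/dev/sdf')
    print('created %s from %s in %s' % (new_volume.id, latest.id, HEALTHY_ZONE))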
In the aftermath of today's event there will be many questions to answer about the reliability of cloud-based applications and the necessary architectural precautions and risk management, not just by Amazon but also by other cloud providers like VMware's Cloud Foundry or Google App Engine. Another topic will be the SLAs given by cloud providers: Amazon's SLA for EC2 promises 99.95% availability of external connectivity for multi-AZ deployments. Neither EBS nor RDS has an SLA.
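To put that figure in perspective, here is a quick back-of-the-envelope calculation of the downtime budget implied by a 99.95% availability guarantee:

    # Downtime budget implied by 99.95% availability.
    availability = 0.9995
    hours_per_year = 365 * 24

    downtime_hours_per_year = (1 - availability) * hours_per_year
    print('%.1f hours per year' % downtime_hours_per_year)                      # ~4.4 hours
    print('%.1f minutes per month' % (downtime_hours_per_year * 60 / 12))       # ~21.9 minutes

Measured against that budget, today's event, which had already lasted roughly twelve hours at the time of writing, dwarfs a whole year's allowance.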