Netflix recently announced, through their company blog, that their journey to the cloud is finally complete after more than 7 years. Netflix began this journey in 2008, after a major database corruption incident impaired their ability to distribute physical DVDs to their subscribers.
The database corruption incident prompted Netflix to change the way they looked at their architecture. They moved away from vertically scaled single points of failure towards horizontally scalable, distributed systems in the cloud.
Yury Izrailevsky, vice president, cloud and platform engineering, describes why they chose to use Amazon Web Services as their cloud provider: “We chose Amazon Web Services (AWS) as our cloud provider because it provided us with the greatest scale and the broadest set of services and features.”
The journey to the cloud was a lengthy one for Netflix. Initially, they focused on moving all customer-facing services to the cloud. This allowed Netflix to take advantage of Amazon’s regions located around the world, which helps them serve customers in 130 countries. Once the customer-facing services had been moved, Netflix turned to billing systems and employee data management. The last remaining data center services used by the streaming service were shut down in January 2016.
From December 2007 to December 2015, Netflix experienced more than 1000x growth. Izrailevsky attributes Netflix’s ability to on-board and support new customers to their use of the cloud: “Supporting such rapid growth would have been extremely difficult out of our own data centers; we simply could not have racked the servers fast enough. Elasticity of the cloud allows us to add thousands of virtual servers and petabytes of storage within minutes, making such an expansion possible.”
Not only has the cloud allowed Netflix to scale, but it has also allowed them to improve their availability. Netflix initially experienced some “rough patches”, but availability has improved since then, and they are now approaching the 99.99% uptime goal they have set for themselves. Netflix has increased their availability by building highly reliable services out of traditionally unreliable components, which they achieved through the use of redundant cloud components.
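To put the 99.99% figure in perspective, the arithmetic behind an uptime target and behind redundancy can be sketched in a few lines of Python. This is an illustrative calculation only, not something taken from Netflix: the function names are hypothetical, the 99% per-component figure is an invented example, and the redundancy formula assumes replicas fail independently of one another, which real systems only approximate.

```python
def allowed_downtime_minutes(availability: float, period_hours: float = 365 * 24) -> float:
    """Downtime budget, in minutes, implied by an availability target over a period."""
    return (1.0 - availability) * period_hours * 60


def redundant_availability(component_availability: float, replicas: int) -> float:
    """Availability of a group of replicas that is up while at least one replica is up.

    Assumes (hypothetically) that replicas fail independently of each other.
    """
    return 1.0 - (1.0 - component_availability) ** replicas


if __name__ == "__main__":
    # A 99.99% target leaves a budget of roughly 53 minutes of downtime per year.
    print(round(allowed_downtime_minutes(0.9999), 2))
    # Two independent 99%-available replicas together reach about 99.99%.
    print(round(redundant_availability(0.99, 2), 6))
```

Under these assumptions, each additional independent replica multiplies the probability of a total outage by the failure probability of a single component, which is how unreliable parts can add up to a reliable service.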
To test and validate their redundancy strategy, Netflix runs routine tests in production using the Simian Army. Part of this strategy is Chaos Monkey, which regularly imposes conditions that may force components to fail. This ensures that failure points are exposed early and often, so that engineering teams can address issues through controlled exercises instead of discovering them during unplanned outages.
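The idea of deliberately injecting failures can be illustrated with a toy simulation. The sketch below is not Chaos Monkey and does not use any Netflix API: the Replica, Service, inject_failure and run_experiment names are all hypothetical, and the "service" is just a list of in-memory objects. It only shows the principle of disabling a random healthy replica and then checking that requests are still served.

```python
import random


class Replica:
    """A hypothetical in-memory stand-in for one redundant instance of a service."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """Routes each request to the first replica that responds."""

    def __init__(self, replica_count: int) -> None:
        self.replicas = [Replica(f"replica-{i}") for i in range(replica_count)]

    def handle(self, request: str) -> str:
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError:
                continue  # fail over to the next replica
        raise RuntimeError("all replicas are down")


def inject_failure(service: Service, rng: random.Random):
    """Disable one randomly chosen healthy replica; return it, or None if none are left."""
    healthy = [r for r in service.replicas if r.healthy]
    if not healthy:
        return None
    victim = rng.choice(healthy)
    victim.healthy = False
    return victim


def run_experiment(replica_count: int = 3, failures: int = 2, seed: int = 0) -> int:
    """Inject failures one at a time and verify the service still answers after each.

    Returns the number of replicas still healthy; raises RuntimeError if the
    service could not survive the injected failures.
    """
    rng = random.Random(seed)
    service = Service(replica_count)
    for _ in range(failures):
        inject_failure(service, rng)
        service.handle("ping")
    return sum(r.healthy for r in service.replicas)
```

Calling run_experiment(3, 2) completes because one replica survives, while run_experiment(2, 2) raises RuntimeError. That second outcome is the kind of weakness a controlled exercise is meant to surface before a real outage does.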
While it took Netflix more than 7 years to make the transition to the cloud, they did it methodically and did not resort to a lift-and-shift strategy. Netflix felt that, to truly benefit from the cloud, they needed to rebuild their systems around cloud-based components instead of bringing data center shortcomings to the cloud. Izrailevsky further explains: “we chose the cloud-native approach, rebuilding virtually all of our technology and fundamentally changing the way we operate the company. Architecturally, we migrated from a monolithic app to hundreds of micro-services, and denormalized our data model, using NoSQL databases.”