At QCon San Francisco, Sangeeta Handa discussed how Netflix had learned to build resilience into production migrations across a number of use cases. Key lessons learned included: aim for perceived or actual zero downtime, even for data migrations; invest in early feedback loops to build confidence; find ways to decouple customer interactions from your services; and build independent checks and balances using automated data comparison and reconciliation.
Handa, billing infrastructure engineering manager at Netflix, opened the talk by noting that there are several known risks when migrating data as part of a change in functionality or operational requirements, including data loss, data duplication, and data corruption. These risks can lead to unexpected system behaviour on the migrated state, performance degradation post-migration, or unexpected downtime. All of these outcomes have a negative customer impact, and any migration plan should explicitly attempt to mitigate them.
With a large-scale, geographically distributed customer base, many organisations cannot afford to take systems fully offline during maintenance procedures, and therefore some variation of a "zero-downtime" migration is the only viable method of change. Variants of this approach that the Netflix team uses for migrations involving data include: "perceived" zero downtime, "actual" zero downtime, and no migration of state. Handa presented three use cases that explored each of these approaches.
The first use case discussed was a migration within the billing application infrastructure. This application handles recurring subscriptions, allows customers to view their billing history, and provides many other billing-related functions. The billing infrastructure is busy 24x7 because Netflix's 130+ million subscribers are distributed all over the globe -- downtime during any migration is therefore not an option. The plan for this first use case was to migrate the underlying data store from an Oracle database running in a Netflix data center to a MySQL service running in the AWS cloud. The migration involved billions of rows and over 8 TB of constantly changing data.
Initial experiments with snapshotting and reloading the data, and also with using Apache Sqoop, proved unsuccessful. Ultimately, a "perceived zero-downtime" migration approach was taken, which involved copying data at the record level, table by table, alongside the operational reads and writes undertaken on each table as part of the daily workloads. This approach allowed the migration to be paused at an arbitrary time, and for failures to be caught at the (business) record level. Confidence in the resilience of the migration was established by comparing daily automated snapshots from the source and target databases, by checking row counts, and by running test workloads and reports on the target database before fully switching over to it as the single source of truth.
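The talk did not include implementation details of these checks, but the kind of independent comparison described can be sketched as follows. The snippet compares per-table row counts and record-level contents between two snapshots, using in-memory dictionaries as stand-ins for the Oracle source and MySQL target; the table name, fields, and report format are illustrative assumptions rather than Netflix's actual tooling.

```python
# Illustrative sketch of an automated snapshot comparison. The dictionaries
# stand in for daily snapshots of the source (Oracle) and target (MySQL)
# databases; table and field names are hypothetical.

def compare_snapshots(source, target):
    """Return a per-table report of row-count and record-level mismatches."""
    report = {}
    for table, src_rows in source.items():
        tgt_rows = target.get(table, {})
        missing = [key for key in src_rows if key not in tgt_rows]
        mismatched = [key for key in src_rows
                      if key in tgt_rows and src_rows[key] != tgt_rows[key]]
        report[table] = {
            "source_count": len(src_rows),
            "target_count": len(tgt_rows),
            "missing_in_target": missing,
            "mismatched_records": mismatched,
        }
    return report

source_snapshot = {"billing_transactions": {1: {"amount": 9.99}, 2: {"amount": 13.99}}}
target_snapshot = {"billing_transactions": {1: {"amount": 9.99}}}

print(compare_snapshots(source_snapshot, target_snapshot))
# {'billing_transactions': {'source_count': 2, 'target_count': 1,
#   'missing_in_target': [2], 'mismatched_records': []}}
```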
During the "final flip", the new target MySQL database had to be fully synchronised with the source database before application data writes could continue, which would result in actual downtime. However, the team implemented perceived zero-downtime by allowing read-only access to the new MySQL database while the synchronisation process completed and by buffering writes within a (AWS SQS) queue. As soon as the database state was synchronised after the flip, the contents of the queue were replayed into the application, which in turn issued the appropriate writes against the new target database.
The second use case presented was the rewriting of the balance service, which included a migration of a smaller amount of data, also with a simpler structure, from a Cassandra database to a MySQL database. This was undertaken because MySQL was a better fit for the querying characteristics of the corresponding service.
This migration was achieved by first creating a "data integrator" service as a data access layer between the application service and the underlying databases, which could effectively shadow writes from the primary Cassandra instance across to the new MySQL data store. The data integrator was also responsible for reporting metrics based on any unexpected results or data mismatches between the two writes.
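Handa did not show code for the data integrator, but the shadow-write behaviour can be illustrated with a short sketch. In the snippet below the store and metrics classes are simple in-memory stand-ins, and the method and metric names are assumptions made purely for illustration.

```python
# Illustrative sketch of a data access layer that writes to the primary store
# (Cassandra) and shadows the same write to the new store (MySQL), emitting
# a metric when the two disagree. All classes and names are stand-ins, not
# Netflix's actual data integrator API.

class InMemoryStore:
    def __init__(self):
        self.rows = {}
    def put(self, key, value):
        self.rows[key] = value
        return value

class Metrics:
    def __init__(self):
        self.counters = {}
    def increment(self, name):
        self.counters[name] = self.counters.get(name, 0) + 1

class DataIntegrator:
    def __init__(self, cassandra, mysql, metrics):
        self.cassandra, self.mysql, self.metrics = cassandra, mysql, metrics

    def write_balance(self, customer_id, balance):
        primary_result = self.cassandra.put(customer_id, balance)  # source of truth
        try:
            shadow_result = self.mysql.put(customer_id, balance)   # shadow write
            if shadow_result != primary_result:
                self.metrics.increment("balance.shadow_write.mismatch")
        except Exception:
            self.metrics.increment("balance.shadow_write.error")   # never fail the caller
        return primary_result

integrator = DataIntegrator(InMemoryStore(), InMemoryStore(), Metrics())
integrator.write_balance("customer-123", 25.00)
```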
The second part of the migration process involved asynchronously "hydrating" data from regular Cassandra snapshots into MySQL. The data integrator would attempt to read data from MySQL, and fall back to Cassandra (which was still the source of truth) if the data had not yet been migrated. Writes continued to be shadowed to both data stores. Metrics on unexpected issues were again captured, and regular comparisons and reconciliations were made between the data stores. Any data with discrepancies was re-migrated from Cassandra to MySQL, with a complete rewrite of a customer's data in MySQL even if only a single issue was found.
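The read-with-fallback and reconciliation behaviour can be sketched as follows, again using plain dictionaries as stand-ins for the two data stores; the customer identifiers and values are purely illustrative.

```python
# Sketch of the read path during hydration: MySQL is consulted first, with a
# fallback to Cassandra (still the source of truth) for customers whose data
# has not yet been hydrated. On any discrepancy, the customer's data is
# rewritten wholesale into MySQL. All identifiers are illustrative.

cassandra = {"customer-123": 25.00, "customer-456": 10.00}   # source of truth
mysql = {"customer-123": 25.00}                               # partially hydrated

def read_balance(customer_id):
    if customer_id in mysql:
        return mysql[customer_id]
    return cassandra.get(customer_id)     # fall back until hydration completes

def reconcile(customer_id):
    """Rewrite the customer's data in MySQL if any discrepancy is found."""
    if mysql.get(customer_id) != cassandra.get(customer_id):
        mysql[customer_id] = cassandra[customer_id]

print(read_balance("customer-456"))   # 10.0, served from the Cassandra fallback
reconcile("customer-456")             # hydrates the missing record into MySQL
```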
This hydrate, shadow and compare process was continued until the required confidence of correctness had been achieved. The "final flip" in this migration simply involved making MySQL the single source of truth within the data integrator, and removing the underlying Cassandra data store.
The final migration use case that was discussed was the rewriting of the Netflix invoicing system. This required the creation of an entirely new path of functionality and a corresponding service that would also require a new data store. The existing invoice data set was extremely large and had a high rate of change. The plan was to move the data from the existing MySQL database to a new Cassandra instance, while not impacting a customer's ability to interact with the invoicing functionality.
After analysis and several design discussions, the team realised that for this migration the corresponding data did not have to be "migrated" per se. The decision was taken to build the new service using the new Cassandra database, and run this in parallel with the "legacy" service and data store. By placing a "billing gateway" in front of the two services, a programmatic decision could be taken on which invoice service to call, based on whether a customer's invoicing data had been migrated or not.
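The routing decision made by the billing gateway can be illustrated with a short sketch; the service functions and the set of migrated customers shown below are assumptions used only to demonstrate the pattern.

```python
# Illustrative sketch of the billing gateway's routing decision: customers
# whose invoice data has already been moved are served by the new service,
# everyone else continues to hit the legacy service. Names are hypothetical.

migrated_customers = {"customer-789"}     # grows as cohorts of data are moved

def legacy_invoice_service(customer_id):
    return f"legacy invoice for {customer_id}"

def new_invoice_service(customer_id):
    return f"new invoice for {customer_id}"

def billing_gateway(customer_id):
    if customer_id in migrated_customers:
        return new_invoice_service(customer_id)
    return legacy_invoice_service(customer_id)

print(billing_gateway("customer-789"))    # routed to the new service
print(billing_gateway("customer-123"))    # still served by the legacy service
```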
Handa and her team "canary released" the new service by incrementally migrating small subsets of data from the old system to the new (for example, copying the invoice data of all users from a small country). Confidence in the resilience of the migration was built by testing in "shadow mode", where writes were shadowed across the two services and the results compared and reconciled at the data store level. Extensive monitoring and alerting were used to catch unexpected behaviour and state. The effective "migration" of the data took over a year to achieve as part of the operation of the normal invoicing functionality, but there was zero downtime and no impact to the customer.
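This incremental "rollout small, learn, repeat" approach can be sketched as a simple loop over cohorts, where each cohort is migrated and verified before the next one is attempted; the cohort definitions and the migrate and reconcile helpers below are illustrative stand-ins for the real tooling.

```python
# Sketch of the incremental rollout: migrate one small cohort at a time (for
# example, all customers in a small country), verify it with shadow-mode
# comparisons, and only then expand. Cohorts and helpers are hypothetical.

cohorts = [
    {"name": "small-country-pilot", "customers": ["c-1", "c-2"]},
    {"name": "larger-region",       "customers": ["c-3", "c-4", "c-5"]},
]

def migrate(customer_id):
    return True   # stand-in for copying the customer's invoice data

def reconcile(customer_id):
    return True   # stand-in for the shadow-mode comparison checks

for cohort in cohorts:
    verified = all(migrate(c) and reconcile(c) for c in cohort["customers"])
    if not verified:
        print(f"halting rollout: discrepancies found in {cohort['name']}")
        break     # stop, investigate, and fix before expanding further
    print(f"cohort {cohort['name']} migrated and verified")
```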
Summarising the talk, Handa shared the key lessons that her team had learned: aim for zero downtime, even for data migration; invest in early feedback loops to build confidence; find ways to decouple customer interactions from your services; build independent checks and balances through automated data reconciliation; and canary release new functionality: "roll out small, learn, repeat".
The slides of Sangeeta Handa's QCon SF talk, "Building Resilience in Netflix Production Migrations" (PDF), can be found on the conference website, and the video recording will be made available on InfoQ over the coming months.