Key Takeaways
- Executing a large migration requires not only developing the new system but also designing multiple precautionary and mitigation mechanisms.
- Developing a robust shadow and validation mechanism is essential for read and write APIs and events published to the message bus.
- Developing a mechanism to drain traffic between the old and new systems, with the relevant business context, is critical to achieving zero downtime.
- The backward compatibility layer significantly de-risks large migrations.
- Using a cloud database introduces special considerations: careful warm-up and redundancy are required to keep the migration smooth and reliable.
In large-scale distributed systems, migrating critical systems from one architecture to another is technically challenging and involves a delicate migration process. Uber operates one of the most intricate real-time fulfillment systems globally. This article covers the techniques used to migrate such a workload from on-prem to a hybrid cloud architecture with zero downtime and no business impact.
System Complexity
Uber’s fulfillment system is real-time, consistent, and highly available. Users constantly engage with the app, initiating, canceling, and modifying trips.
Restaurants continually update order statuses while couriers navigate the city to deliver packages. The system executes over two million transactions per second to manage the states of our users and trips on the platform.
I was the engineering leader driving the last redesign of this fulfillment system from an on-prem system to a hybrid cloud architecture. Beyond the code and validation required to develop a new system, the most challenging part of the project was to design a migration strategy that involved zero downtime or customer impact.
Architecture of the Previous System
Let me simplify the previous fulfillment system. It consisted of a few services maintaining the state of different user and trip entities in memory. Additionally, there were supporting services to manage locking, searching, and storage systems to support cross-datacenter replication.
All rider transactions were routed to the "demand" service, and all driver transactions were routed to the "supply" service. The two systems were kept in sync with distributed transaction techniques.
These services operated as a single unit called a pod. A pod is a self-sufficient unit with many services interacting with each other to power fulfillment for a single city. Once a request enters a pod, it remains within the services in the pod unless it needs access to data from services outside the pod.
In the image below, the green and blue service groups represent two pods, with a replica of each service within a pod. Traffic from city 1 is routed to pod 1, and traffic from city 2 is routed to pod 2.
Furthermore, the system facilitated distributed transactions among services A, B, and C to ensure synchronization of entities stored within them, employing the saga pattern. Entity consistency was maintained within each service through in-memory data management and serialization.
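To illustrate the pattern, here is a minimal saga sketch in Go. The step and compensation functions are placeholders, not Uber's actual services; the point is that a failed forward action triggers compensations in reverse order so entities across services converge instead of being left half-updated.

```go
package saga

import "context"

// Step pairs a forward action with a compensating action that undoes it.
type Step struct {
	Name       string
	Action     func(ctx context.Context) error
	Compensate func(ctx context.Context) error
}

// Run executes steps in order. If any step fails, previously completed
// steps are compensated in reverse order so the participating services
// converge back to their prior state instead of being left half-updated.
func Run(ctx context.Context, steps []Step) error {
	completed := make([]Step, 0, len(steps))
	for _, s := range steps {
		if err := s.Action(ctx); err != nil {
			for i := len(completed) - 1; i >= 0; i-- {
				// Best effort: compensation failures would need retries
				// or manual reconciliation in a real system.
				_ = completed[i].Compensate(ctx)
			}
			return err
		}
		completed = append(completed, s)
	}
	return nil
}
```

A cross-service update, such as assigning a driver to a trip, could then be expressed as one step per service, each with its own compensation.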
While this system scaled well in Uber's early phase, some of its architectural choices could not keep up with Uber's growing demands. First, the system was designed to be available rather than consistent: with different entities stored in multiple services, the system was only eventually consistent, and there was no true ACID-compliant store to make changes across them. Second, since all the data was maintained in memory, the system had inherent limits on vertical scaling and on the number of nodes for a given service.
Redesigned System
The new system consisted of fewer services. The services that stored entities like "demand" and "supply" were consolidated into a single application backed by a cloud datastore. All the transactional responsibilities were moved down into the datastore layer.
New System and Existing Service Consumers
A multi-dimensional strategy covering various aspects of migration is needed for such critical migrations. In microservices environments, every system interacts with many systems around it. The two primary ways of interaction are APIs and events published to a message bus by the services.
The new system modified all the core data models, API contracts, and event schemas. With hundreds of API callers and event consumers, migrating them all simultaneously was not an option. We adopted the strategy of a "backward compatibility" layer that maintained the existing API and event contracts.
Creating a backward compatibility layer allows a system to be rearchitected while all consumers continue to consume the old interfaces. Over the next few years, consumers could gradually move to the new APIs and events without being coupled to this redesign. Similarly, any consumer of events from the old system would continue receiving those events in the old schema, published by the backward compatibility layer.
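As a rough sketch of what such a layer can look like, the hypothetical Go adapter below serves a legacy trip-status contract while fetching data from the new system. The type names, fields, and status mapping are illustrative assumptions, not the real contracts.

```go
package compat

import "context"

// OldTripStatusResponse is the legacy API shape callers already depend on
// (illustrative only; not Uber's real contract).
type OldTripStatusResponse struct {
	TripID string
	Status string // e.g. "EN_ROUTE"
}

// newTrip is the redesigned system's representation of a trip.
type newTrip struct {
	ID    string
	Phase string // e.g. "ON_TRIP"
}

type newFulfillmentClient interface {
	GetTrip(ctx context.Context, tripID string) (newTrip, error)
}

// CompatLayer exposes the old API contract while delegating to the new system.
type CompatLayer struct {
	newClient newFulfillmentClient
}

// GetTripStatus serves the legacy contract: callers keep their existing
// integration while the data comes from the redesigned system.
func (c *CompatLayer) GetTripStatus(ctx context.Context, tripID string) (OldTripStatusResponse, error) {
	t, err := c.newClient.GetTrip(ctx, tripID)
	if err != nil {
		return OldTripStatusResponse{}, err
	}
	return OldTripStatusResponse{
		TripID: t.ID,
		Status: mapPhaseToLegacyStatus(t.Phase),
	}, nil
}

func mapPhaseToLegacyStatus(phase string) string {
	switch phase {
	case "ON_TRIP":
		return "EN_ROUTE"
	case "COMPLETED":
		return "DROPPED_OFF"
	default:
		return "UNKNOWN"
	}
}
```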
Lifecycle of Entities on the Platform
What makes this transition complex is that the entities are continuously changing states. Considering the three entities in the system - a rider, a driver, and a trip - each can start and stop at different points in time. The previous system held these entities in different in-memory services, and the new system had to reflect them all in a database.
The migration challenge is to ensure that every entity state and its relationships are preserved in this transition while actively changing at a high rate.
Migration Strategy
Multiple strategies must be adopted at each stage - pre-rollout, during rollout, and post-rollout - to move traffic from the existing system to the new one. I will go over each in detail in the following sections of the article.
Before Rollout
Shadow validation
Having a system to validate the API contracts between the old and the new system is critical to gaining confidence that the new system is at parity with the existing one. The backward compatibility layer provides the key functionality of validating API and event parity between the old and the new systems.
Every request is sent to both the old and the new system. The responses from both systems are compared per key and value, and the differences are published into an observability system. Before each launch, the goal is to ensure that there are no differences between the two responses.
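Below is a minimal sketch of that per-key comparison, assuming JSON responses; in practice, fields that legitimately differ (timestamps, generated IDs) would be excluded before comparing, and the mismatched keys would be emitted as metrics to the observability system.

```go
package shadow

import (
	"encoding/json"
	"reflect"
)

// Diff compares the old and new systems' responses key by key and returns
// the keys whose values disagree, which can then be published as
// parity-mismatch metrics.
func Diff(oldResp, newResp []byte) ([]string, error) {
	var oldFields, newFields map[string]interface{}
	if err := json.Unmarshal(oldResp, &oldFields); err != nil {
		return nil, err
	}
	if err := json.Unmarshal(newResp, &newFields); err != nil {
		return nil, err
	}

	var mismatched []string
	for key, oldVal := range oldFields {
		if newVal, ok := newFields[key]; !ok || !reflect.DeepEqual(oldVal, newVal) {
			mismatched = append(mismatched, key)
		}
	}
	for key := range newFields {
		if _, ok := oldFields[key]; !ok {
			mismatched = append(mismatched, key)
		}
	}
	return mismatched, nil
}
```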
There are some nuances in a real-time system: because state is transient, not all calls will succeed with identical responses. One strategy is to review all successful calls as a cohort with high match rates and to ensure that the vast majority of API calls succeed. There are also valid situations where calls to the new system will be unsuccessful. For example, when a driver calls an API to go offline on their app, but the new system has no prior record of that driver going online, it will generate a client error. In these edge cases, the parity mismatch can be ignored, but with caution.
Further, read-only API parity is the easiest to validate because there are no side effects. For write APIs, we introduced a system where the old system records the responses of its dependency calls in a shared cache, and the new system replays them. Because each request carries a trace header, the new system transparently picks up the responses originally received by the old system from the cache instead of making the external call again. This keeps the responses and fields from dependent systems consistent, allowing us to better match the new and old systems without interference from external dependencies.
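The sketch below shows one way such record-and-replay can be wired around a dependency call, keyed by the trace ID. The Cache interface and the wiring are assumptions for illustration, not the actual implementation.

```go
package replay

import "context"

// Cache is a shared store keyed by trace ID plus dependency name;
// Redis or a similar cache could back this interface.
type Cache interface {
	Get(ctx context.Context, key string) ([]byte, bool)
	Put(ctx context.Context, key string, val []byte) error
}

// DependencyClient wraps an outbound call so the old system records its
// responses and the new (shadow) system replays them for the same trace.
type DependencyClient struct {
	cache    Cache
	call     func(ctx context.Context, req []byte) ([]byte, error)
	isShadow bool // true when running inside the new system
}

// Call records the dependency response when invoked from the old system and
// replays it when invoked from the new system, so both systems observe the
// same external data for the same traced request.
func (d *DependencyClient) Call(ctx context.Context, traceID, dependency string, req []byte) ([]byte, error) {
	key := traceID + ":" + dependency
	if d.isShadow {
		if resp, ok := d.cache.Get(ctx, key); ok {
			return resp, nil // replay what the old system saw
		}
	}
	resp, err := d.call(ctx, req)
	if err != nil {
		return nil, err
	}
	if !d.isShadow {
		_ = d.cache.Put(ctx, key, resp) // best-effort record for the shadow
	}
	return resp, nil
}
```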
E2E Integration testing
Large rewrites are the right moment to improve end-to-end (E2E) test coverage. Doing so protects the new system as it scales up traffic and new code is routinely launched into it. During this migration, we grew our suite to over 300 E2E tests.
E2E tests should be run at multiple points in the development lifecycle: when an engineer writes code, during pre-deploy validation of the build, and continuously in production.
Blackbox testing in canary
Bad code can be prevented from reaching critical systems by exposing it to only a small subset of requests, reducing the blast radius. To go further, the code can first be exposed only to a pre-production (canary) environment with continuously running E2E tests. Only if the canary environment passes can the deployment system proceed further.
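A deployment gate of this kind can be expressed very simply. The E2ERunner interface and the promotion hook below are hypothetical, but they capture the rule that only a fully passing canary run unblocks the rollout.

```go
package deploy

import (
	"context"
	"fmt"
)

// E2ERunner executes the black-box test suite against a given environment
// and reports how many tests passed.
type E2ERunner interface {
	Run(ctx context.Context, env string) (passed, total int, err error)
}

// PromoteIfCanaryHealthy deploys only to the canary environment first and
// promotes the build to production solely when the full E2E suite passes.
func PromoteIfCanaryHealthy(ctx context.Context, runner E2ERunner, promote func(ctx context.Context) error) error {
	passed, total, err := runner.Run(ctx, "canary")
	if err != nil {
		return fmt.Errorf("canary E2E run failed: %w", err)
	}
	if passed < total {
		return fmt.Errorf("canary blocked promotion: %d/%d E2E tests passed", passed, total)
	}
	return promote(ctx)
}
```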
Load testing
A comprehensive load test must be undertaken before significant production load is placed on the new system. With the ability to redirect traffic to the new system via the backward compatibility layer, we could drive full production-level traffic against it. Custom load tests were also written to validate unique integrations, such as the database network paths.
Warming up database
Database or compute scaling with cloud providers is not instantaneous. When large loads are added suddenly, the scaling system cannot catch up quickly enough, and data re-balancing, compaction, and partition splitting in distributed data stores can cause hotspots. We ran synthetic data load against these systems to achieve a sufficient level of cache warming and partition splits. This ensured that when production traffic was pointed at them, they operated without sudden scaling demands.
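A warm-up driver can be as simple as the sketch below, which spreads synthetic writes and reads across the key space to trigger partition splits and fill caches. The key layout, row size, and counts are illustrative assumptions.

```go
package warmup

import (
	"context"
	"fmt"
	"math/rand"
)

// Store abstracts the cloud datastore being warmed up.
type Store interface {
	Write(ctx context.Context, key string, value []byte) error
	Read(ctx context.Context, key string) ([]byte, error)
}

// Run writes and reads synthetic rows spread across the key space so the
// datastore performs its partition splits and cache fills before real
// production traffic arrives.
func Run(ctx context.Context, store Store, partitions, rowsPerPartition int) error {
	payload := make([]byte, 1024) // representative row size
	for p := 0; p < partitions; p++ {
		for i := 0; i < rowsPerPartition; i++ {
			key := fmt.Sprintf("warmup/%04d/%08d", p, rand.Int63())
			if err := store.Write(ctx, key, payload); err != nil {
				return err
			}
			if _, err := store.Read(ctx, key); err != nil {
				return err
			}
		}
	}
	return nil
}
```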
Fallback network routes
Since the application and the database are hosted by two different providers, one on-prem and the other in the cloud, special attention has to be paid to managing the network topology. The two data centers had at least three network routes established to reach the database across the internet through dedicated network backbones. Validating and load testing these routes at over 3x the expected capacity was critical to ensure their reliability in production.
During Rollout
Traffic Pinning
In a marketplace, multiple apps interact with each other on the same trip: a driver's and a rider's views represent the same trip entity. When moving a rider's API calls to the new system, the corresponding trip and driver state cannot be left behind in the old system. We implemented routing logic that forces a trip to continue and end in the same system where it started. To make this determination, we start recording all consumer identifiers around 30 minutes before executing a migration, and requests for corresponding entities continue to be pinned to the same system.
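The essence of the pinning rule can be sketched as follows; the PinStore and the identifier scheme are illustrative assumptions, not the actual routing service.

```go
package routing

import "context"

// PinStore remembers which system first served an entity (rider, driver,
// or trip) once identifier recording begins ahead of a migration window.
type PinStore interface {
	Get(ctx context.Context, entityID string) (system string, ok bool)
	Set(ctx context.Context, entityID, system string) error
}

// Route decides where a request goes. Entities already pinned stay in the
// system where their trip started; unpinned entities follow the current
// rollout target for the city and are pinned on first sight.
func Route(ctx context.Context, pins PinStore, entityID, rolloutTarget string) (string, error) {
	if system, ok := pins.Get(ctx, entityID); ok {
		return system, nil
	}
	if err := pins.Set(ctx, entityID, rolloutTarget); err != nil {
		return "", err
	}
	return rolloutTarget, nil
}
```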
Rolling out in phases
The first set of trips migrated for a given city are the E2E tests. Next, idle riders and drivers not on an active trip are migrated to the new system. From that point on, any new trip is initiated only in the new system. During this window, the two systems serve the marketplace in parallel. As ongoing trips complete in the old system, the riders and drivers who become idle are progressively migrated to the new system. Within an hour, the entire city is drained to the new system. In the event of an outage in the new system, we can execute the same sequence in reverse.
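One way to think about this sequencing is as a small per-city state machine. The phase names below are illustrative, but the decision rule mirrors the rollout described above, and running the phases in reverse gives the rollback path.

```go
package rollout

// Phase models the per-city rollout sequence.
type Phase int

const (
	PhaseOldOnly      Phase = iota // all traffic stays in the old system
	PhaseTestTrips                 // only E2E test trips go to the new system
	PhaseIdleUsers                 // idle riders and drivers are moved over
	PhaseNewTripsOnly              // new trips start only in the new system
	PhaseDrained                   // all ongoing old-system trips have ended
)

// TargetSystem returns where a request should land for a given phase.
// onActiveOldTrip reflects the pinning rule: trips finish where they started.
func TargetSystem(p Phase, isTestTrip, onActiveOldTrip bool) string {
	switch {
	case p == PhaseOldOnly:
		return "old"
	case p == PhaseTestTrips && !isTestTrip:
		return "old"
	case onActiveOldTrip && p != PhaseDrained:
		return "old"
	default:
		return "new"
	}
}
```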
After Rollout
Observability and rollback
Robust observability across both systems is critical while executing a switch between two systems. We developed elaborate dashboards that show traffic draining from one system while the traffic picks up on the new system.
Our goal was to ensure that the key business metrics of trip volume, available supply, trip completion rates, and trip start rates remained unchanged throughout migration. The aggregate metrics should remain flat, while the mix of old vs new systems would change at the moment of migration.
Since Uber operates in thousands of cities with numerous functionalities, it is crucial to observe these metrics at the granularity of a city. It was not possible to observe all the metrics manually while migrating hundreds of cities, so specialized tools had to be built to watch these metrics and highlight the most deviant ones. Alerting alone was not enough, since a surge of alerts improperly tuned to each city's traffic would not build confidence in the migration. A purpose-built tool that let an operator analyze the health of every city was crucial to ensuring that all cities were operating well.
If one city seemed unhealthy, the operator could roll just the single city back to the old system while leaving all the other cities in the new one.
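As a sketch of the kind of ranking such a tool can perform, assuming a per-city baseline for each business metric, the following orders cities by relative drop so the operator reviews the most deviant ones first; the structure and ranking rule are illustrative.

```go
package health

import "sort"

// CityMetrics holds one business metric, such as trip completion rate,
// observed before and during migration for a single city.
type CityMetrics struct {
	City     string
	Baseline float64 // pre-migration value
	Current  float64 // value observed after the switch
}

// MostDeviant ranks cities by relative drop against their own baseline so an
// operator can focus on the handful that look unhealthy rather than scanning
// hundreds of dashboards.
func MostDeviant(cities []CityMetrics, topN int) []CityMetrics {
	sort.Slice(cities, func(i, j int) bool {
		return relativeDrop(cities[i]) > relativeDrop(cities[j])
	})
	if topN > len(cities) {
		topN = len(cities)
	}
	return cities[:topN]
}

func relativeDrop(c CityMetrics) float64 {
	if c.Baseline == 0 {
		return 0
	}
	return (c.Baseline - c.Current) / c.Baseline
}
```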
Blackbox tests in production
Similar to the "Blackbox testing in canary" strategy, the same tests should be run against production systems continuously to discover problems.
Executing Successful Migrations
It is crucial to have enough runway in the old system so that it does not distract the team from developing the new system. We spent four months hardening the reliability of the old system before starting development on the new one. Cross-project planning is vital: features are initially developed only in the old system; midway through the development of the new system, features are built in both systems in parallel; and in the later stages, they are built only in the new system. This juggling of stakeholders is critical to avoid being stuck maintaining two systems indefinitely.
When migrating systems of absolute criticality, which need strong consistency guarantees and are primary to business uptime, almost 40% of the engineering effort should go into observability, migration tooling, and robust rollback systems when building the new stack.
Vital technical elements involve complete control of API traffic at an individual user level, strong field-level parity and shadow-matching machinery, and robust observability across the systems. In addition, the moment of migration is a key engineering challenge fraught with race conditions and special handling that may require explicit business logic or product development.
Traffic migration to the new system typically remains flat, with slow growth in the initial iterations. Once there is confidence in the new system, it usually accelerates exponentially, as shown in the image below.
The challenge with accelerating earlier is that you are stuck maintaining two systems with equally important traffic volumes for a longer period, and any outage caused by scale in the new system will have an outsized impact on production users. It is best to validate all features against the smallest possible traffic volume for an extended period before scaling traffic to the new system over a relatively short duration.
At Uber, we executed this migration with fewer than a few thousand users impacted throughout development and migration. The techniques discussed here were cornerstone methods that enabled a smooth transition from our legacy system to a brand-new fulfillment system.