Uber recently shared how it re-architected its fulfilment service, one of the company's foundational platform services. Following a two-year effort involving 30+ teams and hundreds of developers, Uber engineers say that they "built a strong foundation for modelling various types of physical fulfilment categories in the new platform and migrated all existing transportation use cases." The platform handles more than a million concurrent users and billions of trips per year across over ten thousand cities.
Fulfilment is the "act or process of delivering a product or service to a customer." Uber engineers describe the reasons for re-architecting the fulfilment service:
The last major rewrite of the Fulfillment Platform was in 2014, when Uber's scale was much smaller, and the fulfilment scenarios were simpler. Since then, Uber's business has evolved to include many new verticals across Mobility and Delivery, handling a diverse set of fulfilment experiences. Some examples include reservation flows where a driver is confirmed upfront, batching flows with multiple trips offered simultaneously, virtual queue mechanisms at Airports, the three-sided marketplace for Uber Eats, delivering packages through Uber Direct, and more.
The team spent six months preparing for the move, "carefully auditing every product in the stack, gathering 200+ pages of requirements from stakeholder teams, extensively debating architectural options with tens of evaluation criteria, benchmarking database choices, and prototyping application frameworks options." Essential requirements for the new architecture were high availability with at least 99.99% adherence to SLAs, strong consistency even for transactions spanning regions, and a solid extensibility model for Uber developers to build on.
Uber decided to leverage NewSQL to satisfy the requirements of transactional consistency and horizontal scalability. NewSQL is a class of relational database management systems that seek to provide the scalability of NoSQL systems for online transaction processing (OLTP) workloads while maintaining the ACID guarantees of a traditional database system. This class includes databases such as CockroachDB and FoundationDB. Following an extensive benchmarking effort, Uber chose Google Spanner as its primary storage engine.
Source: https://eng.uber.com/fulfillment-platform-rearchitecture/
To support post-commit operations using the change data capture technique, Uber implemented an in-house component called Latent Asynchronous Task Execution (LATE). LATE writes all post-commit operations and timers to a separate LATE action table within the same read-write transaction. Dedicated LATE workers then scan this table, pick up rows, and guarantee at-least-once execution.
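Although LATE is an in-house component whose internals Uber has not published, the pattern it describes, committing a deferred action atomically with the business write, can be sketched with the Google Cloud Spanner Java client. The table names, columns, and `SEND_RECEIPT` action type below are hypothetical illustrations:

```java
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.Mutation;

public final class LateEnqueueExample {

  /**
   * Commits a business write and its deferred post-commit action in one
   * Spanner read-write transaction, so both succeed or fail together.
   */
  public static void completeTripWithReceipt(DatabaseClient db, String tripId) {
    db.readWriteTransaction().run(txn -> {
      // 1. The primary business write: mark the trip as completed.
      txn.buffer(Mutation.newUpdateBuilder("trips")
          .set("trip_id").to(tripId)
          .set("state").to("COMPLETED")
          .build());

      // 2. The deferred action, committed atomically with the write above.
      //    A pool of LATE workers scans this table and executes each row
      //    with at-least-once semantics, so handlers must be idempotent.
      txn.buffer(Mutation.newInsertBuilder("late_actions")
          .set("action_id").to(java.util.UUID.randomUUID().toString())
          .set("action_type").to("SEND_RECEIPT")
          .set("payload").to(tripId)
          .build());
      return null;
    });
  }
}
```

Because the action row and the business write share one transaction, a worker polling the action table can never observe an action whose originating write was rolled back.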
Extensibility by internal engineers is a crucial aspect of the new architecture. To support it, Uber engineers created a three-part programming model. First, they use statecharts to model fulfilment entities, keeping behaviour modelling consistent and modular. According to Uber engineers, they "formalized and documented the principles of modelling a fulfilment entity by leveraging a consistent data modelling approach with Protobufs and building a generic Java framework for implementing statecharts."
Source: https://eng.uber.com/fulfillment-platform-rearchitecture/
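To make the statechart idea concrete, here is a minimal, flat sketch in Java of how a fulfilment entity's lifecycle might be modelled; the entity, states, and transitions are hypothetical, and both statecharts in general and Uber's generic framework additionally support hierarchical and nested states that this sketch omits:

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** A toy statechart for a hypothetical trip-like fulfilment entity. */
public final class TripStatechart {

  public enum State { REQUESTED, MATCHED, EN_ROUTE, COMPLETED, CANCELED }

  // Each state declares the set of states it may legally transition to.
  private static final Map<State, Set<State>> TRANSITIONS =
      new EnumMap<>(Map.of(
          State.REQUESTED, EnumSet.of(State.MATCHED, State.CANCELED),
          State.MATCHED,   EnumSet.of(State.EN_ROUTE, State.CANCELED),
          State.EN_ROUTE,  EnumSet.of(State.COMPLETED, State.CANCELED),
          State.COMPLETED, EnumSet.noneOf(State.class),
          State.CANCELED,  EnumSet.noneOf(State.class)));

  private State current = State.REQUESTED;

  /** Applies a transition, rejecting any move the statechart does not allow. */
  public void transitionTo(State next) {
    if (!TRANSITIONS.get(current).contains(next)) {
      throw new IllegalStateException(
          "Illegal transition: " + current + " -> " + next);
    }
    current = next;
  }

  public State current() {
    return current;
  }
}
```

Centralizing the legal transitions in one declarative table is what makes the behaviour easy to document, validate, and reuse across entity types.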
Second, Business Transaction Coordinators handle writes that span multiple entities, keeping entity behaviour modular and reusable across different product flows. A coordinator takes a directed acyclic graph of entity triggers as input and orchestrates execution through the graph's nodes, each representing a single-entity trigger, within the scope of a single read-write transaction. Lastly, an ORM layer provides database abstractions, simplicity, and correctness.
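The following sketch illustrates that orchestration under the same caveat: the `EntityTrigger` interface, the Kahn-style topological traversal, and the `TransactionScope` placeholder are assumptions for illustration, not Uber's actual coordinator:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A toy coordinator that runs a DAG of single-entity triggers in order. */
public final class BusinessTransactionCoordinator {

  /** A trigger that mutates exactly one entity within the shared transaction. */
  public interface EntityTrigger {
    void apply(TransactionScope txn);
  }

  /** Placeholder for the single read-write transaction all triggers share. */
  public interface TransactionScope {}

  // Adjacency list: an edge a -> b means trigger a must run before b.
  private final Map<EntityTrigger, List<EntityTrigger>> edges = new HashMap<>();

  public void addEdge(EntityTrigger before, EntityTrigger after) {
    edges.computeIfAbsent(before, k -> new ArrayList<>()).add(after);
    edges.computeIfAbsent(after, k -> new ArrayList<>());
  }

  /** Topologically sorts the DAG and applies every trigger in one scope. */
  public void execute(TransactionScope txn) {
    Map<EntityTrigger, Integer> inDegree = new HashMap<>();
    edges.keySet().forEach(t -> inDegree.putIfAbsent(t, 0));
    edges.values().forEach(succs ->
        succs.forEach(s -> inDegree.merge(s, 1, Integer::sum)));

    Deque<EntityTrigger> ready = new ArrayDeque<>();
    inDegree.forEach((t, d) -> { if (d == 0) ready.add(t); });

    int executed = 0;
    while (!ready.isEmpty()) {
      EntityTrigger next = ready.poll();
      next.apply(txn); // every trigger sees the same transaction scope
      executed++;
      for (EntityTrigger succ : edges.get(next)) {
        if (inDegree.merge(succ, -1, Integer::sum) == 0) {
          ready.add(succ);
        }
      }
    }
    if (executed != inDegree.size()) {
      throw new IllegalArgumentException("Trigger graph contains a cycle");
    }
  }
}
```

Running every trigger inside one transaction scope is what gives a multi-entity product flow all-or-nothing semantics.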
The team put significant effort into designing the migration of existing service users to the new service while maintaining backward and forward compatibility. Some of these efforts resulted in rigorous test frameworks for the new service. Uber engineers describe:
We built new, end-to-end integration tests for 200+ product flows and back-tested them on the existing stack to validate the correctness of the new stack relative to the old stack. We built a sophisticated, session-based shadow framework to compare the requests and responses between the old and new stacks. We created test cities that matched configurations of production cities to smoke test during development. All of these strategies ensured that we caught as many issues as possible and minimized the unknown unknowns.
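As a rough illustration of the shadow comparison (though not of Uber's session-based framework, whose details are unpublished), the sketch below mirrors a single request to both stacks and logs any response mismatch; the internal host names and the use of Java's built-in HTTP client are assumptions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** A toy shadow harness: send one request to both stacks, diff the answers. */
public final class ShadowCompare {

  private static final HttpClient CLIENT = HttpClient.newHttpClient();

  public static void shadow(String path, String body) throws Exception {
    HttpResponse<String> oldResp = call("https://old-stack.internal" + path, body);
    HttpResponse<String> newResp = call("https://new-stack.internal" + path, body);

    // The old stack's response is the one served to callers; the new stack's
    // response is compared out of band so mismatches never affect users.
    if (!oldResp.body().equals(newResp.body())) {
      System.err.printf("Shadow mismatch on %s: old=%s new=%s%n",
          path, oldResp.body(), newResp.body());
    }
  }

  private static HttpResponse<String> call(String url, String body)
      throws Exception {
    HttpRequest request = HttpRequest.newBuilder(URI.create(url))
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
    return CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
  }
}
```

A session-based framework like Uber's would additionally group all the requests of one user session and compare them as a sequence, since a single response can only be judged in the context of the state earlier requests created.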
Uber had several motivations to embark on the re-architecture journey. First and foremost, Uber built the previous architecture on the premise that it should trade off consistency for availability and latency, so it achieved consistency only through a best-effort mechanism. In reality, inconsistencies occurred more often than expected due to various split-brain situations (e.g., during deploys or region failovers). In addition, multi-entity state changes were not transactional, so the system could be in an inconsistent state at any given point in time. This behaviour, in turn, made the system much harder to debug and reason about.
Other reasons for re-architecting the system included scalability concerns and a desire to modernize the tech stack: the old service was written in Node.js and communicated over HTTP/JSON, while newer Uber services use Java with Protocol Buffers (Protobuf) for communication.