BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage News Scaling Uber’s Batch Data Platform: a Journey to the Cloud with Data Mesh Principles

Scaling Uber’s Batch Data Platform: a Journey to the Cloud with Data Mesh Principles

This item in japanese

Some months ago, Uber started a migration of its batch data analytics and machine learning platform to Google Cloud Platform (GCP). In a recent post on its engineering blog, Uber provided additional information regarding its batch data cloud migration, which incorporated crucial data mesh principles.

Uber’s batch data platform is a crucial element of its data infrastructure, supporting over 10,000 internal users, including data scientists, engineers, city operations staff, and business analysts. This system manages approximately 1.5 exabytes of data stored in Apache Hadoop Distributed File System (HDFS) across two on-premises regions, handling over 500,000 Presto queries and 370,000 Apache Spark applications daily. To enhance scalability and streamline operations, Uber is migrating its batch data platform to Google Cloud GCP. The transition leverages Google Cloud Storage (GCS) for Uber’s data lake while moving the rest of the infrastructure to cloud-based Infrastructure as a Service (IaaS).

The cloud migration was not without its challenges, particularly related to the storage and Identity and Access Management (IAM) limits imposed by cloud providers. One of Uber’s primary considerations is to efficiently map HDFS files to GCS buckets without overwhelming or underutilizing the available resources. Additionally, Uber must apply access controls appropriately within the storage hierarchy, ensuring the system remains secure without unnecessarily elevating user privileges. This migration also presents an opportunity to enhance the system by consolidating security groups and decentralizing data ownership. Under the new model, data belonging to specific organizations is stored in organization-specific buckets, allowing data ownership and access control to be more clearly delineated.

Security and governance have been central concerns throughout the migration. The goal is to map data based on its intended usage and lifecycle, categorizing it to ensure appropriate access controls are applied. Critical datasets that are used widely within the organization are stored in dedicated buckets with open access, while less critical data is stored separately with restricted access and lifecycle management policies. Additionally, the migration enables automated infrastructure setup, facilitating quicker provisioning of testing, staging, and production environments. This automation helps speed up new data analytics use cases and ensures multiple regions are ready for disaster recovery scenarios.

To facilitate this massive shift, Uber developed a service called DataMesh, which is designed to abstract and manage the cloud infrastructure. DataMesh organizes data resources around the principles of data mesh, focusing on decentralized data ownership and domain-specific control. This service automates the reconciliation of data with cloud resources, pulling information from Uber’s internal repositories to ensure that data is correctly labeled, secured, and monitored. The DataMesh platform also manages the mapping of HDFS paths to their corresponding cloud-based paths, making the migration as seamless as possible for users and preventing disruption to existing workflows.

 

A logical view of the DataMesh components and the hierarchy.

One significant challenge that Uber has faced during the migration process is the need to accommodate changes in data ownership and the limits set by GCS. Data ownership changes can occur due to team reorganizations or users reassigning assets. To address this, Uber implemented an automated process to monitor and reassign ownership when necessary, ensuring data remains securely stored and managed. Additionally, Uber optimized its data distribution to avoid hitting GCS storage limits, ensuring that heavily used tables are separated into their buckets to improve performance and make monitoring easier.

Other examples of data mesh implementation in the real world are:

  • Gilead Sciences: a biopharmaceutical company that developed a data mesh architecture. The implementation aimed to create a new organizational and operational model backed by data, enabling Gilead to engage in a cloud transformation initiative. The data mesh approach allowed Gilead to manage data as a product and adopt a cloud-first architecture.
  • Saxo Bank: a financial services company, that implemented data mesh into a project aimed to decentralize data ownership and governance, enabling domain teams to manage their data products and provide real-time insights to the business.

Looking to the future, Uber aims to further expand its use of data mesh principles by building a platform that allows for self-governed data domains. This will simplify infrastructure management and enhance data governance, ultimately creating a more agile, secure, and cost-efficient data ecosystem. The cloud migration of Uber’s batch data platform is a significant undertaking, but through careful planning and the development of innovative tools like DataMesh, Uber is positioning itself for greater scalability, security, and operational efficiency in the cloud.

About the Author

BT