Roblox Builds New Cellular Infrastructure to Improve Gaming Experience

The online game platform and creation system Roblox has detailed how it has made its infrastructure more efficient and resilient to support the demands of more than 70 million daily active users engaged in immersive gaming experiences. The company's blog post covers Roblox's commitment to reliability, its response to a major 2021 outage, and the ongoing transformation of its infrastructure.

In October 2021, Roblox suffered a system-wide outage lasting 73 hours, triggered by a small issue in one data center that rapidly escalated. Post-incident analysis prompted the team to intensify efforts to fortify its infrastructure against a range of failure causes, such as traffic spikes, weather conditions, hardware failures, software bugs, and human error. The focus has been on preventing issues in one component from spreading to the entire system, and on ensuring that the network or a user persistently retrying an action does not create load-related cascading failures.
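One common way to keep retries from amplifying load is to combine jittered exponential backoff with a retry budget that caps retries as a fraction of normal traffic. The sketch below is a generic illustration of that idea in Python, not Roblox's implementation; the `RetryBudget` class, the `call_with_retries` function, and all parameter values are hypothetical.

```python
import random
import time


class RetryBudget:
    """Allows retries only while they stay under a fraction of requests sent."""

    def __init__(self, ratio=0.1, min_retries=1):
        self.ratio = ratio              # max retries per original request (assumed value)
        self.min_retries = min_retries  # small allowance so low traffic can still retry
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        """Return True and count a retry if the budget allows one more."""
        allowed = self.min_retries + self.ratio * self.requests
        if self.retries + 1 <= allowed:
            self.retries += 1
            return True
        return False


def call_with_retries(op, budget, max_attempts=3, base_delay=0.05, sleep=time.sleep):
    """Call op(); on failure retry with jittered exponential backoff, within budget."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not budget.try_spend():
                raise  # give up instead of adding more load
            # Full jitter: sleep a random time up to the exponential cap.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

When the budget is exhausted, callers fail fast rather than piling more requests onto a struggling dependency, which is the behavior that limits load-related cascades.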

As an immediate response to the October 2021 outage, Roblox built a copy of its infrastructure in a new data center in a different region, in an active-passive configuration. This means the team can fail over the entire system onto the backup infrastructure in the event of a significant failure in the primary data center. This provided much-needed resilience, but the longer-term goal is to move from the active-passive setup to an active-active configuration, in which two data centers handle workloads simultaneously. The objective is higher reliability and near-instantaneous failover.

Roblox is also implementing a cellular infrastructure to establish robust "blast walls" within data centers, so that a single failure cannot take down an entire data center. Cells, or clusters of machines, offer redundancy and contain failures within a single cell. Roblox aims to migrate all services to cells for enhanced resilience and efficient workload management, whereby entire cells (each of which could contain 1,400 servers) can be repaired or completely reprovisioned if necessary. The process involves ensuring uniformity, requiring services to be containerized, and adopting an infrastructure-as-code philosophy for cells. Roblox's new deployment tool automatically ensures that services are striped across cells, relieving service owners of the need to think about replication.
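The article does not describe how Roblox's deployment tool decides placement. As a minimal sketch of what striping a service across cells could look like, the Python below spreads replicas round-robin so that no cell holds more than one replica above any other. The function names and cell names are hypothetical.

```python
def stripe_replicas(replicas, cells, offset=0):
    """Spread `replicas` across `cells` round-robin, so counts differ by at most one."""
    if not cells:
        raise ValueError("at least one cell is required")
    placement = {cell: 0 for cell in cells}
    for i in range(replicas):
        placement[cells[(offset + i) % len(cells)]] += 1
    return placement


def surviving_replicas(placement, failed_cell):
    """Replicas still running if one whole cell is lost."""
    return sum(n for cell, n in placement.items() if cell != failed_cell)


cells = ["cell-a", "cell-b", "cell-c"]  # hypothetical cell names
placement = stripe_replicas(replicas=7, cells=cells)
print(placement)                                # {'cell-a': 3, 'cell-b': 2, 'cell-c': 2}
print(surviving_replicas(placement, "cell-a"))  # 4
```

With placement handled this way, losing or reprovisioning any one cell removes only a bounded share of a service's capacity.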

Roblox compares cells to fire doors that stop a failure from spreading beyond a single cell. The goal is to make cells interchangeable, allowing faster recovery when issues occur. However, managing communication between cells presents challenges, as a core requirement is to prevent the "query of death", where the retries for a query cause a cascading failure. As short-term mitigations, the platform is deploying copies of computing services to each compute cell and load-balancing traffic across cells.

Long-term plans include a next-generation service mesh for service discovery and a method for directing dependent requests to a service in the same cell as the original caller. This will lower the risk of failures spreading from one cell to another. Cells now serve 70% of back-end traffic, and the aim is to reach 100%. Close to 30,000 servers are running the cells, which is less than 10% of the total server estate.
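The routing method itself is not detailed in the article. A minimal sketch of cell-aware endpoint selection, assuming a hypothetical `Endpoint` record and a policy that prefers the caller's cell and only crosses cells when no local endpoint is healthy, might look like this in Python:

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    address: str        # hypothetical host:port
    cell: str           # cell the endpoint runs in
    healthy: bool = True


def pick_endpoint(endpoints, caller_cell, rng=random):
    """Prefer a healthy endpoint in the caller's cell; fall back to other cells."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        raise LookupError("no healthy endpoints")
    local = [e for e in healthy if e.cell == caller_cell]
    return rng.choice(local or healthy)
```

The cross-cell fallback is a design choice made for this sketch. A stricter policy could refuse to leave the caller's cell at all, trading availability for tighter failure containment.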

Migrating a busy "always on" platform without disrupting users is a complex undertaking. To avoid significant capital expenditure on all-new servers for the cellular infrastructure, Roblox used a small pool of spare machines to build new cells, migrated workloads to them gradually, and then reused the machines that were freed up for the next migration. This caused some desirable fragmentation of cells across different data center halls, increasing resilience within a cell. The company anticipates completing the migration by 2025, and emphasizes the need for robust tooling to deploy balanced services without disrupting users, along with thorough testing to ensure that new services are compatible with the cellular architecture.
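The recycling approach can be illustrated with a small simulation. The sketch below is a simplified model under stated assumptions, not Roblox's tooling: each workload group is assumed to need the same number of machines in its new cell as it used before, and the function name and figures are hypothetical.

```python
def rolling_migration(legacy_groups, spare):
    """Migrate groups one at a time, recycling freed machines as the next spare pool.

    legacy_groups: list of (name, machine_count) still on old infrastructure.
    spare: number of idle machines available before the first migration.
    Returns a list of (name, spare_before, spare_after) steps.
    """
    steps = []
    for name, machines in legacy_groups:
        if machines > spare:
            raise RuntimeError(f"not enough spare machines to build a cell for {name}")
        spare_before = spare
        spare -= machines  # build the new cell from spare machines
        # ... the workload is moved to the new cell here ...
        spare += machines  # old machines are freed and rejoin the pool
        steps.append((name, spare_before, spare))
    return steps


# Hypothetical groups and pool size, chosen only for illustration.
for step in rolling_migration([("chat", 40), ("matchmaking", 60), ("avatars", 50)], spare=60):
    print(step)
```

In this model the spare pool only has to be as large as the biggest single group being migrated, which is why a small pool can carry the whole migration.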

Roblox's Infrastructure Modernization Plan

Roblox's early efforts are proving successful, but the work on cells is ongoing. The platform is committed to improving efficiency and resiliency as it continues to scale. Key achievements include building a second data center, creating cells in the active and passive data centers, migrating over 70 percent of back-end service traffic to cells, and establishing requirements for implementing uniformity. In September 2023, Roblox began active-active experiments across data centers to improve reliability and minimize failover times. The successful results have led to plans for a full active-active infrastructure and helped identify system design patterns to improve. The platform envisions itself as a reliable, high-performing utility for millions of users and aims to connect a billion people in real time.

The infrastructure now comprises nearly 145,000 servers, a three-fold increase in two years, and runs primarily on premises as a private hybrid cloud. In conclusion, Roblox's current efforts to transform its infrastructure aim to make the platform more resilient and efficient for millions of users, setting the stage for continued growth and innovation.
