How Cell-Based Architecture Enhances Modern Distributed Systems

Key Takeaways

  • Cell-based architectures increase the resilience of systems by reducing the blast radius of failures.
  • Cell-based architectures are a good option for systems where any downtime is considered unacceptable or can significantly impact end users.
  • Cell-based architectures augment the scalability model of microservices by forcing fixed-size cells as deployment units and favoring a scale-out rather than a scale-up approach.
  • Cell-based architectures make it clearer where various components (which could be microservices) fit in the context of a wider system, as they are packaged and deployed as a cell rather than at the granular level of individual application services.
  • Cell-based architectures help improve the security of distributed systems by applying an additional level of security around cells.
This article is part of the "Cell-Based Architectures: How to Build Scalable and Resilient Systems" article series. In this series we present a journey of discovery and provide a comprehensive overview and in-depth analysis of many key aspects of cell-based architectures, as well as practical advice for applying this approach to existing and new architectures.

The ability to accommodate growth (or scale) is one of the main challenges we face as software developers. Whether you work in a tiny start-up or a large enterprise company, the question of how the system should reliably handle the ever-increasing load inevitably arises when evaluating how to deliver a new product or feature.

The challenges of building and operating modern distributed systems only increase with scale and complexity. Infrastructure resources, in the cloud or on-premises, can experience unexpected and difficult-to-troubleshoot failures that architecture components need to deal with to deliver the required availability.

Monoliths, Microservices, and Resiliency Challenges

Several years ago, microservices and their associated architectures became popular as they helped address some of the scaling challenges that monolithic applications faced.

As Susan Fowler mentioned in an interview with InfoQ several years ago, these applications might not support sufficient concurrency or partitioning and, therefore, would reach scalability limitations that lead to performance and stability problems. As these monolithic applications grew, they became more challenging to work on in local environments. Application deployments became increasingly complex, and developer velocity slowed to a crawl.

Microservices helped ease those problems by enabling teams to work on, deploy, and scale the services in isolation. However, microservices come with their own challenges.

One is that a microservice architecture is very granular, down to the level of individual services. As a result, development teams can lose sight of where the microservices they own are used in the context of the wider system. It also becomes more challenging to discover relevant microservices owned by other teams.

These challenges only become more prominent over time as microservice architectures grow more complex. Additionally, with the widespread adoption of cloud infrastructure, many companies now manage vast estates of cloud resources, ranging from computing to storage to networking and support services. Any of these resources can experience failures that may result in a minor or significant degradation of service. Despite the use of redundancy and failover mechanisms, some failure modes cannot be fully contained without adopting special measures.

The Re-emergence of Cell-Based Architecture

Challenges related to fault isolation are not new and are not specific to microservices or the cloud. As soon as software systems became distributed to accommodate increasing load requirements, many new failure modes had to be considered due to their distributed nature.

Cell-based architectures first emerged during the Service-Oriented Architecture (SOA) era as an attempt to manage failures within large distributed systems and prevent them from affecting the availability of the entire system. These initial implementations by companies like Tumblr, Flickr, Salesforce, and Facebook aimed to limit the blast radius of failures to only a fraction of the customer or user population (a shard), using a self-contained cell as a unit of parallelization for managing infrastructure and application resources and isolating failures.

Cell-based architecture is, first and foremost, an implementation of the bulkhead pattern, an idea that software engineering adopted from the shipbuilding industry. Bulkheads are watertight vertical partitions in the ship structure that prevent water from flooding the entire ship in case of a hull breach.

Bulkheads Are Protecting Ships From The Spread of Flooding

For years, the bulkhead pattern has been advertised as one of the key resiliency patterns for modern architectures, particularly microservices. Yet, adoption has been low, mostly due to additional complexity, so most companies choose to prioritize their efforts elsewhere.

Some high-profile companies have recently opted to revisit the cell-based architecture approach to meet the high availability requirements for their microservice-based, cloud-hosted platforms. Slack has migrated most of its critical user-facing services to use a cell-based approach after experiencing partial outages due to AWS availability-zone networking failures. DoorDash implemented zone-aware routing with its Envoy-based service mesh, moving to an AZ-based cell architecture and reducing cross-AZ data transfer costs. In turn, Roblox is rearranging its infrastructure into cells to improve efficiency and resiliency as it continues to scale.

What these companies have in common is that they run microservice architectures on top of large infrastructure estates in the cloud or private data centers, and they have experienced severe outages due to an unlimited blast radius of infrastructure or application failures. In response, they adopted a cell-based architecture to prevent failures from causing widespread outages.

Amazon Web Services (AWS) has been a long-time adopter and evangelist of cell-based architecture and covered it during its annual re:Invent conference in 2018 and again in 2022. The company also published a whitepaper on cell-based architecture in September 2023.

Building Blocks of Cell-Based Architecture

At a high level, a cell-based architecture consists of the following elements:

  • cells - self-contained infrastructure/application stacks that provide fault boundaries; responsible for handling application workloads
  • control plane - responsible for provisioning resources, deploying application services, determining routing mappings, providing platform observability, moving/migrating data, etc.
  • data plane - responsible for routing the traffic appropriately based on data placement and the cell health (as determined by the control plane)

To provide fault-tolerance benefits, cell-based architecture is oriented towards supporting cell-level isolation and low coupling between the control plane and the data plane. It’s important to ensure that the data plane can keep operating without the control plane and does not depend directly on the control plane’s health.
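One way to picture this decoupling is a router that only ever reads a local snapshot of the cell mapping. The following is a minimal sketch, not a reference implementation: `CellRouter`, `fetch_mapping`, and the simulated `flaky_control_plane` are hypothetical names introduced for illustration, and a real data plane would typically refresh on a timer and load the snapshot from a durable store.

```python
from typing import Callable, Dict


class CellRouter:
    """Data-plane router that keeps serving from its last good mapping snapshot."""

    def __init__(self, fetch_mapping: Callable[[], Dict[str, str]]):
        # fetch_mapping stands in for a call to the control plane (hypothetical).
        self._fetch_mapping = fetch_mapping
        self._snapshot: Dict[str, str] = {}

    def refresh(self) -> bool:
        """Try to pull a new mapping; keep the old one if the control plane fails."""
        try:
            new_mapping = self._fetch_mapping()
        except Exception:
            return False  # control plane unavailable: keep routing on stale data
        self._snapshot = dict(new_mapping)
        return True

    def route(self, partition_key: str) -> str:
        # Routing reads only local state and never calls the control plane.
        return self._snapshot[partition_key]


# Simulated control plane that answers once and then becomes unreachable.
calls = {"count": 0}


def flaky_control_plane() -> Dict[str, str]:
    calls["count"] += 1
    if calls["count"] > 1:
        raise ConnectionError("control plane unreachable")
    return {"customer-1": "cell-a", "customer-2": "cell-b"}


router = CellRouter(flaky_control_plane)
first_refresh = router.refresh()   # succeeds and stores the mapping
second_refresh = router.refresh()  # fails, the stale mapping is kept
print(first_refresh, second_refresh, router.route("customer-2"))
# prints: True False cell-b
```

The trade-off in this sketch is staleness: while the control plane is down, routing decisions reflect the last successful refresh, which is usually preferable to failing requests outright.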

Cell as a First-Class Architecture Construct

Adopting cell-based architecture offers an interesting blend of benefits. Cells, first and foremost, provide a fault boundary at the infrastructure level, with cell instances designated to serve a specific segment of the traffic, isolating failures to a subset of a user or customer population. However, they also offer an opportunity to group related application services into domain-specific clusters, aiding with architectural and organizational structures, promoting high cohesion and low coupling, and reducing the cognitive load on the engineering teams.

For small systems or when starting the cell-based architecture adoption effort, it’s entirely possible to have a single cell that includes all application services. For larger systems with many application services, multiple cells can be used to organize the architecture along the domain boundaries. Such an approach can help larger organizations adopt the product mindset and align the system architecture with product domains and subdomains. This is particularly important with large microservice systems consisting of hundreds of microservices.

Cell-Based Architecture Combines Domain and Fault-Isolation Boundaries

From the fault-tolerance point of view, a cell (or a cell instance) is a complete, independent infrastructure stack that includes all the resources and application service instances required for it to function and serve the workload for the designated segment of traffic (as determined by the cell partitioning strategy). It is crucial to isolate cells as much as possible to keep failures contained. Ideally, cells should be independent of other cells and not share any state or have shared dependencies like databases. Any inter-cell communication should be kept to a minimum; ideally, synchronous API calls should be avoided. Instead, asynchronous, message-driven data synchronization should be used. If API interactions can’t be avoided, they must go through the cell router so that fault-isolation properties of cell-based architecture are not compromised.

Cell deployment involves several considerations, including choosing between single-DC and multi-DC (data center) deployments and settling on the optimal cell size. Some organizations adopted cell-based architecture with single-DC deployment, where all infrastructure and application resources of a cell instance are co-located in a single data center or availability zone. Such an approach avoids much of the exposure to gray failures that comes with multi-DC deployments and simplifies health monitoring (a cell is either healthy or not). On the other hand, multi-DC deployments, when used appropriately, can provide resiliency in case of DC-level failures, but health monitoring becomes more challenging.

Cell sizing can also play an important role in managing the impact of failures and managing infrastructure costs. Using smaller cells can reduce the scope of impact (fewer users/customers affected), improve resource utilization (fewer idle resources due to higher cell occupancy levels), and limit the work required to re-route the traffic segment to other cells. However, if the cell size is too small, it may pose challenges in servicing exceptionally large clients/customers, so cells should be big enough to cater to the largest traffic segment based on the partitioning key.

On the other hand, the bigger the cells, the greater the economy of scale in terms of resources, meaning better capacity utilization. Managing fewer cells may also be easier for the operations team. However, with larger cell sizes, care needs to be taken to account for infrastructure limits, such as region- and account-level limits for cloud provider platforms.
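The sizing trade-off can be made concrete with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the function names, the load figures, and the 80% target occupancy are assumptions, and it assumes traffic is spread evenly across cells.

```python
import math
from typing import Sequence


def cells_needed(total_load: float, cell_capacity: float,
                 target_occupancy: float = 0.8) -> int:
    """Number of fixed-size cells required to carry total_load.

    target_occupancy leaves headroom in each cell (an assumed 80% by default).
    """
    usable = cell_capacity * target_occupancy
    return math.ceil(total_load / usable)


def share_affected_by_one_cell_failure(cell_count: int) -> float:
    """Fraction of traffic affected when a single cell fails (even spread assumed)."""
    return 1.0 / cell_count


def largest_segment_fits(segment_loads: Sequence[float], cell_capacity: float,
                         target_occupancy: float = 0.8) -> bool:
    """A cell must be able to host the largest traffic segment on its own."""
    return max(segment_loads) <= cell_capacity * target_occupancy


# Hypothetical platform: 10,000 load units in total, largest customer needs 900.
segments = [900, 120, 40]
for capacity in (1000, 5000):
    count = cells_needed(10_000, capacity)
    print(capacity, count,
          round(share_affected_by_one_cell_failure(count), 3),
          largest_segment_fits(segments, capacity))
```

With these made-up numbers, the smaller cell size gives a much smaller scope of impact per failure but cannot host the largest customer within the headroom target, which is the tension described above.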

Control Plane For Managing Cell-Based Architecture

Adopting the cell-based architecture requires a substantial effort to develop management capabilities that surpass those needed to support regular microservice architecture. Beyond the provisioning and deployment of infrastructure and application services, cell-based architectures require additional functions dedicated to managing and monitoring cells, partitioning and placing the traffic among available cells, and migrating data between cells.

The primary consideration for the cell-based architecture is how to partition traffic between cells, which should be determined separately for each domain using the cell-oriented approach. The first step in working out the optimal partitioning scheme is to choose the partitioning key. In most cases, this may end up being a user or customer identifier, but the choice should be made individually for each case, considering the granularity of traffic segments to avoid segments larger than the chosen cell capacity.

Cell Partitioning Can Use Different Mapping Methods

There are many methods to implement the mapping for cell partitioning, with their respective advantages and disadvantages. These methods range from full mapping, where all mapping records are stored, to using consistent hashing algorithms that offer a fairly stable allocation of items to buckets and minimize churn when adding and removing buckets. Irrespective of the selected mapping approach, it’s helpful to provide the override capability to enable special treatment for some partition keys and aid testing activities.
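As an illustration of the consistent-hashing end of that range, here is a minimal sketch of a hash ring with virtual nodes and an override table. The class name `CellRing`, the cell names, and the choice of 100 virtual nodes per cell are assumptions made for this example; it is not taken from any of the implementations mentioned in this article.

```python
import bisect
import hashlib
from typing import Dict, List, Optional


def _hash(value: str) -> int:
    # SHA-256 is stable across processes, unlike Python's built-in hash().
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class CellRing:
    """Consistent-hash ring mapping partition keys to cells, with overrides."""

    def __init__(self, cells: List[str], vnodes: int = 100,
                 overrides: Optional[Dict[str, str]] = None):
        self._vnodes = vnodes
        self._overrides = dict(overrides or {})
        self._points: List[int] = []        # sorted positions on the ring
        self._owners: Dict[int, str] = {}   # ring position -> cell name
        for cell in cells:
            self.add_cell(cell)

    def add_cell(self, cell: str) -> None:
        # Each cell is placed on the ring at several pseudo-random positions.
        for i in range(self._vnodes):
            point = _hash(f"{cell}#{i}")
            self._owners[point] = cell
            bisect.insort(self._points, point)

    def cell_for(self, partition_key: str) -> str:
        # Overrides win over the hash, e.g. for pinned customers or test traffic.
        if partition_key in self._overrides:
            return self._overrides[partition_key]
        # Otherwise walk clockwise to the next ring position (wrapping around).
        idx = bisect.bisect_right(self._points, _hash(partition_key)) % len(self._points)
        return self._owners[self._points[idx]]


keys = [f"customer-{i}" for i in range(1000)]
ring = CellRing(["cell-1", "cell-2", "cell-3"], overrides={"customer-vip": "cell-1"})
before = {k: ring.cell_for(k) for k in keys}
ring.add_cell("cell-4")
after = {k: ring.cell_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved; all moved to the new cell:",
      all(after[k] == "cell-4" for k in moved))
```

In this sketch, adding a cell only reassigns keys to the new cell, and every other key keeps its existing placement. A full mapping table would instead allow arbitrary placement per key at the cost of storing one record per key.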

The secondary consideration is the cell placement strategy when new users/customers are onboarded or new cells are provisioned. The strategy should take into account the size and available capacity of each cell and any cloud provider quotas/limits that can come into play. When the cell-capacity threshold is reached, and a new cell is needed to accommodate the amount of traffic coming to the platform, the control plane is responsible for provisioning the new cell and updating the cell mapping configuration that determines the routing of application traffic by the data plane.
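A placement decision of this kind could be sketched as follows. This is a simplified illustration under stated assumptions: the `Cell` type, the `place` function, the "most free capacity first" policy, and the 80% threshold are hypothetical, and `provision` stands in for whatever the control plane does to create a real cell.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Cell:
    name: str
    capacity: int   # load units the cell is sized for
    used: int = 0

    @property
    def free(self) -> int:
        return self.capacity - self.used


def place(partition_key: str, load: int, cells: List[Cell],
          mapping: Dict[str, str], provision: Callable[[], Cell],
          threshold: float = 0.8) -> Cell:
    """Assign a new traffic segment to a cell and record it in the mapping."""
    # Only consider cells that stay below the capacity threshold after placement.
    candidates = [c for c in cells if c.used + load <= c.capacity * threshold]
    if candidates:
        target = max(candidates, key=lambda c: c.free)
    else:
        # No existing cell has room: the control plane provisions a new one.
        target = provision()
        if load > target.capacity * threshold:
            raise ValueError("segment is larger than a single cell can host")
        cells.append(target)
    target.used += load
    mapping[partition_key] = target.name  # later read by the data plane
    return target


cells = [Cell("cell-1", capacity=100, used=70), Cell("cell-2", capacity=100, used=50)]
mapping: Dict[str, str] = {}


def provision_cell() -> Cell:
    return Cell(f"cell-{len(cells) + 1}", capacity=100)


first = place("customer-a", 20, cells, mapping, provision_cell)
second = place("customer-b", 20, cells, mapping, provision_cell)
print(first.name, second.name, mapping)
# prints: cell-2 cell-3 {'customer-a': 'cell-2', 'customer-b': 'cell-3'}
```

A production placement strategy would also need to check cloud provider quotas before provisioning, which this sketch leaves out.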

Related to the above is data migration capability, which is important for cell placement (if partition reshuffle is required) or during incidents (if a cell becomes unhealthy and needs to be drained). By their very nature, data migrations are quite challenging from the technical point of view, so this capability is one of the most difficult aspects of delivering cell-based architecture. On the upside, migrating or syncing the underlying data between data stores in different cells opens up new possibilities regarding data redundancy and failover, further improving the resiliency offered by adopting cell-based architecture.

Data Plane For Routing Application Traffic

As much as the control plane is responsible for managing the architecture, the data plane reliably moves traffic data around. In the context of cell-based architecture, that translates to routing the traffic to appropriate cells, as determined by the partition mapping records. It’s important to emphasize that the routing layer needs to be as simple and horizontally scalable as possible, and complex business logic should be avoided as the data plane is a single point of failure.

The routing layer implementation can employ solutions ranging from DNS and API Gateways to bespoke application services deployed on generic compute or container-based execution platforms. In any case, the partition mapping data has to be available to read from a reliable data store, possibly a highly available distributed database or a blob storage service. The routing layer can support synchronous API calls (HTTP or gRPC) and asynchronous messages, although the latter can be more challenging to implement.

Cell Router As a Primary Data Plane Component

Considering its crucial role in the flow of traffic between cells, the data plane can enforce security policies to ensure that only authorized API requests are served by the services inside cells. As such, a range of security mechanisms can be implemented to protect from unauthorized access, including OAuth or JWT, mutual TLS for authentication, and RBAC or ABAC for authorization.

Benefits of Using Cell-Based Architecture

The primary benefit of adopting cell-based architectures is improved resiliency through fault isolation. Cells provide failure-isolation boundaries and reduce the impact of issues such as deployment failures, clients abusing the product/platform, operator mistakes, or data corruption.

Using cells can also help the system's scalability. Ideally, cells should be limited in size to reduce the blast radius of failures, which also makes cells good as a unit for scaling the platform. As the workload increases over time, more cells can be provisioned to cater to the new traffic (new customers/users). Limiting the cell size reduces the risk of surprises from any non-linear scaling factors or unexpected contention points (performance bottlenecks).

Similarly, cells can be used for deployment scoping. Rather than rolling out a new version of a service everywhere, organizations could use canary deployments scoped to a cell (and hence a subset of users/customers) before rolling out the change to a wider user/customer population.
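A cell-scoped rollout might look something like the sketch below. It is only an illustration: the doubling wave sizes, the function names, and the `deploy_to_cell` and `is_healthy` callbacks are assumptions, standing in for whatever deployment tooling and health checks an organization already has.

```python
from typing import Callable, List


def rollout_waves(cells: List[str], growth: int = 2) -> List[List[str]]:
    """Split cells into waves: one canary cell first, then progressively larger batches."""
    waves: List[List[str]] = []
    start, size = 0, 1
    while start < len(cells):
        waves.append(cells[start:start + size])
        start += size
        size *= growth
    return waves


def deploy(cells: List[str], version: str,
           deploy_to_cell: Callable[[str, str], None],
           is_healthy: Callable[[str], bool]) -> List[str]:
    """Deploy wave by wave and stop as soon as a wave contains an unhealthy cell."""
    updated: List[str] = []
    for wave in rollout_waves(cells):
        for cell in wave:
            deploy_to_cell(cell, version)
            updated.append(cell)
        if not all(is_healthy(cell) for cell in wave):
            break  # halt: remaining cells keep running the previous version
    return updated


all_cells = [f"cell-{i}" for i in range(1, 7)]
waves = rollout_waves(all_cells)

# Simulate a release that makes cell-3 unhealthy after deployment.
deployed_log: List[str] = []
updated_cells = deploy(
    all_cells, "v2",
    deploy_to_cell=lambda cell, version: deployed_log.append(f"{cell}:{version}"),
    is_healthy=lambda cell: cell != "cell-3",
)
print(waves)
print(updated_cells)
# prints: [['cell-1'], ['cell-2', 'cell-3'], ['cell-4', 'cell-5', 'cell-6']]
# prints: ['cell-1', 'cell-2', 'cell-3']
```

In this simulated run, the faulty release reaches three of six cells before the rollout stops, so users mapped to the remaining cells are not exposed to it.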

The size-capped cells can be perfect for quantifying the system’s performance, as it’s easier to test the performance of a single cell and establish the system’s scalability characteristics based on scaling out whole cells rather than scaling up components inside cells.

Cells provide the added benefit of grouping services belonging to the same subdomain or bounded context, which can help organizations align team and department boundaries with product domain boundaries. This is particularly relevant for large organizations, where tens or hundreds of teams build and operate large product portfolios.

The last potential benefit could be cost savings from reducing cross-AZ traffic, but this should be weighed against any additional operational costs related to running the routing layer within the data plane.

Considerations for Adopting Cell-Based Architecture

While cell-based architectures offer many advantages in the context of distributed systems, implementing this approach requires additional effort and introduces challenges, so it may not be a worthwhile investment for every organization, such as startups still iterating on product-market fit. Like microservice architectures, cell-based ones require a significant investment in the underlying platform so that the architecture speeds up your teams’ velocity rather than hindering it.

Considering that most companies with non-trivial infrastructure footprint are likely to face challenges that prompted others to adopt cell-based architecture in the past, it may still be worth assessing whether a cell-based approach is worth pursuing.

First and foremost, any company that simply cannot afford widespread outages due to reputational, financial, or contractual requirements should strongly consider adopting cell-based architecture, if not for all, at least for critical user-facing services.

Furthermore, any system where a low Recovery Point Objective (RPO) or Recovery Time Objective (RTO) is required or desired should also consider a cell-based approach. Lastly, multi-tenant products requiring strict infrastructure-level isolation at the tenant level can benefit from cell-based architecture to provide fully dedicated tenancy capabilities.

In any case, the total cost of adopting cell-based architecture should be considered and balanced against expected benefits to determine the anticipated return on investment.
