Key Takeaways
- Cell-based architectures increase the resilience of systems by reducing the blast radius of failures.
- Cell-based architectures are a good option for systems where downtime is unacceptable or can negatively impact end-users.
- Cell-based architecture can be complicated, and there are best practices that can be followed to improve the chances of success.
- There are practical steps to consider when rolling out the cell-based architecture or adapting/transforming the existing cloud-native/microservices architecture to become cell-based.
- Cells are not an alternative to microservices but an approach to help manage them at scale. Many of the best practices, problems, and practical steps for microservices also apply to cells.
This article is part of the "Cell-Based Architectures: How to Build Scalable and Resilient Systems" article series. In this series we present a journey of discovery and provide a comprehensive overview and in-depth analysis of many key aspects of cell-based architectures, as well as practical advice for applying this approach to existing and new architectures.
Everything fails all the time, and cell-based architecture can be a good way to accept those failures, isolate them, and keep the overall system running reliably. However, this architecture can be complex to design and implement. This article explores the best practices, problems, and adoption guidelines organizations can use to succeed.
Cell-based Architecture Best Practices
Organizations should consider several best practices when adopting cell-based architectures to improve the manageability and resilience of their systems.
Consider The Use Case
Cell-based architecture can be more complex to build and run, and it can cost more. Not every system needs to work at the scale and reliability of S3; consider the use case and whether it’s worth the additional investment. It’s a good fit for systems that:
- Need high availability.
- Scale massively so cascading failures must be avoided.
- Need a very low RTO (Recovery Time Objective).
- Are so complex that automated test coverage is insufficient to cover all test cases.
Also, consider the size of the system. For some organizations, each cell represents an entire stack: each service is deployed in each cell, and the cells don’t communicate with each other (DoorDash, Slack). For others, each cell is its own bounded business context, and the system comprises multiple layers of cells communicating with each other (WSO2, Uber’s DOMA). The latter may be more flexible but is undoubtedly more complex.
Make Cell Ownership Clear
If multiple cell layers communicate with each other, each cell should ideally be owned by a single team empowered to build and deliver the cell’s functionality to production.
Consider making the boundary of a cell "team-sized" to make ownership easier to establish and help the team evolve the system as required by the business. Techniques like Domain-Driven Design and Event Storming can help find these boundaries.
Isolate Cells
Cells should be isolated from each other as much as possible to minimize the blast radius for both reliability and security problems. This isn’t always possible in the real world, but sharing resources should be done with care because it can significantly reduce the benefits of using cells in the first place.
On AWS, a good way to ensure isolation is to use a separate account per cell. Many accounts can be problematic to manage, but they provide excellent blast radius protection by default because you must explicitly allow cross-account access to data and resources.
It’s important to consider whether a single cell should be within a single availability zone or have its services replicated in multiple availability zones to take advantage of the physical isolation that availability zones offer. There is a tradeoff to be made here.
Single AZ
In a single-AZ design, each cell runs in a single availability zone.
The main advantage of this approach is that an AZ failure can be detected, and action can be taken to handle it, such as routing all requests to other zones.
The disadvantages are:
- Recovery can be complicated by having to replicate cell contents to another AZ, which may break the isolation properties of the cell design.
- Depending on the router design, clients may need to be aware of the zone-specific endpoints.
Multiple AZs
In a multi-AZ design, each cell runs across two or more availability zones.
The significant advantage of multi-AZ is using regional cloud resources (like Amazon DynamoDB) to make cells more resilient if a single zone fails.
The disadvantages are:
- Gray failures can occur when a service experiences problems in only one AZ, making it difficult to exclude just that AZ from a given cell.
- There may be extra cross-AZ data transfer costs. DoorDash used monitoring and a service mesh with AZ-aware routing to optimize costs by keeping traffic within the same AZ where possible.
Cell Failover
What happens if the AZ becomes unavailable in a single-AZ design? Where will the affected user requests be routed to?
One answer is not to handle failover at all: cells are designed to isolate faults. The fault will have to be rectified before the affected cells return to use.
The other option is to use a disaster recovery strategy to replicate cell data to another cell in a different AZ and start routing requests to the new cell. The risk here is that replication might reduce the cells' isolation. The replication process will depend on the data requirements and underlying data stores (regional cloud services can help here: see Leverage High-Availability Cloud Services).
Automate Deployments
Just like with microservices, to run cells at scale, you need the ability to deploy them in hours and preferably minutes - not days. Quick deployments require a standardized, automated way of managing cells, which in turn depends on investment in tooling, monitoring, and processes.
Standardization does not mean that every team needs to use the same language, database, or technologies, but there should be a well-understood, standard way to package and deploy applications to new or existing cells. Ideally, the provisioning/deployment pipelines should allow teams to:
- Create new cells.
- Monitor their health.
- Deploy updated code to them.
- Monitor the status of a deployment.
- Throttle and scale cells.
The deployment pipeline should reduce the complexity and cognitive load for platform users - exactly how this looks will depend a lot on the size and tech stack of the organization.
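As a rough illustration of the list above, here is a minimal sketch of the kind of interface such tooling could expose to teams. The class, method, and parameter names are hypothetical and not tied to any particular platform; whether this surface is a CLI, an internal portal, or a library matters less than it being the standard, well-understood path for every team.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class CellHealth:
    cell_id: str
    healthy: bool                      # aggregated from the cell's own health checks
    active_deployment: Optional[str]   # version currently rolling out, if any


class CellPlatform(Protocol):
    """Operations a cell-aware deployment pipeline could expose to teams (hypothetical)."""

    def create_cell(self, region: str, az: Optional[str] = None) -> str:
        """Provision a new, empty cell and return its identifier."""

    def deploy(self, cell_id: str, artifact_version: str) -> str:
        """Deploy a packaged application version to one cell; returns a deployment id."""

    def deployment_status(self, deployment_id: str) -> str:
        """Report whether a deployment is pending, in progress, succeeded, or rolled back."""

    def health(self, cell_id: str) -> CellHealth:
        """Return the aggregated health of a cell."""

    def set_throttle(self, cell_id: str, max_requests_per_second: int) -> None:
        """Limit traffic admitted to a cell, e.g. during scale-out or incident response."""
```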
Make Routing Reliable
The router above the cells is arguably the most critical part of the system: without it, nothing else works, and it can become a single point of failure. It’s essential to design it to be as simple as possible, so there are a few things to consider (a minimal routing sketch follows this list):
- The technology: DNS, API Gateway, a custom service. Each has its own advantages and disadvantages (for example, managing time-to-live for DNS).
- Leverage high-availability services. For example, if the router needs to store a customer’s cell assignment, use S3 or DynamoDB, which have very high SLAs, instead of a single MySQL instance.
- Separate the control and data planes. For example, customer cells could be stored in S3, and the router could look up the data in the bucket. A separate control plane manages the bucket's contents, and the control plane can fail without affecting routing.
- Think about where authentication should happen. For example, should it be:
    - In the router, which simplifies downstream services but adds a big blast radius if it fails.
    - In the cells, which may add complexity and repetition to each cell.
- The router must be aware of cell locations and health to route requests away from failed or draining cells.
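As a rough sketch of the control plane/data plane split described above, the control plane could publish the customer-to-cell mapping to S3 while the router only ever reads it. The bucket name, object key, and JSON layout below are assumptions for illustration, not a prescribed design. Because the data plane only reads the last published object, a control-plane outage degrades the ability to change routing but does not stop requests from being routed.

```python
import json

import boto3

# Assumed names for illustration: the bucket, object key, and JSON layout are
# choices you would make yourself, not a prescribed design.
MAPPING_BUCKET = "cell-routing-config"
MAPPING_KEY = "customer-to-cell.json"

s3 = boto3.client("s3")


def publish_mapping(mapping: dict) -> None:
    """Control plane: write the full customer -> cell mapping as one JSON object.

    Runs out of band (when cells are created or customers migrated); if the
    control plane is down, the last published mapping stays in place and the
    data plane keeps routing.
    """
    s3.put_object(
        Bucket=MAPPING_BUCKET,
        Key=MAPPING_KEY,
        Body=json.dumps(mapping).encode("utf-8"),
        ContentType="application/json",
    )


def resolve_cell(customer_id: str) -> str:
    """Data plane: look up which cell should serve this customer.

    In practice the router would cache this object in memory rather than
    reading S3 on every request (see the caching sketch later in the article).
    """
    obj = s3.get_object(Bucket=MAPPING_BUCKET, Key=MAPPING_KEY)
    mapping = json.loads(obj["Body"].read())
    return mapping[customer_id]
```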
Limit Cell Communication
If multiple cell layers communicate with each other, it should be through well-defined APIs that help encapsulate the cell’s logic and allow services inside the cell to evolve without excessively breaking the API contract. Depending on the complexity requirements, this API may be exposed directly by services in the cell or by a gateway at the edge of the cell.
Avoid chatty communication between cells. Limiting dependencies between cells will help them maintain fault isolation and avoid cascading failures.
You may want to use an internal layer to orchestrate traffic between cells, such as a service mesh, an API gateway, or a custom router. Again, care must be taken to ensure that whatever is used is not a single point of failure. Asynchronous messaging may also help as long as the messaging layer is reliable.
Leverage High-Availability Cloud Services
As mentioned in the routing section above, many cloud services are already architected for high availability (often using cells themselves, as EBS and Azure AD do). These services can simplify your choices and avoid reinventing the wheel.
Consider the SLAs of the cloud services, whether they are global, regional, or zonal, and how this will affect the system's performance if a given cloud service fails.
Potential Problems with Cell-based Architecture
Get Organization Buy-In
Cell-based architecture can be complex to build and run and has higher costs, so like many technology projects, it will need organizational buy-in to be successful.
For management, focusing on the business impact can be helpful, such as increased velocity (teams can more confidently deploy new code) and higher availability (happy customers and better reputation).
It will also need support and investment from architecture, DevOps, and development teams to build and run the cells with sufficient isolation, monitoring, and automation, so be sure to get them involved early to help guide the process.
Avoid Sharing Between Cells
Sharing resources like databases between cells may seem like a good way to reduce complexity and cost, but it reduces the isolation between cells and makes a failure in one cell more likely to affect other cells.
The key question is: how many cells would be affected if this shared resource failed? If the answer is many, then there is a problem, and the benefits of cell-based architecture are not being fully achieved.
A shared database can be a helpful step on a migration journey to cells but should not be shared indefinitely; there should also be a plan to split the database.
Avoid Creating an Overly-Complex Router
The router can be a single point of failure, and the risk of failure grows as its complexity increases. It can be tempting to add functionality to the router to simplify the cell services, but each decision must be weighed against the overall reliability of the system. Perform some failure mode analysis to identify and reduce the failure points in the router.
For example, if the router needs to look up cell mappings from a database, it may be quicker and more reliable to load the mappings into memory when the router starts up rather than querying the database for every request.
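A minimal sketch of that idea, assuming the mappings live in a DynamoDB table (the table and attribute names are made up), could look like this:

```python
import threading
import time

import boto3

# The table and attribute names below are assumptions for this sketch.
table = boto3.resource("dynamodb").Table("customer-cell-mapping")

_cache = {}
_lock = threading.Lock()


def load_mappings() -> None:
    """Read the whole mapping table once and swap it into memory.

    Pagination is omitted for brevity; a real scan would follow
    LastEvaluatedKey until every item has been read.
    """
    items = table.scan()["Items"]
    fresh = {item["customer_id"]: item["cell_id"] for item in items}
    with _lock:
        _cache.clear()
        _cache.update(fresh)


def lookup_cell(customer_id: str) -> str:
    """Request path: a dictionary lookup instead of a database query per request."""
    with _lock:
        return _cache[customer_id]


def refresh_loop(interval_seconds: int = 60) -> None:
    """Background refresh; if a refresh fails, keep serving the last good copy."""
    while True:
        time.sleep(interval_seconds)
        try:
            load_mappings()
        except Exception:
            pass  # stale-but-available beats failing every request


# At router startup: load once, then refresh in the background.
load_mappings()
threading.Thread(target=refresh_loop, daemon=True).start()
```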
Missing Replication and Migration Between Cells
It can be tempting to consider cell migration as an advanced feature and skip it at the start of the project, but it’s vital to the success of the architecture. If a cell fails or becomes overloaded (e.g. two big customers end up on the same cell), some customers need to migrate to another cell. How this looks in practice will depend on the routing and data partitioning, but the general idea is:
- Identify a cell to migrate to (either an existing cell with capacity or a newly created one).
- Replicate any required data from the old cell’s databases into the target one.
- Update the router configuration to make the target cell active for the relevant customers.
Integration with the routing layer is also needed to ensure requests are routed to the right cell at the right time.
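The following toy, in-memory sketch shows only the ordering of those steps; a real implementation would call the platform’s provisioning, replication, and routing APIs, and all names and data structures here are illustrative.

```python
# A toy, in-memory model of the migration steps above. A real implementation
# would call the platform's provisioning, replication, and routing APIs rather
# than mutating dictionaries; names and data here are illustrative only.

routing_table = {"customer-42": "cell-1"}                        # router's customer -> cell mapping
cell_data = {"cell-1": {"customer-42": ["order-1", "order-2"]}, "cell-2": {}}
spare_capacity = {"cell-1": 0, "cell-2": 10}                     # free "slots" per cell


def migrate_customer(customer_id: str) -> str:
    source = routing_table[customer_id]

    # 1. Identify a target cell with spare capacity (or provision a new one).
    target = max(spare_capacity, key=spare_capacity.get)
    if target == source or spare_capacity[target] == 0:
        raise RuntimeError("no suitable cell available; provision a new cell first")

    # 2. Replicate the customer's data from the source cell into the target cell.
    cell_data[target][customer_id] = list(cell_data[source][customer_id])

    # 3. Only once the data is in place, update the router so new requests
    #    for this customer land on the target cell.
    routing_table[customer_id] = target

    # 4. Clean up the old copy after a safety window (immediate here for brevity).
    del cell_data[source][customer_id]
    spare_capacity[source] += 1
    spare_capacity[target] -= 1
    return target


print(migrate_customer("customer-42"))  # -> cell-2
```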
Replication may be triggered by cell failure, or cells may be replicated continuously so another cell is always ready to go. Exactly how this replication looks will depend on the cell’s data schema, recovery point objective (RPO), and recovery time objective (RTO) requirements: database-level replication, messaging, and S3 are all options. See the Disaster Recovery of Workloads on AWS whitepaper for more discussion of recovery strategies.
Avoid Limits on Cloud Resources
If the system consumes many cloud resources per cell, it may hit soft or hard limits imposed by the cloud provider. It may be possible to request increases to soft limits, but hard limits can be imposed by service or hardware constraints and are fixed.
On AWS, many limits can be avoided by using a separate account per cell.
Balance Duplication of Logic and Data
There is a tradeoff between keeping cells as isolated as possible and avoiding duplication of logic and data between services. The same tradeoff exists with microservices and the "Don’t Repeat Yourself" (DRY) principle.
As systems grow, it can be better to avoid tight coupling and promote isolation by duplicating code between services in different cells and potentially even duplicating data if it makes sense. There is no generic right or wrong answer to this problem: it should be assessed on a case-by-case basis. Conducting failure mode analysis can help identify when a dependency between cells might be a problem and when it should be removed, possibly by duplication.
Adoption Guidelines
You’ve decided a cell-based architecture is a good fit - now what?
Migration
To quote Martin Fowler: if you do a big-bang rewrite, the only thing you're certain of is a big bang.
Migrating an existing microservice architecture to become cell-based can be tricky. A common first step is to define the first cell as the existing system with a router placed on top, and then peel off services into new cells, in the same way a monolith-to-microservice migration might happen.
Organizations can use many monolith-to-microservice strategies. For example:
- Use Domain-Driven Design (DDD) to define bounded contexts and help decide what goes in new cells.
- Migrate service logic into separate cells first, then split shared data into cell-specific databases in a subsequent phase.
- Consider which business areas would benefit from greater resiliency when choosing what to split into cells first.
- Ensure sufficient automation and observability to manage the new, more complex system.
Deployment
In a cell-based architecture, the unit of deployment is a cell. New application versions should be deployed to a single cell first to test how they interact with the rest of the system while minimizing the risk of widespread breakage. Use techniques like canaries or blue/green deployments to make incremental changes and verify that the system is still performing as expected before continuing the rollout (usually in waves).
If the new version has problems, the changes should be rolled back, and the deployment should be paused until further investigation can pinpoint the issue.
The concept of ‘bake time’ is also crucial: the cell running the new version should serve real traffic long enough for monitoring to detect problems. The exact time will vary from minutes to hours, depending on the kind of system, risk appetite, and complexity.
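A simplified sketch of wave-based rollout with bake time might look like the following; the three helper functions are stand-ins for real deployment, monitoring, and rollback automation and are entirely hypothetical.

```python
import time

# Sketch of wave-based rollout with bake time. The three helpers are stand-ins
# for real platform calls (deployment API, monitoring queries, rollback
# automation); everything about them is hypothetical.


def deploy_to_cell(cell: str, version: str) -> None:
    print(f"deploying {version} to {cell}")


def cell_is_healthy(cell: str) -> bool:
    return True  # in reality: query error rates / SLO metrics for this cell


def rollback_cell(cell: str) -> None:
    print(f"rolling back {cell}")


def roll_out(version: str, waves: list, bake_seconds: int) -> bool:
    """Deploy `version` wave by wave, baking and checking health between waves."""
    for wave in waves:
        for cell in wave:
            deploy_to_cell(cell, version)

        time.sleep(bake_seconds)  # let the new version serve real traffic

        if not all(cell_is_healthy(cell) for cell in wave):
            for cell in wave:
                rollback_cell(cell)
            return False          # pause the rollout and investigate
    return True


# Typical shape: a single canary cell first, then progressively larger waves.
roll_out("v2.1.0", [["cell-1"], ["cell-2", "cell-3"], ["cell-4", "cell-5", "cell-6"]], bake_seconds=1)
```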
Observability
In addition to monitoring microservices the right way, there should be cell-specific monitoring and dashboards showing aggregate and cell-level views of:
- The number of cells.
- Cell health.
- Status of deployment waves.
- Any SLO metrics important for a cell.
Many of these can be derived from standard cloud metrics, but additional tagging standards may be needed to get cell-level views.
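For example, on AWS, emitting metrics with a cell identifier as a dimension makes per-cell dashboards and alarms straightforward. The namespace, metric, and dimension names below are conventions you would choose yourself, not anything AWS mandates.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")


def record_request(cell_id: str, latency_ms: float, success: bool) -> None:
    """Publish per-request metrics tagged with the cell that served them."""
    cloudwatch.put_metric_data(
        Namespace="MyService/Cells",          # assumed namespace
        MetricData=[
            {
                "MetricName": "RequestLatency",
                "Dimensions": [{"Name": "CellId", "Value": cell_id}],
                "Value": latency_ms,
                "Unit": "Milliseconds",
            },
            {
                "MetricName": "RequestErrors",
                "Dimensions": [{"Name": "CellId", "Value": cell_id}],
                "Value": 0.0 if success else 1.0,
                "Unit": "Count",
            },
        ],
    )
```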
Because cell-based architecture risks increasing cloud usage and, therefore, cost, it’s essential to keep track of resource usage and cost per cell. The goal is to allow teams to ask questions like "How much does my cell cost?", "How can I make more efficient use of the resources?" and "Is the cell size optimized?".
Scaling
In a cell-based architecture, the unit of scaling is a cell: more can be deployed horizontally in response to load. The exact scaling criteria will depend on the workload but could include the number of requests, resource usage, customer size, etc. How far scaling can be taken will depend on how isolated the cells are: any shared resources will limit scalability.
The system should also know the limits of each cell and avoid sending it more traffic than its resources can handle, for example, by shedding load at the router or within the cell itself.
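A minimal in-process load shedder might look like the sketch below; the capacity number is assumed to come from load testing the cell, and a real system might enforce the limit at the router instead of (or as well as) inside the cell.

```python
import threading

# MAX_IN_FLIGHT is an assumed figure derived from load testing the cell.
MAX_IN_FLIGHT = 200

_in_flight = 0
_lock = threading.Lock()


def try_acquire() -> bool:
    """Admit the request only if the cell is below its known capacity."""
    global _in_flight
    with _lock:
        if _in_flight >= MAX_IN_FLIGHT:
            return False  # shed: better to reject a few requests than degrade everyone
        _in_flight += 1
        return True


def release() -> None:
    global _in_flight
    with _lock:
        _in_flight -= 1


def handle_request(do_work) -> str:
    """Wrap request handling with admission control; `do_work` is the real handler."""
    if not try_acquire():
        return "503: cell at capacity, retry later or on another cell"
    try:
        return do_work()
    finally:
        release()
```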
Cell Sizing
Deciding on the size of each cell is a crucial tradeoff. Many smaller cells mean a smaller blast radius because each cell handles fewer user requests. A small cell can also be easier to test and manage (for example, a quicker deployment time).
On the other hand, larger cells may make better use of the available capacity, make it easier to fit a large customer into a single cell, and make the whole system easier to manage because there are fewer cells.
It’s a good idea to think about:
- The blast radius.
- Performance. How much traffic can fit into a cell, and how does that affect its performance?
- Headroom in case existing cells need to start handling traffic from a failed cell (a simple calculation follows this list).
- Balancing allocated resources so cells are neither underpowered for their expected load nor overpowered and unnecessarily expensive.
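For headroom, a simple back-of-the-envelope check helps: if one of n cells fails and its traffic is spread evenly across the survivors, each remaining cell must absorb roughly 1/(n-1) extra load. The numbers below are purely illustrative.

```python
def utilization_after_failure(cells: int, current_utilization: float) -> float:
    """Per-cell utilization if one of `cells` fails and its load is spread evenly."""
    return current_utilization * cells / (cells - 1)


# Illustrative numbers only:
print(utilization_after_failure(10, 0.70))  # ~0.78: survivors have comfortable headroom
print(utilization_after_failure(4, 0.70))   # ~0.93: survivors are close to saturation
```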
The advantages of smaller cells are:
- They have a smaller blast radius so that any failure will affect a smaller percentage of users.
- They are less likely to hit any cloud provider quota limits.
- They lower the risk of testing new deployments because targeting a smaller set of users is easier.
- Fewer users per cell mean migrations and failovers can be quicker.
The advantages of larger cells are:
- They are easier to operate and replicate because there are fewer of them.
- They utilize capacity more efficiently.
- They reduce the risk of having to split large customers across multiple cells.
The correct choice will depend heavily on the exact system being built. Many organizations start with larger cells and move to smaller ones as confidence and tooling improve.
Data Partitioning
Closely related to cell sizing is partitioning data and deciding which cell customer traffic should be routed to. Many factors can inform the partitioning approach, including business requirements, the cardinality of data attributes, and the maximum size of a cell.
The partition key could be a customer ID if the requests can be split up into distinct customers. Each cell is assigned a percentage of the customers so that the same customer is always served by the same cell. If some customers are larger than others, then care should be taken that no single customer is bigger than the maximum size of the cell.
Other options are geographical region, market type, round-robin, or load-based.
Whatever approach is used, it can also be beneficial to override the router and manually place a customer in a specific cell for testing and isolating certain workloads.
Mapping
Using a customer ID implies the router will need to map customers to cells. The most straightforward approach to storing the mapping data could be a table that maps every customer to a cell.
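For illustration only (the customer IDs and cell names below are made up), such a table might look like:

| Customer ID  | Cell   |
|--------------|--------|
| customer-001 | cell-1 |
| customer-002 | cell-3 |
| customer-003 | cell-1 |
| customer-004 | cell-2 |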
The significant advantage is it’s pretty simple to implement and simplifies migrating customers between cells: just update the mapping in the database.
The disadvantage of this approach is that it requires a database, which may be a single point of failure and raise performance concerns.
Other approaches are consistent hashing and mapping a range of keys to a cell. However, both are less flexible: they risk creating hot cells and make migrations more challenging.
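The sketch below shows a simple modulo variant of hash-based placement; true consistent hashing reduces reshuffling when the number of cells changes, but the basic inflexibility is similar: the cell is derived from the key, so moving one customer requires an explicit override, and there is no control over which customers land together.

```python
import hashlib

# Illustrative only: a simple modulo form of hash-based placement. No mapping
# table is needed, but moving a single customer requires an explicit override,
# and changing NUM_CELLS reshuffles many customers at once (which consistent
# hashing only mitigates, not eliminates).
NUM_CELLS = 8


def cell_for(customer_id: str) -> str:
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return f"cell-{int(digest, 16) % NUM_CELLS}"


print(cell_for("customer-001"))  # the same customer always maps to the same cell
```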
Measure Success
Ideally, organizations should consider adopting cell-based architecture to achieve specific business goals, such as improving customer satisfaction by improving the technology platform’s stability.
Throughout the migration, it should be possible to measure progress towards those goals. Often, the goal is resiliency in the face of failure, where some quantitative measures are useful:
- Health metrics, including error rates or uptime (e.g., when EBS moved to cells, the error rate dropped dramatically).
- MTTR (mean time to repair).
- Performance metrics, including p75, p95, and p99 request processing times, to see if the extra layers adversely impact latency. Performance may improve if customers are now served from smaller cells than the previous system!
- Resource usage to make sure that cost is not getting out of control and can be optimized if necessary.
These all imply good observability to measure performance, reliability and cost.
Conclusions
Cell-based architecture can be daunting and complex to implement, but many good practices will be familiar to microservice developers. Any architecture at this scale should include deployment automation, observability, scaling, and fault recovery; cell-based architecture is no exception. These must be considered when designing cell size, cell isolation, data ownership, and a strategy to recover from failures.
Perhaps the crucial decision to be made is around data partitioning and, closely related, how request traffic is allocated and mapped to cells. More straightforward approaches may be easier to implement, but they typically lack the flexibility required to run cells at scale.
The public cloud providers offer many high-availability services that can be leveraged to improve reliability while simultaneously simplifying the design. AWS has the most presence online for cell-based architecture, with talks about how they applied this pattern to their own systems and advice for using AWS services to implement it.
Organizations must ensure that the cell-based architecture is the right fit for them and that the migration will not cause more problems than it solves. Migrating an existing system to a cell-based architecture can be done in steps to minimize disruption and validate the changes work as expected before proceeding.
The challenges in building modern, reliable, and understandable distributed systems continue to grow, and cell-based architecture is a valuable way to accept failures, isolate them, and keep systems reliable when they occur.
This article is part of the "Cell-Based Architectures: How to Build Scalable and Resilient Systems" article series. In this series we present a journey of discovery and provide a comprehensive overview and in-depth analysis of many key aspects of cell-based architectures, as well as practical advice for applying this approach to existing and new architectures.