Taking Advantage of Cell-Based Architectures to Build Resilient and Fault-Tolerant Systems

Key Takeaways

  • Cell-based architecture improves the resiliency and fault tolerance of microservices.
  • Observability is key for developing and operating cell-based architecture.
  • The cell router is a key component of the cell-based architecture, and it needs to react quickly to cell availability and health changes.
  • A holistic and comprehensive approach toward observability is required to achieve successful cell-based architecture adoption.
  • Cell-based architecture utilizes the same observability pillars as microservices but requires customization to accommodate elements specific to this type of architecture.
This article is part of the "Cell-Based Architectures: How to Build Scalable and Resilient Systems" article series. In this series we present a journey of discovery and provide a comprehensive overview and in-depth analysis of many key aspects of cell-based architectures, as well as practical advice for applying this approach to existing and new architectures.

Cell-based architectures have emerged as a paradigm over the last few years, adopted by companies such as Slack (which migrated its most critical user-facing services from monolithic to cell-based architectures), Flickr (which employed a federated approach to store users’ data on a shard or cluster of many services), Salesforce (which designed a solution in terms of pods, each self-contained and consisting of 50 nodes), and Facebook (which proposed building blocks with services called cells, each cell consisting of a cluster, a metadata store, and controllers in Zookeeper). These companies have used cell-based architectures to address the challenges of resilience and fault tolerance. The reasons for their popularity include isolation of failures, improved scalability, simplified maintenance, enhanced fault tolerance, flexibility, and cost-effectiveness.

In the journey to achieve resilience and fault tolerance, the proponents of cell-based architectures have relied on observability, which has played a crucial role in complementing the implementations. This is the case for Interact, one of the first companies that documented how observability was vital to guaranteeing healthy cell-based architectures. The engineering team of Interact used observability to provide deep insights into system behavior, enabling them to detect issues proactively and facilitating faster recovery from failures. Specifically, they used the maximum number of hosted clients and the maximum number of daily requests per cell to create the new infrastructure alongside the existing architecture.

This article delves into the resiliency and fault tolerance benefits of adopting cell-based architectures, focusing on the observability aspect. The first part addresses a common question: why use cell-based architectures if microservices are already resilient and fault-tolerant? With that explanation, the second part focuses on observability and the considerations for analyzing inputs and outputs of the cell-based architectures. Finally, it goes over the best practices and takeaways to achieve the visibility needed to detect issues early, diagnose problems quickly, and make informed decisions that enhance resilience and fault tolerance.

Why Use Cell-based Architectures if Microservices are Resilient and Fault Tolerant?

Microservices reduce the risk that a single error brings down the entire system because they are smaller, independently deployable units, so a failure in one microservice need not affect the whole application. However, the complexity of inter-service communication can reduce resilience and fault tolerance. While microservices are well-suited to large-scale enterprise applications focused on modularity and manageability, cell-based architectures offer advantages in scenarios requiring extreme modularity, scalability, and resource efficiency. This is why Tumblr, which went from being a startup to a wildly successful company in a few months, preferred to migrate from a monolith to a cell-based architecture rather than to microservices. Scalability was a priority because the team had to evolve its infrastructure while handling huge month-over-month increases in traffic.

Cell-Based Strategy: High Availability for Rapid Growth Requirements

Opting for a microservices-based architecture requires carefully weighing its advantages against its drawbacks. While it offers improved scalability, fault tolerance, and easier operations, it also introduces complexities in implementation and management. Cell-based architectures, by contrast, are a good fit for systems that prioritize high availability, that must sustain rapid growth, or where the ability to scale individual components and isolate failures is a priority.

Cell-based architecture is not a universal solution but a strategic choice that aligns with specific business and technical needs. Figure 1 illustrates how architectures based on microservices can segregate larger systems into components that wrap up bounded context business domains. Figure 2 shows how cell-based architectures simplify the complexity associated with communication among those services, where each cell is identical and represents an entire stack scaling independently.

Figure 1. Architectures based on Microservices

Figure 2. Architectures based on cells

Regarding Figure 2, there are two perspectives for implementing cell-based architectures: one in which cells are immutable components that together provide a service, and another in which each cell is identical and represents an entire service. In the first perspective, cells can communicate with each other. In the second, cells are built, deployed, and managed independently as a complete unit since there is no communication among cells.

Cell-based architectures can offer improved resilience and fault tolerance, but how can the operators determine if the system delivers these benefits? The answer is observability.

Considerations to Observe Cell-based Architectures

Cell-based architectures offer a robust approach to building resilient systems. They achieve this through the core principles of isolation, autonomy, and replication. Each cell operates independently, managing its resources and making decisions autonomously. Data and critical services are replicated within the cell for enhanced availability.

These architectures distribute cells across multiple zones or data centers to ensure resilience and fault tolerance, protecting against regional outages. Continuous health checks and monitoring detect failures early, while circuit breakers prevent cascading failures. Load balancing ensures efficient traffic distribution and graceful degradation prioritizes essential functionality during partial failures. Chaos engineering regularly tests resilience by simulating failures.
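As an illustration of the circuit breaker mentioned above, the following is a minimal sketch rather than a production implementation. The class name, the consecutive-failure threshold, and the cooldown value are assumptions chosen for the example; they do not come from any specific library.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures and allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value for illustration
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open state: permit a trial request once the cooldown has elapsed.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

    def call(self, func, *args, **kwargs):
        """Run func through the breaker, rejecting the call while the circuit is open."""
        if not self.allow_request():
            raise RuntimeError("circuit open: call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.record_failure()
            raise
        self.record_success()
        return result
```

Wrapping calls to a dependency in `call` means that a dependency that keeps failing is cut off quickly instead of tying up callers, which is how a breaker limits cascading failures.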

Observability is the state-of-the-art tool for understanding the state and inner workings of current implementations. Although a system can work without this, collecting, processing, aggregating, and displaying real-time quantitative metrics increases resilience and fault tolerance. That is precisely one of the reasons for including it as a principle in Site Reliability Engineering.

Observability is a Pillar of Great Architectures

In addition to being a strategy for understanding the behavior of a system, observability is crucial to achieving the goals of good architecture, particularly in the areas of operational excellence, reliability, and performance efficiency. Figure 3 illustrates the common pillars of well-architected frameworks and shows how they relate to observability. In terms of operational excellence, observability provides the insights needed to understand how systems perform, identify potential problems, and make informed decisions about optimizing them. To achieve performance efficiency, observability enables organizations to identify bottlenecks and inefficiencies in their systems so they can take action to improve performance and reduce costs. Finally, for reliability, observability helps prevent failures and minimize downtime by monitoring system behavior and detecting anomalies early.

Figure 3. Well-Architected Framework + Observability

To observe a cell-based architecture, the first step is to define the objectives and identify metrics such as mean time between failures (MTBF), mean time to repair (MTTR), availability, and recovery time objectives (RTOs), which are well-suited to assessing the level of resilience and fault tolerance. Once the metrics are clear, the next activity is to provide instrumentation mechanisms incorporating logging, metrics collection, tracing, and event tracking to gather relevant data. A robust infrastructure is then established to collect and aggregate this data efficiently. At this point, observers usually store the collected data in appropriate repositories such as time-series databases and process it through filtering, transformation, and enrichment. Analysis tools and visualizations are then used to derive insights, identify patterns, and detect anomalies. These insights are integrated into development and operational workflows, establishing feedback loops that drive system design and performance improvements. Finally, the process is iteratively refined, with continuous adjustments to instrumentation, data collection, and analysis based on feedback and evolving requirements. The complete process is illustrated in Figure 4.
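To make the metrics named above concrete, here is a small sketch of how MTBF, MTTR, and availability could be derived for a single cell from an observation window and a list of outage durations. The function name and the sample numbers are hypothetical; the calculation uses the common steady-state relation availability = MTBF / (MTBF + MTTR).

```python
def reliability_metrics(window_hours, outage_hours):
    """Derive MTBF, MTTR, and availability from one observation window.

    window_hours: total length of the observation window, in hours.
    outage_hours: duration of each outage within the window, in hours.
    """
    failures = len(outage_hours)
    if failures == 0:
        # No failures observed: MTBF and MTTR are undefined for this window.
        return {"mtbf_hours": None, "mttr_hours": None, "availability": 1.0}
    downtime = sum(outage_hours)
    uptime = window_hours - downtime
    mtbf = uptime / failures    # mean operating time between failures
    mttr = downtime / failures  # mean time to repair
    return {
        "mtbf_hours": mtbf,
        "mttr_hours": mttr,
        "availability": mtbf / (mtbf + mttr),
    }


# Hypothetical example: a 30-day window (720 hours) with three outages.
metrics = reliability_metrics(720.0, [0.5, 1.0, 0.5])
```

In this example the cell was down for 2 of 720 hours, so availability equals 718 / 720. Computing these values per cell, rather than for the system as a whole, keeps one unhealthy cell from being hidden by the average.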

Tailoring Observability for Cell-based Architectures

Observability for cell-based architecture requires a tailored approach to address the unique challenges and opportunities presented by this distributed system design. Since observability rests on monitoring, tracing, and logging, cell-aware instrumentation starts with collecting metrics at the cell level: resource utilization (CPU, memory, network), request latency, error rates, and custom business metrics relevant to each cell’s function. Distributed tracing tracks requests across cell boundaries, providing insights into the flow of interactions and pinpointing bottlenecks. Finally, logs from individual cells should be aggregated into a centralized system, allowing for correlation and analysis across the entire architecture.
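A minimal sketch of cell-level metrics collection follows. It is an in-memory illustration only, with a hypothetical `CellMetrics` class; a real deployment would typically export these values to a metrics backend. The point is that every sample is keyed by a cell identifier so that latency and error rates can be read per cell.

```python
from collections import defaultdict


class CellMetrics:
    """Keeps request latency and error counts separately for every cell."""

    def __init__(self):
        self._latencies_ms = defaultdict(list)
        self._errors = defaultdict(int)

    def record(self, cell_id, latency_ms, ok=True):
        """Record one request handled by the given cell."""
        self._latencies_ms[cell_id].append(latency_ms)
        if not ok:
            self._errors[cell_id] += 1

    def snapshot(self, cell_id):
        """Return request count, error rate, and p95 latency for one cell."""
        samples = sorted(self._latencies_ms.get(cell_id, []))
        if not samples:
            return {"requests": 0, "error_rate": 0.0, "p95_latency_ms": None}
        # Nearest-rank percentile, using integer math for the ceiling of 0.95 * n.
        rank = (95 * len(samples) + 99) // 100
        return {
            "requests": len(samples),
            "error_rate": self._errors.get(cell_id, 0) / len(samples),
            "p95_latency_ms": samples[rank - 1],
        }
```

Calling `record("cell-a", 12.5)` for each request and `snapshot("cell-a")` on a schedule gives the raw material for the per-cell dashboards and alerts discussed next.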

A second consideration is the creation of cell-level dashboards tailored to each cell’s specific functions and KPIs, which enable proper monitoring and troubleshooting. With this configuration, cell-specific alerts based on cell-specific thresholds and anomalies ensure prompt notification of issues affecting individual cells.
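The idea of cell-specific thresholds can be sketched as a default set of limits plus per-cell overrides. All names and values below (`cell-payments`, the limits themselves) are hypothetical and only illustrate the shape of such a configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_error_rate: float
    max_p95_latency_ms: float


# Hypothetical values: a default for most cells and a stricter override for one cell.
DEFAULT_THRESHOLDS = Thresholds(max_error_rate=0.01, max_p95_latency_ms=500.0)
CELL_THRESHOLDS = {
    "cell-payments": Thresholds(max_error_rate=0.001, max_p95_latency_ms=200.0),
}


def evaluate_cell(cell_id, error_rate, p95_latency_ms):
    """Return a list of alert messages for one cell; an empty list means healthy."""
    limits = CELL_THRESHOLDS.get(cell_id, DEFAULT_THRESHOLDS)
    alerts = []
    if error_rate > limits.max_error_rate:
        alerts.append(
            f"{cell_id}: error rate {error_rate:.4f} exceeds {limits.max_error_rate:.4f}"
        )
    if p95_latency_ms > limits.max_p95_latency_ms:
        alerts.append(
            f"{cell_id}: p95 latency {p95_latency_ms:.0f} ms exceeds "
            f"{limits.max_p95_latency_ms:.0f} ms"
        )
    return alerts
```

With these assumed limits, an error rate of 0.005 raises an alert for `cell-payments` but not for a cell that falls back to the default thresholds.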

A third consideration related to best practices in observability is the need for a single project that integrates data from the various cell-level observability tools into a centralized platform for holistic monitoring and analysis. A centralized platform makes it easier to correlate events and metrics across cells, revealing dependencies and potential cascading failures.

Figure 4. A proposed framework for observability in cell-based architectures.

A final consideration is cell isolation testing, in which individual cells are tested to identify performance bottlenecks and failure modes specific to their functionality. Here, chaos experiments are designed to introduce controlled disruptions (e.g., network latency, resource constraints) at the cell level to assess resilience and identify weaknesses.
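One way to picture a cell-level chaos experiment is a wrapper that disturbs only the handlers of a chosen cell. The sketch below is an assumption-laden illustration: `ChaosInjector` and its parameters are hypothetical, and real experiments are usually run with dedicated fault-injection tooling at the network or infrastructure level.

```python
import random
import time


class ChaosInjector:
    """Adds latency and random failures to calls that target one cell only."""

    def __init__(self, target_cell, failure_rate=0.0, extra_latency_s=0.0,
                 rng=None, sleep=time.sleep):
        self.target_cell = target_cell
        self.failure_rate = failure_rate        # probability in [0, 1] of a simulated failure
        self.extra_latency_s = extra_latency_s  # delay added before each targeted call
        self.rng = rng or random.Random()
        self.sleep = sleep

    def wrap(self, cell_id, handler):
        """Return a handler that misbehaves only when cell_id is the target cell."""
        def wrapped(*args, **kwargs):
            if cell_id == self.target_cell:
                if self.extra_latency_s > 0:
                    self.sleep(self.extra_latency_s)
                if self.rng.random() < self.failure_rate:
                    raise ConnectionError(f"chaos: simulated failure in {cell_id}")
            return handler(*args, **kwargs)
        return wrapped
```

Because the disruption is scoped to a single cell, the experiment can confirm that other cells keep serving traffic normally while the targeted cell degrades.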

By implementing these practices, organizations can gain deep visibility into the behavior of their cell-based architecture, enabling proactive monitoring, faster troubleshooting, and improved overall system reliability and performance. Always keep in mind that the composition of a cell itself can vary from business to business, which could be an advantage since diversity is precisely one of the benefits of cell-based architectures.

How the Routing Layer Provides Resilience, Fault Tolerance, and Observability

In addition to cells and the control plane, the cell router is crucial in providing resilience and fault tolerance in cell-based architectures. Its mission is to distribute requests to the correct cell based on the partition key while presenting a single endpoint to clients. According to DoorDash, this component offers several benefits, including maintaining traffic balance even when services are unevenly distributed across availability zones. It makes it possible to set traffic weights between pods dynamically, which eliminates manual operations and reduces the blast radius of a single- or multi-AZ outage, a critical factor in fault tolerance. It also reduces traffic latency because caller services connect to more proximal callees.
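Routing by partition key can be sketched with a stable hash, as below. This is a simplified assumption, not how any particular company implements its router: the `CellRouter` class and cell names are hypothetical, and plain modulo hashing remaps many keys whenever the number of cells changes, which is why real routers often use an explicit mapping table or consistent hashing instead.

```python
import hashlib


class CellRouter:
    """Maps a partition key (for example, a customer ID) to exactly one cell."""

    def __init__(self, cells):
        if not cells:
            raise ValueError("at least one cell is required")
        self.cells = list(cells)

    def route(self, partition_key):
        # A cryptographic hash gives a mapping that is stable across processes
        # and restarts, unlike Python's built-in hash() for strings.
        digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.cells)
        return self.cells[index]


router = CellRouter(["cell-1", "cell-2", "cell-3"])
assigned_cell = router.route("customer-42")
```

The same key always lands on the same cell, so clients see a single endpoint while their data and traffic stay confined to one cell.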

To achieve fault tolerance in networks, the routing layer uses several mechanisms, which have been documented as innovative solutions for providing resilience. The first is path redundancy, in which routing protocols discover and maintain multiple paths to a destination so that, if the primary path fails, traffic is automatically rerouted through an alternate path. The second is fast rerouting, designed to detect failures quickly and converge on a new routing solution, minimizing downtime and service disruptions. The third is classical load balancing, which distributes traffic across multiple paths to prevent congestion and optimize network resource utilization. The last is failure detection and recovery, in which the routing protocol triggers the recovery process to find an alternate path once a failure is detected.
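Path redundancy combined with failure detection can be illustrated with a small selector that keeps an ordered list of paths and counts missed heartbeats. The `PathSelector` class, the heartbeat model, and the threshold of three missed heartbeats are assumptions made for this sketch; actual routing protocols use their own detection and convergence mechanisms.

```python
class PathSelector:
    """Prefers the primary path and fails over when heartbeats are missed."""

    def __init__(self, paths, max_missed=3):
        self.paths = list(paths)      # ordered by preference; the first is the primary
        self.max_missed = max_missed  # assumed detection threshold
        self.missed = {path: 0 for path in self.paths}

    def heartbeat(self, path, ok):
        """Record the result of one health probe for a path."""
        self.missed[path] = 0 if ok else self.missed[path] + 1

    def is_up(self, path):
        return self.missed[path] < self.max_missed

    def active_path(self):
        """Return the most preferred healthy path, or None if all are down."""
        for path in self.paths:
            if self.is_up(path):
                return path
        return None
```

A lower `max_missed` value detects failures faster at the cost of more false positives, which is the basic trade-off behind fast rerouting.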

The Role of the Routing Layer in Architecture Observability

The routing layer also significantly impacts observability due to the distributed nature of cell-based systems. Because it centralizes access to the cells, it is the best candidate to provide insights into the health and performance of the entire system. Observing the architecture from this component allows operators to monitor traffic patterns, latency, and errors at various points in the network, so they can pinpoint bottlenecks, identify failing components, and optimize routing decisions for better performance.

Furthermore, the routing layer can be instrumented to collect detailed metrics and logs, providing valuable data for troubleshooting and root cause analysis. For instance, tracing the path of a request across multiple cells can reveal where delays occur or errors originate. This granular visibility is essential for maintaining the reliability and efficiency of complex cell-based applications.

In conclusion, the routing layer in a cell-based architecture is not only responsible for directing traffic but also serves as a critical component for observability. Monitoring and analyzing traffic patterns provides valuable insights into the system’s behavior, enabling proactive troubleshooting and optimization. This helps the cell-based system remain resilient and scalable and perform well under varying workloads.

Best Practices to Provide Resilience, Fault Tolerance, and Observability to Cell-based Architectures

Observability in cell-based architectures is crucial for maintaining system health and performance. A fundamental best practice is centralized logging, where logs from all cells are aggregated into a unified repository. This consolidation simplifies troubleshooting and analysis, allowing operators to quickly identify and address issues across the entire system. Structured logging formats further enhance this process by enabling efficient querying and filtering of log data.
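Structured logging with a cell identifier can be sketched using Python's standard `logging` module. The field names (`cell_id`, `request_id`) and the helper `build_cell_logger` are hypothetical choices for this example; the important part is that every log line is a JSON object that a central repository can filter by cell.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Formats each record as one JSON object that includes the originating cell."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Fields passed through `extra=` become attributes on the record.
            "cell_id": getattr(record, "cell_id", None),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def build_cell_logger(name, stream=None):
    """Create a logger that writes JSON lines to the given stream (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    handler = logging.StreamHandler(stream or sys.stderr)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger
```

A call such as `logger.info("order accepted", extra={"cell_id": "cell-7", "request_id": "r-1"})` then produces a single JSON line that can be queried by `cell_id` once the logs are aggregated.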

Metrics and Monitoring

Metrics and monitoring are equally vital components of observability. Collecting detailed metrics on cell performance, resource utilization, and error rates provides valuable insights into the system’s behavior. Setting up dashboards and alerts based on these metrics allows for proactively identifying anomalies and potential bottlenecks. Visualization tools like Grafana can effectively display these metrics, making it easier to spot trends and patterns that may indicate underlying problems.

Distributed Tracing

Distributed tracing is another essential practice for understanding the flow of requests through a cell-based architecture. By tracking requests as they move across multiple cells, operators can pinpoint performance bottlenecks, latency issues, and failures in microservices interactions. Distributed tracing tools like Jaeger, Zipkin, or AWS X-Ray can help visualize these complex interactions, making diagnosing and resolving problems arising from inter-cell communication more straightforward.
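Tracing across cell boundaries depends on propagating a shared trace identifier with each request. The sketch below builds and extends headers in the W3C Trace Context `traceparent` format (version, trace ID, parent span ID, flags); it does not use the API of Jaeger, Zipkin, or AWS X-Ray, and the function names are hypothetical. In practice a tracing SDK would handle this propagation.

```python
import secrets


def new_traceparent():
    """Start a new trace at the system edge, for example in the cell router."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # flags "01" marks the trace as sampled


def child_traceparent(parent_header):
    """Create the header for an outgoing call made while handling a request.

    The trace ID is preserved so that spans emitted by different cells can be
    stitched together; only the span ID changes.
    """
    version, trace_id, _parent_span_id, flags = parent_header.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()
outgoing = child_traceparent(incoming)
```

Because `incoming` and `outgoing` share the same trace ID, a tracing backend can reconstruct the full path of a request even when it crosses several cells.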

Alerting and Incident Management

Alerting and incident management are integral to a well-rounded observability strategy. Configuring alerts based on predefined thresholds or anomalies in logs and metrics enables timely notifications of potential issues. These alerts can be sent through various channels like email and SMS or integrated into incident management platforms like PagerDuty. Having well-defined incident management processes ensures swift and organized responses to alerts, minimizing downtime and impact on the overall system.

A Holistic Approach to Observability

In addition to these core practices, adopting a holistic approach to observability is beneficial. This includes regularly reviewing and refining logging, monitoring, and tracing configurations to adapt to evolving system requirements. Additionally, incorporating feedback from incident postmortems can help identify areas for improvement in the observability strategy. By continuously enhancing observability, organizations can ensure their cell-based architectures remain resilient, performant, and easy to manage.
