Key Takeaways
- The three core strategies for managing failure in a microservices architecture are proactive testing, mitigation, and rapid response.
- If you have a small number of microservices or a shallow topology, consider delaying adoption of a service mesh, and evaluate alternative strategies for failure management.
- If you are deploying a service mesh, be prepared to invest ongoing effort in integrating the mesh into your software development lifecycle.
Microservices, done right, let you move faster
In today’s world, time to market is a fundamental competitive advantage. Responding quickly to market forces and customer feedback is crucial to building a winning business. Microservices is a powerful paradigm for accelerating your software velocity and agility: by empowering different software teams to work simultaneously on different parts of an application, it decentralizes decision making.
Decentralized decision making has two important consequences. First, software teams can make locally optimal decisions on architecture, release, testing, and so forth, instead of relying on a globally optimal standard. The most common example of this type of decision is release: instead of orchestrating a single monolithic application release, each team has its own release vehicle. The second consequence is that decision making can happen more quickly, as the number of communication hops between the software teams and centralized functions such as operations, architecture, and so forth is reduced.
Microservices aren’t free -- they introduce new failure modes
Adopting microservices has far-reaching implications for your organization, process, and architecture. In this article, we’ll focus on one of the key architectural shifts -- namely, that a microservices-based application is a distributed system. Business logic is distributed across multiple services that communicate with each other over the network, and a distributed system has many more failure modes, as highlighted in the fallacies of distributed computing.
Given these failure modes, it’s crucial to have an architecture and process that prevent small failures from becoming big failures. When you’re going fast, failures are inevitable, e.g., bugs will be introduced as services are updated, services will crash under load, and so forth.
As your application grows in complexity, the need for failure management grows more acute. When an application consists of a handful of microservices, failures tend to be easier to isolate and troubleshoot. As your application grows to tens or hundreds of microservices, with different, geographically distributed teams, your failure management systems need to scale with your application.
Managing failure
There are three basic strategies for managing failure: proactive testing, mitigation, and rapid response.
- Proactive testing. Implement processes and systems to test your application and services so that failures are identified early and often. Classical quality assurance (QA) falls into this category; although traditional test teams focused on pre-release testing, proactive testing now frequently extends into production.
- Mitigation. Implement strategies to reduce the impact of any given failure. For example, load balancing between multiple instances of a service can ensure that if a single instance fails, the overall service can still respond.
- Rapid response. Implement processes and systems to rapidly identify and address a given failure.
Service meshes
When a service fails, there is an impact on its upstream and downstream services. The impact of a failed service can be greatly mitigated by properly managing the communication between services. This is where a service mesh comes in.
A service mesh manages service-level (i.e., Layer 7) communication. Service meshes provide powerful primitives that can be used for all three failure management strategies. Service meshes implement:
- Dynamic routing, which can be used for different release and testing strategies such as canary routing, traffic shadowing, or blue/green deployments.
- Resilience patterns, which mitigate the impact of failures through strategies such as circuit breaking and rate limiting.
- Observability, which helps reduce response time by collecting metrics and adding context (e.g., tracing data) to service-to-service communication.
Service meshes add these features in a way that’s largely transparent to application developers. However, as we’ll see shortly, there are some nuances to this notion of transparency.
Will a service mesh help me build software faster?
In deciding whether or not a service mesh makes sense for you and your organization, start by asking yourself two questions.
- How complex is your service topology?
- How will you integrate a service mesh into your software development lifecycle?
Your service topology
Typically, an organization will start with a single microservice that connects with an existing monolithic application. In this situation, the benefits of the service mesh are somewhat limited. If the microservice fails, identifying the failure is straightforward. The blast radius of a single microservice failure is inherently limited. Incremental releases can also likely be accomplished through your existing infrastructure such as Kubernetes or your API Gateway.
As your service topology grows in complexity, however, the benefits of a service mesh start to accumulate. The key constraint to consider is the depth of your service call chain. If you have a shallow topology, where your monolith directly calls a dozen microservices, the benefits of a service mesh are still fairly limited. As you introduce more service-to-service communication where service A calls service B which calls service C, a service mesh becomes more important.
Integrating your service mesh into your SDLC
A service mesh is designed to be transparent to the actual services that run on the mesh. One way to think about a service mesh is that it’s a richer L7 network. No code changes are required for a service to run on a service mesh.
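To make that transparency concrete, here is a minimal sketch (in Go, standard library only) of the sidecar-proxy pattern that service meshes are built on. The port numbers and upstream address are arbitrary placeholders; this illustrates the idea, not any particular mesh's data plane:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// The application listens on localhost:9090 and knows nothing about the mesh.
	app, err := url.Parse("http://127.0.0.1:9090")
	if err != nil {
		log.Fatal(err)
	}

	// The "sidecar" proxy owns the externally visible port and forwards every
	// request to the application unchanged. Routing rules, retries, metrics,
	// and tracing headers would all be applied here, not in the service.
	proxy := httputil.NewSingleHostReverseProxy(app)
	log.Fatal(http.ListenAndServe(":8080", proxy))
}
```

Because the service only ever sees plain HTTP on its local port, features added at the proxy layer don't require changes to the service itself.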
However, deploying a service mesh does not automatically accelerate your software velocity and agility. You have to integrate the service mesh into your development processes. We’ll explore this in more detail in the next section.
Implementing failure management strategies as part of your SDLC
A service mesh provides powerful primitives for failure management, but alternatives to service meshes exist. In this section, we’ll walk through each of the failure management strategies and discuss how they apply to your SDLC.
Proactive testing
Testing strategies for a microservices application should be as real-world as possible. Given the complexity of a multi-service application, contemporary testing strategies emphasize testing in production (or with production data).
A service mesh enables testing in production by controlling the flow of L7 traffic to services. For example, a service mesh can route 1% of traffic to v1.1 of a service, and 99% of traffic to v1.0 (a canary deployment). These capabilities are exposed through declarative routing rules (e.g., linkerd dtab or Istio routing rules).
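The exact rule syntax is mesh-specific and declarative rather than code you write in your services, but conceptually a 1%/99% canary split is just weighted selection between two versions of a service. Here is a minimal Go sketch of that selection logic, with hypothetical version names:

```go
package main

import (
	"fmt"
	"math/rand"
)

// weightedRoute picks a service version according to the canary split.
// A service mesh applies equivalent logic in its proxies, driven by
// declarative routing rules rather than application code.
func weightedRoute(canaryPercent int) string {
	if rand.Intn(100) < canaryPercent {
		return "service-v1.1" // canary
	}
	return "service-v1.0" // stable
}

func main() {
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[weightedRoute(1)]++ // 1% canary, 99% stable
	}
	fmt.Println(counts)
}
```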
A service mesh is not the only way to proactively test. Other complementary strategies include using your container scheduler (such as Kubernetes) to perform rolling updates, using an API Gateway that can canary deploy, or practicing chaos engineering.
With all of these strategies, the question of who manages the testing workflow arises. In a service mesh, the routing rules could be centrally managed by the same team that manages the mesh. However, this likely won’t scale, as individual service authors will presumably want to control when and how they roll out new versions of their services. So if service authors manage the routing rules, how do you educate them on what they can and can’t do? How do you manage conflicting routing rules?
Mitigation
A service can fail for a variety of reasons: a code bug, insufficient resources, or a hardware failure. Limiting the blast radius of a failed service is important so that your overall application continues operating, albeit in a degraded state.
A service mesh mitigates the impact of a failure by implementing resilience patterns such as load balancing, circuit breakers, and rate limiting on service-to-service communication. For example, a service that is under heavy load can be rate limited so that some requests are still processed, without the entire service collapsing under the load.
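The underlying mechanism is straightforward; what the mesh adds is applying it to traffic without application changes and making the limits configurable per service. As an illustration only (not any particular mesh's implementation), a minimal token-bucket rate limiter in Go looks like this:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tokenBucket is a minimal rate limiter: tokens refill at a fixed rate,
// and each request either consumes one token or is rejected immediately.
type tokenBucket struct {
	mu     sync.Mutex
	tokens float64
	max    float64
	rate   float64 // tokens added per second
	last   time.Time
}

func newTokenBucket(ratePerSec, burst float64) *tokenBucket {
	return &tokenBucket{tokens: burst, max: burst, rate: ratePerSec, last: time.Now()}
}

func (b *tokenBucket) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.max {
		b.tokens = b.max
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	limiter := newTokenBucket(5, 5) // 5 requests/second, burst of 5
	for i := 0; i < 10; i++ {
		fmt.Printf("request %d allowed: %v\n", i, limiter.allow())
	}
}
```

The interesting question is not the mechanism but who chooses the rate and burst values for each service, which is the configuration problem discussed below.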
Other strategies for mitigating failure include using smart RPC libraries (e.g., Hystrix) or relying on your container scheduler. A container scheduler such as Kubernetes supports health checking, auto scaling, and dynamic routing around services that are not responding to health checks.
These mitigation strategies are most effective when they are appropriately configured for a given service. For example, different services handle different volumes of requests, necessitating different rate limits. How do policies such as rate limits get set? Netflix has implemented automated configuration algorithms for setting these values. Another approach is to expose these capabilities to service authors, who can configure their services correctly.
Observability
Failures are inevitable. Implementing observability -- spanning monitoring, alerting/visualization, distributed tracing, and logging -- is critical to minimizing the response time to a given failure.
A service mesh automatically collects detailed metrics on service-to-service communication, including data on throughput, latency, and availability. In addition, service meshes can inject the necessary headers to support distributed tracing. Note that these headers still need to be propagated by the service itself.
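In practice, that propagation means copying the tracing headers from each inbound request onto any outbound calls the service makes. A minimal Go sketch, assuming B3/Zipkin-style header names (the exact set depends on your tracing system) and a hypothetical downstream.local service:

```go
package main

import (
	"io"
	"log"
	"net/http"
)

// traceHeaders lists headers a mesh might inject; the exact set depends on
// the tracing system (these are B3/Zipkin-style names).
var traceHeaders = []string{
	"x-request-id", "x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid", "x-b3-sampled",
}

// handler calls a downstream service and forwards the tracing headers it
// received, so the mesh can stitch both hops into a single trace.
func handler(w http.ResponseWriter, r *http.Request) {
	req, err := http.NewRequest("GET", "http://downstream.local/api", nil)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	for _, h := range traceHeaders {
		if v := r.Header.Get(h); v != "" {
			req.Header.Set(h, v) // propagation is the service's responsibility
		}
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	io.Copy(w, resp.Body)
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```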
Other approaches for collecting similar metrics include using monitoring agents, collecting metrics via statsd, and implementing tracing through libraries (e.g., the Jaeger instrumentation libraries).
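For instance, the plain statsd protocol is simple enough that emitting a counter is a single UDP write. In practice you would typically use a client library; the metric name and agent address below are placeholders:

```go
package main

import (
	"fmt"
	"log"
	"net"
)

// emitCounter sends a single statsd counter increment over UDP.
// "name:value|c" is the plain statsd wire format for counters.
func emitCounter(addr, name string, value int) error {
	conn, err := net.Dial("udp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = fmt.Fprintf(conn, "%s:%d|c", name, value)
	return err
}

func main() {
	// 127.0.0.1:8125 is the conventional statsd port; adjust for your agent.
	if err := emitCounter("127.0.0.1:8125", "checkout.requests.total", 1); err != nil {
		log.Fatal(err)
	}
}
```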
An important component of observability is exposing alerting and visualization to your service authors. Collecting metrics is only the first step; thinking about how your service authors will create alerts and visualizations appropriate for a given service is important to closing the observability loop.
It’s all about the workflow!
The mechanics of deploying a service mesh are straightforward. However, as the preceding discussion hopefully makes clear, the application of a service mesh to your workflow is more complicated. The key to successfully adopting a service mesh is to recognize that a mesh impacts your development processes, and be prepared to invest in integrating the mesh into those processes. There is no one right way to integrate the mesh into your processes, and best practices are still emerging.
About the Author
Richard Li is the CEO/co-founder of Datawire, which builds open source tools for developers on Kubernetes. Previously, Richard was VP Product & Strategy at Duo Security. Prior to Duo, Richard was VP Strategy & Corporate Development at Rapid7. Richard has a B.S. and M.Eng. from MIT.