Applying Flow Metrics to Design Resilient Microservices

Key Takeaways

  • Every system is different, but every system has its limits in terms of capacity.
  • Measuring the flow of requests is a system-agnostic way of detecting a breach of capacity.
  • This self-detection allows for effective communication during incidents.
  • Flow metrics are measured locally but scale globally, which can prevent cascading failures.
  • Flow metrics are easy to implement and carry little operational overhead.

Designing software for resilience is an acknowledgement of the reality that everything fails. In fact, as Werner Vogels puts it, everything fails all the time. We put metrics in place to help us detect and resolve such problems and failures. We want to solve these problems because they disrupt the business in one way or another – be it through a bad customer experience or unexpected behaviour.

System Metrics Tend to Miss the Larger Picture

Metrics often become system-focused, which makes it very hard to assess the impact of a problem on the business. The customer sees a product as a whole – not as a collection of our systems. The disconnect here is that our metrics originate in systems, creating a bias towards measuring things that matter to the individual systems rather than to the business as a whole.

Take the following very typical scenario. Two teams own two services – team X owns the mobile backend and team Y owns the inventory service. All of a sudden, the mobile backend notices a latency increase in calls to the inventory service.

The mobile backend team pages the inventory team, and the inventory team notices an increase in the number of incoming requests. However, the inventory service is taking on the load well and processing requests quite consistently. The inventory service is returning a very healthy stream of 200 statuses, communicating that everything is okay. Based on these metrics, the inventory team believes that the problem must be elsewhere.

What makes this situation so confusing is that there is no holistic view across the two services. Metrics focused on a single microservice's view understand only their micro-slice of the problem, creating a clear disconnect between the metrics produced and the service's value proposition to the business. Metrics with a localised focus will only find localised problems.

Software is only a Means for Business Value

We build software to add business value. Embedding the notion of business value in systems will align business goals and alleviate the problem of localised metrics.

Let's take an analogy. Say we want to build a product represented by the graph plot shown below. To do that, we first break it up into a set of tasks. These tasks fall under the scope of different teams. Each team works on tasks in their scope. The red team spends time building the red tasks, and they come out of the other end as completed work items. In the end, the product is assembled to deliver what the customer wants.

Each phase from left to right works on a given stage of building the product – from understanding the customer requirement, breaking it up into tasks, and assigning tasks to teams, to teams working on individual tasks and ultimately the assembly process. This is where flow metrics come in. The customer wants the final assembled product and is not particularly interested in the intermediate steps. Thus, each stage needs to do its part – a single bottleneck anywhere will delay the assembly process.

Until the final assembly happens, all these items of work are only theoretically valuable for the customer. Flow is the movement of items of potential value to become something concrete for the customer. Flow Metrics, in turn, are the measure for flow.

Flow Metrics in Software Engineering

Through the software engineering process, a team picks up work items and spends time implementing them before they come out of the other end as completed work items.

In the software engineering process, we have four flow metrics:

  1. WIP (or work-in-progress) measures the number of items a team has started working on but has not yet completed. It is the number of tasks inside the funnel above.
  2. Age defines the time since an in-progress work item was started. It is the time an unfinished task spends inside the funnel.
  3. Cycle time is the amount of time elapsed between starting and finishing a task, measurable once the implementation is complete.
  4. Throughput is the number of items the team finishes per unit time.

These four metrics help assess how well teams are able to deliver what they are asked to deliver. If, instead of teams building software, we put users on one end and servers on the other, and replace work items with requests, the fundamental nature of the problem does not change. We should still be able to assess how well our system delivers what it is asked to deliver.
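
To make these definitions concrete in the request-serving setting, here is a minimal sketch of how a server could track the four metrics. The class and method names are illustrative assumptions, not part of the implementation discussed later in this article.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    // Tracks the four flow metrics for requests flowing through a server.
    public class FlowMetrics {

        private final Map<String, Instant> inProgress = new ConcurrentHashMap<>();
        private final AtomicLong completed = new AtomicLong();
        private final Instant measuringSince = Instant.now();

        // A request enters the funnel: it is now work-in-progress.
        public void started(String requestId) {
            inProgress.put(requestId, Instant.now());
        }

        // A request leaves the funnel: its cycle time is the elapsed time since it started.
        public Duration finished(String requestId) {
            Instant startedAt = inProgress.remove(requestId);
            completed.incrementAndGet();
            return Duration.between(startedAt, Instant.now());
        }

        // WIP: requests that have started but not yet completed.
        public int wip() {
            return inProgress.size();
        }

        // Age: how long an unfinished request has spent inside the funnel so far.
        public Duration age(String requestId) {
            return Duration.between(inProgress.get(requestId), Instant.now());
        }

        // Throughput: completed requests per second since measurement began.
        public double throughputPerSecond() {
            double elapsedSeconds =
                    Duration.between(measuringSince, Instant.now()).toMillis() / 1000.0;
            return completed.get() / elapsedSeconds;
        }
    }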

Applying Flow Metrics to Microservices

In its simplest form, a server serves requests from its users and depends on other systems, like a database. Users make requests to the server, the server queries the database, the database returns query results to the server, and the server then responds to the user.

Scenario 1: Request Spike

Let's take a scenario where the user makes more requests than expected -- a spike in incoming requests.

If the server is not able to process that many requests, there will be more requests in flight, that is, more WIP. During this spike, if we trace a single request's path, by the time the request gets picked up by the server, it has already waited a long time, increasing its age.

Scenario 2: Degraded Dependency

Let's take another scenario. This time, assume the database has become sluggish. Because of that, the database is not able to produce query results at the same rate. This causes more and more queries to the database to accumulate.

The pending queries to the database will increase the WIP. The age of requests will also increase because it just takes longer for the database to produce results.

Note that this time there was no spike in incoming requests – there was only an internal issue, completely unrelated to the outside world. Yet we see the same increase in WIP and age.

WIP and Age are Leading Indicators

Whenever there is too much to do with not enough capacity, there will be congestion. Metrics must be able to detect where congestion is happening – unlike the two microservices we discussed in the beginning that could not agree on the source of the problem.

WIP and age can be observed on the fly. A request does not need to complete before we can measure these, as opposed to cycle time and throughput, which are only measurable after a request has been responded to. WIP and age are leading indicators of congestion.

Let’s now apply WIP and age to the scenario we discussed in the beginning – to recall, we observed the following:

  • Increase in requests received: The first plot in the image below shows an increase in the number of incoming requests to the service.
  • Increase in latency: The second plot shows the request latency as observed by the client.

Additionally, we have the two plots for WIP and Age, our leading indicators:

  • Age: The third plot below shows the age of requests inside the server. The age is divided into two components – the processing time and the waiting time. The waiting time is the amount of time a request spends waiting before it can be picked up for processing. A directly observable insight here is that the age of delayed requests is spent mostly on waiting, while the processing time remains constant.
  • WIP: The last plot shows the number of in-progress requests inside the server. The steady increase in WIP indicates that there are too many incoming requests for the server to process.

On-the-fly Resilience with Flow Metrics

When there is congestion, the response time increases. An unresponsive server leads to a dissatisfied customer – thus, just detecting congestion is not enough – we must account for it in our systems’ design in order to uphold the customer experience even in dire times.

In the two scenarios above, our server reported that WIP and age were high. During congestion, most requests spent their age waiting for service. The red requests below will have to wait for the yellows and the greens to finish before any of the server threads even starts processing them.

To design with anticipation of congestion, we take two measures:

  • WIP Limit: We limit the number of requests that can be waiting for service by introducing a size limit on the queue of waiting requests.
  • Age Filter: If a request has spent too long waiting, we do not serve it.

Because we have a limited queue size and a filter that drops requests, we need a gatekeeper of sorts that identifies when this happens. Gatekeeper threads do not do the request processing; they only hand off the request to the worker threads that process the request, call the dependency, and so on.

When there is no place in the limited queue, or the age filter drops the request, the gatekeeper responds to the client with an appropriate status, like 429, signifying that there are too many requests and the server is not able to keep up.

Why limit WIP and Age?

First of all, we want a reasonable quality of service, so that clients never wait longer than a certain amount of time. Second, when the server is not able to serve the client, the server acknowledges this and the gatekeeper communicates this effectively to the client with an appropriate status code.

Because there is an age filter and a bounded queue, the server recovers much faster – there is only so much space now for congestion to build up. Clearing a traffic jam on a highway takes much longer than clearing one in a small alleyway.

Result 1: Request Spike

In this section and the next, we put the WIP and age limits to the test. Recall the scenario in which there was a sudden increase in traffic. Let's compare it to a server that limits WIP and age – on the left is the default server and on the right is the server that limits WIP and age. The request spike happens for ten minutes, between the blue dotted lines.

The second WIP-age limiting server performs better during the request spike because of the following:

  • Latency during congestion: The default server on the left took up to three minutes to respond to the client, while the second server took six seconds for the same request rate. 
  • Recovery: Once the additional load went away, the default server took a much longer time to recover. 
  • Communication: Instead of returning 200 statuses, the second server detected the congestion and gracefully rejected work with 429 statuses. Notice how the yellow bars (429 statuses) sit above the green bars (200 statuses) in this plot, clearly showing that some requests were beyond the server's capacity.

Result 2: Degraded Dependency

Instead of a spike in incoming requests, what happens when there is a degraded dependency, like a database becoming slow? Congestion builds up because the time to process a request increases.

This time as well, the second WIP-age limiting server provides better service than the default server:

  • Latency during congestion: The default server on the left becomes unresponsive but the second server responds with a 429 status within a predefined maximum latency (age).
  • Recovery: When the degraded dependency improves, the second server recovers immediately, while the default server takes a longer time.
  • Communication: The original server again responds falsely with 200s, masking the slow responses, while the new server rejects work when it knows that it is not able to keep up. Notice how the yellow bars eat into the new server's capacity because of a slow dependency.

Even though the rate of incoming requests from the user is constant, the second server is able to reduce its capacity dynamically when there is a degraded dependency.

SLAs between Clients and Overloaded Servers

The approach we discussed here centres on the server providing a reasonable quality of service to the client. The implications of the decisions taken by the server to achieve this need to be well understood by the client. For example, clients retrying every 429 response will only aggravate the situation. Service level agreements (SLAs) such as the following can help: (1) the server guarantees to respond within 5 seconds, and (2) the client should use a spaced/back-off retry mechanism to allow the overloaded server time to recover.
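
As a rough illustration of the client's side of such an agreement, the sketch below retries 429 responses with exponentially increasing delays using Java's built-in HttpClient. The endpoint URL, number of attempts, and initial delay are illustrative assumptions.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class BackoffClient {

        // Retries 429 responses with a spaced/back-off schedule, giving the overloaded
        // server time to recover instead of aggravating the congestion.
        static HttpResponse<String> getWithBackoff(HttpClient client, HttpRequest request)
                throws Exception {
            long delayMillis = 500;   // initial back-off between retries (assumed)
            for (int attempt = 0; ; attempt++) {
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() != 429 || attempt == 4) {
                    return response;  // success, a non-overload status, or out of retries
                }
                Thread.sleep(delayMillis);
                delayMillis *= 2;     // double the wait before the next attempt
            }
        }

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://localhost:8080/work")).build();
            System.out.println(getWithBackoff(client, request).statusCode());
        }
    }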

Implementation with Spring Boot

The above scenarios and results are from this GitHub repository, which implements two servers written in Spring Boot: (1) the default, no-limit server, and (2) the WIP-age limiting server.

Default No-Limit Server (NLS)

The default server implements a controller, which calls a method and returns its value – nothing special here; any simple Spring Boot server will look similar to this:
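
The repository's listing is not reproduced in this text, but a minimal sketch of such a controller could look like the following, with a hypothetical doWork method standing in for the actual request processing:

    import org.springframework.http.ResponseEntity;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    public class DefaultController {

        // Processes the request synchronously on the servlet thread and returns the result.
        @GetMapping("/work")
        public ResponseEntity<String> serve() {
            return ResponseEntity.ok(doWork());
        }

        // Placeholder for the actual request processing, e.g. querying the database.
        private String doWork() {
            return "done";
        }
    }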

WIP-Age Limiting Server (LS)

The WIP-Age limiting server also calls a method from its controller but this time, instead of returning the result, the method returns a DeferredResult, which makes this controller asynchronous.

The handoffRequest method’s goal is to submit the request processing to a thread pool of workers and maintain the WIP and age limits. This happens in the body of the following code:
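
The listing itself is likewise not reproduced in this text; below is a sketch of what such a method might look like inside a @RestController, assuming a ThreadPoolExecutor field named workerPool (declared as shown a little further below), an age limit constant MAX_AGE_MILLIS, and the same doWork method the default server calls. The line-by-line notes after the sketch refer to the repository's listing.

    @GetMapping("/work")
    public DeferredResult<ResponseEntity<String>> handoffRequest() {
        long arrivalTime = System.currentTimeMillis();
        DeferredResult<ResponseEntity<String>> result = new DeferredResult<>(10_000L);
        result.onTimeout(() ->
                result.setResult(ResponseEntity.status(429).body("Timed out")));
        try {
            workerPool.submit(() -> {
                // How long did this request wait before a worker picked it up?
                long waitingTime = System.currentTimeMillis() - arrivalTime;
                // Age filter: reject requests that have waited too long for service.
                if (waitingTime > MAX_AGE_MILLIS) {
                    result.setResult(
                            ResponseEntity.status(429).body("Waited too long"));
                    return;
                }
                result.setResult(ResponseEntity.ok(doWork()));
            });
        } catch (RejectedExecutionException e) {
            // WIP limit exceeded: the bounded queue backing the pool is full.
            result.setResult(
                    ResponseEntity.status(429).body("Too many requests"));
        }
        return result;
    }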

  • Line 4 creates a deferred result object in which the thread pool will put the real result asynchronously
  • Line 5 adds a request timeout handler (for when asynchronous processing takes longer than 10 seconds)
  • Line 8 submits the request to the thread pool for asynchronous processing of the request
  • Lines 12-15 implement the age filter based on how long a request waits for service
  • Line 17 calls the same method to process the request (the same method that the default server called above)
  • Lines 20-23 handle the exception when the WIP limit is exceeded

The WIP limit is enforced by the LinkedBlockingQueue that backs the thread pool. The queue has a size of 30 (this is the approximate WIP limit in this example). Whenever the threads in the pool are busy, a request will be enqueued if there is a free slot in the queue. Otherwise, a RejectedExecutionException will be thrown by the pool (this gets caught in the handoffRequest method above).

Note: Instead of direct handoffs, we use a bounded queue to avoid any backpressure in the handoff process (that may be caused by having to create new threads synchronously).
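
Under the assumptions of the sketch above, the fields backing it could be declared along these lines; when all worker threads are busy and the 30-slot queue is full, submit throws the RejectedExecutionException that handoffRequest catches:

    // Age limit for waiting requests and the worker pool whose bounded queue
    // enforces the WIP limit (both values are illustrative assumptions).
    private static final long MAX_AGE_MILLIS = 5_000;

    private final ThreadPoolExecutor workerPool = new ThreadPoolExecutor(
            4, 4,                              // fixed number of worker threads (assumed)
            0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<>(30));    // bounded queue: the approximate WIP limit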

Local Flow Metrics Enable Global Resilience

We measured WIP and age locally, on a single server. In all likelihood, though, we will have a cluster of machines serving requests. But just because we have more machines does not mean we have infinite capacity – even the cluster's capacity can be exhausted. Just as a supermarket with multiple checkout counters sees queues build up at every counter during rush hour, each machine in the cluster will notice WIP and age increasing locally when congestion builds up globally in the cluster.

We can abstract away the individual machines and view the cluster as a whole. Even though we are making local measurements on the machines of the cluster, we are not measuring the individual servers' capacities. We are measuring how requests flow through the system, irrespective of how many servers there are. If requests are served well in time for the client, it does not matter whether there are 100 servers or just one. However, if there are 1000 servers but the system becomes unresponsive to the client, there is a problem of congestion – and each server will be able to detect the unresponsiveness based on how quickly requests arrive at and depart from its local queue. Thus, local measurements of flow provide a mechanism for global resilience.

Independence from the System: Universal Applicability

Flow metrics are independent of many system characteristics, like slow dependencies, garbage collection pauses, CPU utilisation, the number of machines in a cluster, etc. This is precisely what makes flow metrics universally applicable and so robust. Google Research published a paper in 2023 illustrating how Prequal, their load balancer for YouTube, uses flow metrics to balance load instead of the commonly used metric of CPU load. They found that it is better to balance load based on requests-in-flight (WIP) and estimated latency (which is similar to our age metric).

TCP, the backbone of the internet for decades, uses flow metrics to avoid congestion. When there is a timeout, TCP reduces the number of in-flight packets – said another way, TCP reduces the WIP when the age increases. Congestion control being bundled into TCP is regarded as a critical factor in the robustness of the Internet.
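
As a toy illustration of that idea (not TCP itself, and not part of the article's implementation), an adaptive WIP limit could follow the same additive-increase/multiplicative-decrease pattern:

    // Toy sketch: adjust an in-flight (WIP) limit the way TCP adjusts its congestion window.
    public class AdaptiveWipLimit {

        private int limit = 1;            // currently allowed in-flight requests
        private final int maxLimit = 256; // upper bound (assumed)

        // A request completed in time: gently probe for more capacity (additive increase).
        public void onSuccess() {
            limit = Math.min(maxLimit, limit + 1);
        }

        // A request timed out (its age grew too large): back off sharply (multiplicative decrease).
        public void onTimeout() {
            limit = Math.max(1, limit / 2);
        }

        public int limit() {
            return limit;
        }
    }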

TCP’s congestion control is so robust because of its end-to-end nature. TCP makes no assumptions about, and expects no cooperation from, the underlying network when detecting or mitigating congestion. TCP simply measures whether the network is able to deliver what was expected to be delivered. In the same vein, flow metrics do not depend on system characteristics; they look at whether the system is able to do the amount of work it is asked to do. Flow metrics can therefore detect a wide variety of problems without assuming anything about the system they observe.

In service-oriented architectures, one congested microservice risks spreading congestion to other services. Flow metrics can prevent this by identifying work that is simply not doable and communicating this degradation gracefully. Doing so contains the blast radius of congestion within one service instead of letting it cascade across a system of interdependent microservices.

An Insurance Policy for the Flow of Business Value

Services exist to serve the client, and that alone is the server’s raison d’être for the business. In other words, a request to the server is a unit of business value. That is essentially why observing the flow of requests is important: doing so helps us detect and limit anything that disrupts the flow of requests – and thereby the flow of business value.

First, we detected when a system was at its limit by observing the WIP and age of requests. Second, we limited the radius of the problem by denying undoable workloads. Third, we gracefully communicated to our clients that our system was degraded. Last, whenever the cause of the degradation was alleviated, we ensured the service could bounce back to normal operation.

All of this was possible by placing system-agnostic resilience in the design of our system. Resilience with flow metrics is simple to implement and has very little resource overhead. It is an insurance policy: we do not leave our phones unlocked and vulnerable to attackers – similarly, we should not allow a sudden burst in traffic, a slow dependency, or a memory leak to grind our service – and possibly others – to a halt. Flow-metrics-based resilience is the insurance policy that can detect, mitigate and limit failures.
