Discovering which services need to be optimised to reduce end-to-end latency in a microservices-based system can be challenging because call graphs may be too complicated to read. Uber described an open-source tool called CRISP built to solve this problem by finding the critical paths in these graphs. Critical paths identify those operations whose optimisation benefits the overall system.
The use of specialised and independent deployment units that interact with each other is the fundamental characteristic of microservice architectures. Therefore, fulfilling one request in a microservice architecture usually involves several interactions between different microservices. The request's end-to-end latency results from a combination of all interaction latencies.
Understanding which interactions have the most significant impact helps prioritise optimisations, system configuration, capacity planning or problem diagnostics. However, it becomes increasingly difficult to reason about the impact of each interaction as microservice-based systems grow larger in size and complexity.
Millind Chabbi and Chris Zhang, researchers, and Murali Ramanathan, engineer, pointed out that the interactions between microservices form a call graph, specifically a directed acyclic graph (DAG). A rich research corpus about the study of computational graphs composed of parallel tasks already exists. Since a microservice architecture can be considered a distributed system, that research also applies to the analysis of DAGs representing calls between microservices.
A DAG representing the flow of parallel computational tasks with node weights proportional to their work. Source: https://eng.uber.com/crisp-critical-path-analysis-for-microservice-architectures/
The DAG's critical path is the key to understanding how to reduce latencies in that specific call graph. The article authors defined a critical path as follows:
Formally, the critical path in a parallel computation DAG is the longest weighted (where each node has an associated weight) path through the DAG. (...) The critical path in a parallel program DAG represents the components of the program speeding, which will speed [up] the overall execution time of the parallel program.
Uber uses Jaeger extensively to trace transactions between microservices. Jaeger generates tracing information based on "spans" used to understand the complete chain of calls related to one service request. Jaeger's spans are units of work that can have parent-child relationships: a child span represents a call to some downstream service, including its start time and duration.
Although Jaeger provides a user interface for visualising call graphs and timelines, they are difficult to understand visually in a complex environment such as Uber's.
CRISP solves this problem by reducing Jaeger traces to the essential, i.e., their critical paths. This reduction is accomplished by converting them into computational DAGs, followed by an analysis to calculate the critical path in each DAG.
Computing the critical path from an individual trace involves replaying the trace in reverse order from the end of the trace to the beginning of the trace, looking for synchronisation events.
A synchronisation event is a moment when one operation waits for the completion of another before it can continue. The end of a span is thus a synchronisation event.
A Jaeger trace represented as a DAG. Source: https://eng.uber.com/crisp-critical-path-analysis-for-microservice-architectures/
The red-coloured spans in the figure above are the critical path in the DAG. Other spans are not considered part of the critical path because reductions in their execution times do not impact the total execution time.
Since microservices are independent deployment units, they often run in different hardware. Critical path analysis depends on accurate and comparable timestamps, which can be challenging to achieve due to clock skew. CRISP's algorithm employs some strategies to minimise the impact of clock skew in span timestamps.
Multiple critical paths can be aggregated to obtain a comprehensive picture of the entire system. When doing so, CRISP provides a view of each critical path's operation time contributions to a global operation pool, summarised at percentiles 50, 95 and 99.
CRISP results are presented under different representations, such as flame graphs, heatmaps and textual summaries.
Comparable work has been carried out, for example, under the HPCToolkit project, whose purpose is to measure and analyse multithreaded or multiprocess program performance. CRISP is concerned with the interactions between programs on a distributed system, whereas HPCTookit's focus is the programs themselves.