Key Takeaways
- Distributed tracing is increasingly seen as an essential component for observing distributed systems and microservice applications. There are several popular open source standards and frameworks in this space, such as the OpenTracing API and OpenZipkin
- The basic idea behind distributed tracing is relatively straightforward -- specific request inflexion points must be identified within a system and instrumented. All of the trace data must be coordinated and collated to provide a meaningful view of a request
- Request tracing is similar in concept to Application Performance Management (APM), and an emerging challenge within both ecosystems is processing the volume of data generated by increasingly large-scale systems
- Google overcame this issue when implementing their Dapper distributed tracing system by sampling traces, typically 1 in 1000, but modern commercial tracing products claim to be able to analyse 100% of requests.
Distributed tracing is increasingly seen as an essential component for observing distributed systems and microservice applications. This article provides an introduction to and overview of this technique, starting with an exploration of Google’s Dapper request tracing paper -- which in turn led to the creation of the Zipkin and OpenTracing projects -- and ending with a discussion of the future of tracing with Ben Sigelman, creator of the new LightStep [x]PM tracing platform.
As stated in the original Dapper paper, modern Internet services are often implemented as complex, large-scale distributed systems -- for example, using the popular microservice architectural style. These applications are assembled from collections of services that may be developed by different teams, perhaps using different programming languages. At Google scale these applications span thousands of machines across multiple facilities, but even for relatively small cloud computing use cases it is now recommended practice to run multiple instances of a service spread across geographic “availability zones” and “regions”. Tools that aid in understanding system behaviour, help with debugging, and enable reasoning about performance issues are invaluable in such a complex system and environment.
The basic idea behind request tracing is relatively straightforward: specific inflexion points must be identified within a system, application, network, and middleware -- or indeed any point on a path of a (typically user-initiated) request -- and instrumented. These points are of particular interest as they typically represent forks in execution flow, such as the parallelization of processing using multiple threads, a computation being made asynchronously, or an out-of-process network call being made. All of the independently generated trace data must be collected, coordinated and collated to provide a meaningful view of a request’s flow through the system. Cindy Sridharan has provided a very useful guide that explores the fundamentals of request tracing, and also places this technique in the context of the other two pillars of modern monitoring and “observability”: logging and metrics collection.
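As a concrete sketch of what instrumenting one of these inflexion points can look like, the following Java example wraps an outbound HTTP call in a span using the OpenTracing API and injects the trace context into the request headers. It assumes an OpenTracing-compatible tracer (for example a Zipkin or Jaeger implementation) has already been registered with GlobalTracer; the httpGet helper and the service URL are purely illustrative, and the class names follow the 0.31-era OpenTracing Java API, which differ slightly in later versions.

```java
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.propagation.Format;
import io.opentracing.propagation.TextMapInjectAdapter;
import io.opentracing.util.GlobalTracer;

import java.util.HashMap;
import java.util.Map;

public class CatalogueClient {

    // Assumes a concrete tracer (Zipkin, Jaeger, etc.) was registered at start-up,
    // e.g. via GlobalTracer.register(tracer); otherwise a no-op tracer is returned.
    private final Tracer tracer = GlobalTracer.get();

    public String fetchProduct(String productId) {
        String url = "http://catalogue/products/" + productId; // hypothetical downstream service

        // Start a span around the out-of-process call, one of the "inflexion points"
        // described above, using the OpenTracing standard span.kind/http.url tag keys.
        Span span = tracer.buildSpan("fetch-product")
                .withTag("span.kind", "client")
                .withTag("http.url", url)
                .start();
        try {
            // Inject the trace context into the outbound HTTP headers so that the
            // downstream service can join its spans to this trace.
            Map<String, String> headers = new HashMap<>();
            tracer.inject(span.context(), Format.Builtin.HTTP_HEADERS,
                    new TextMapInjectAdapter(headers));

            return httpGet(url, headers);
        } finally {
            // The finished span is handed to the tracer, which reports it out of band.
            span.finish();
        }
    }

    private String httpGet(String url, Map<String, String> headers) {
        // Placeholder for a real HTTP client (OkHttp, HttpURLConnection, etc.).
        return "...";
    }
}
```

The important point is that the instrumentation marks the boundary where the request leaves the process, and makes the trace context available to whichever component receives the call.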
Decomposing a Trace
As defined by the Cloud Native Computing Foundation (CNCF) OpenTracing API project, a trace tells the story of a transaction or workflow as it propagates through a system. In OpenTracing and Dapper, a trace is a directed acyclic graph (DAG) of “spans”, which are also called segments within some tools, such as AWS X-Ray. Spans are named and timed operations that represent a contiguous segment of work in that trace. Additional contextual annotations (metadata, or “baggage”) can be added to a span by a component being instrumented -- for example, an application developer may use a tracing SDK to add arbitrary key-value items to a current span. It should be noted that adding annotation data is inherently intrusive: the component making the annotations must be aware of the presence of a tracing framework.
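For instance, an application developer with access to the current span might add annotations along the following lines (again a sketch against the OpenTracing Java API; the tag and baggage keys shown are illustrative rather than standard conventions):

```java
import io.opentracing.Span;
import io.opentracing.util.GlobalTracer;

public class CheckoutAnnotations {

    public void annotateCurrentSpan(String customerTier, long basketValuePence) {
        // activeSpan() returns the span currently in scope for this thread,
        // or null if this code path has not been instrumented.
        Span span = GlobalTracer.get().activeSpan();
        if (span == null) {
            return; // tracing not enabled on this path -- annotations are best effort
        }

        // Tags are key-value metadata recorded on this span only.
        span.setTag("customer.tier", customerTier);
        span.setTag("basket.value_pence", basketValuePence);

        // Baggage propagates with the trace context to all downstream spans, so it
        // should be used sparingly: it adds bytes to every outbound request.
        span.setBaggageItem("promotion.code", "BLACKFRIDAY");
    }
}
```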
Trace data is typically collected “out of band”: locally written trace files (generated via an agent or daemon) are pulled by a separate network process into a centralised store, in much the same fashion as currently occurs with log and metrics collection. Trace data is not added to the request itself, as this leaves the size and semantics of the request unchanged and means locally stored data can be pulled when it is convenient.
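A production tracer implementation handles this reporting itself, but the heavily simplified, hypothetical reporter below illustrates the pattern: finished span data is buffered in-process and shipped to a local agent on a background thread, so nothing is added to the request and the request path is never blocked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical, simplified out-of-band reporter: finished span data is buffered
// in-process and forwarded to a local agent/collector on a background thread.
public class OutOfBandSpanReporter {

    private final BlockingQueue<String> finishedSpans = new LinkedBlockingQueue<>(10_000);

    /** Called by the tracer when a span finishes; never blocks the request path. */
    public void report(String encodedSpan) {
        finishedSpans.offer(encodedSpan); // drop the span if the buffer is full
    }

    public void start() {
        Thread flusher = new Thread(() -> {
            List<String> batch = new ArrayList<>();
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    batch.add(finishedSpans.take());  // wait for at least one span
                    finishedSpans.drainTo(batch, 99); // then batch up to 100
                    sendToLocalAgent(batch);          // e.g. UDP/HTTP to a local daemon
                    batch.clear();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "span-reporter");
        flusher.setDaemon(true);
        flusher.start();
    }

    private void sendToLocalAgent(List<String> batch) {
        // Placeholder: a real reporter would serialise and send these to an agent,
        // which forwards them to the centralised trace store.
    }
}
```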
When a request is initiated a “parent” span is generated, which in turn can have causal and temporal relationships with “child” spans. Figure 1, taken from the OpenTracing documentation, shows a common visualisation of a series of spans and their relationship within a request flow. This type of visualisation adds the context of time, the hierarchy of the services involved, and the serial or parallel nature of the process/task execution. This view helps to highlight the system's critical path, and can provide a starting point for identifying bottlenecks or areas to improve. Many distributed tracing systems also provide an API or UI to allow further “drill down” into the details of each span.
Figure 1. Visualising a basic trace with a series of spans over the lifetime of a request (image taken from the OpenTracing documentation)
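On the receiving side of a call such as those shown in Figure 1, the downstream service (or the framework or proxy acting on its behalf) extracts the propagated context and starts a child span; it is this step that creates the parent/child edges in the trace DAG. The sketch below again uses the 0.31-era OpenTracing Java API, with illustrative operation and method names:

```java
import io.opentracing.Span;
import io.opentracing.SpanContext;
import io.opentracing.Tracer;
import io.opentracing.propagation.Format;
import io.opentracing.propagation.TextMapExtractAdapter;
import io.opentracing.util.GlobalTracer;

import java.util.Map;

public class BillingRequestHandler {

    private final Tracer tracer = GlobalTracer.get();

    public void handle(Map<String, String> httpHeaders) {
        // Recover the caller's trace context from the inbound request headers.
        SpanContext parentContext =
                tracer.extract(Format.Builtin.HTTP_HEADERS, new TextMapExtractAdapter(httpHeaders));

        // Start a span that is a child of the caller's span -- this creates the
        // parent/child edge shown in trace visualisations such as Figure 1.
        Span span = tracer.buildSpan("charge-customer")
                .asChildOf(parentContext) // a null parent simply starts a new trace
                .start();
        try {
            chargeCustomer();
        } finally {
            span.finish();
        }
    }

    private void chargeCustomer() {
        // Business logic for this service.
    }
}
```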
The Challenges of Implementing Distributed Tracing
Historically it has been challenging to implement request tracing with a heterogeneous distributed system. For example, a microservices architecture implemented using multiple programming languages may not share a common point of instrumentation. Both Google and Twitter were able to implement tracing by creating Dapper and Zipkin (respectively) with relative ease because the majority of their inter-process (inter-service) communication occurred via a homogeneous RPC framework -- Google had created Stubby (a variant of which has been released as the open source gRPC) and Twitter had created Finagle.
The Dapper paper makes clear that the value of tracing is only realised through (1) ubiquitous deployment -- i.e. no parts of the system under observation are left uninstrumented, or “dark” -- and (2) continuous monitoring -- i.e. the system must be monitored constantly, as unusual events of interest are often difficult to reproduce.
The rise in popularity of “service mesh” network proxies like Envoy, Linkerd, and Conduit (and associated control planes like Istio) may facilitate the adoption of tracing within heterogeneous distributed systems, as they can provide the missing common point of instrumentation. Sridharan discusses this concept further in her Medium post discussing observability:
“Lyft famously got tracing support for all of their applications without changing a single line of code by adopting the service mesh pattern [using their Envoy proxy]. Service meshes help with the DRYing of observability by implementing tracing and stats collections at the mesh level, which allows one to treat individual services as blackboxes but still get incredible observability onto the mesh as a whole.”
The Need For Speed: Request Tracing and APM
Web page load speed can dramatically affect user behaviour and conversion. Google ran a latency experiment using its search engine and discovered that adding a 100 to 400 ms delay to the display of the results page had a measurable impact on the number of searches a user ran. Greg Linden commented in 2006 that experiments run by Amazon.com demonstrated a significant drop in revenue when 100 ms of delay was added to page load. Although understanding the flow of a web request through a system can be challenging, there can be significant commercial gains if performance bottlenecks are identified and eliminated.
Request tracing is similar in concept to Application Performance Management (APM) -- both are related to the monitoring and management of performance and availability of software applications. APM aims to detect and diagnose complex application performance problems to maintain an expected Service Level Agreement (SLA). As modern software architectures have become increasingly distributed, APM tooling has adapted to monitor (and visualise) this. Figure 2 shows a visualisation from the open source Pinpoint APM tool, and similar views can be found in commercial tooling like Dynatrace APM and New Relic APM.
Figure 2. Request tracing within modern APM tooling (image taken from the Pinpoint APM GitHub repository)
An emerging challenge within the request tracing and APM space is processing the volume of data generated by increasingly large-scale systems. As stated by Adrian Cockcroft, VP of Cloud Architecture Strategy at AWS, public cloud may have democratised access to powerful and scalable infrastructure and services, but monitoring systems must be more available (and more scalable) than the systems that they are monitoring. Google overcame this issue when implementing Dapper by sampling traces, typically 1 in 1000, and still found that meaningful insight could be generated at this rate. Many engineers and thought leaders working within the space -- including Charity Majors, CEO of Honeycomb, an observability platform -- believe that sampling of monitoring data is essential:
“It’s this simple: if you don’t sample, you don’t scale. If you think this is even a controversial statement, you have never dealt with observability at scale OR you have done it wastefully and poorly.”
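The sampling that Dapper describes is “head-based”: the keep-or-drop decision is made once, when the root span of a trace is created, and is then propagated with the trace context so that a trace is either recorded in full or not at all. The sketch below illustrates the idea only; it is not the sampler API of any particular tracer.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative head-based sampler in the spirit of Dapper's "1 in 1000" approach:
// the decision is made once at the root of the trace and carried in the propagated
// context, so every service keeps or drops the same trace consistently.
public class ProbabilisticSampler {

    private final double sampleRate;

    public ProbabilisticSampler(double sampleRate) {
        this.sampleRate = sampleRate; // e.g. 0.001 for 1 in 1000 traces
    }

    /** Called only when a new trace is started (no parent context present). */
    public boolean sampleNewTrace() {
        return ThreadLocalRandom.current().nextDouble() < sampleRate;
    }

    /** Downstream services simply honour the decision made at the root. */
    public boolean shouldRecord(Boolean sampledFlagFromParentContext) {
        if (sampledFlagFromParentContext != null) {
            return sampledFlagFromParentContext;
        }
        return sampleNewTrace();
    }
}
```

With a rate of 0.001 only around one trace in a thousand is stored centrally; this is the trade-off that products claiming to analyse 100% of requests, discussed below, aim to avoid.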
InfoQ recently attended the CNCF CloudNativeCon in Austin, USA, and sat down with Ben Sigelman, one of the authors of the original Dapper paper and CEO and co-founder of LightStep, who has recently announced a new commercial tracing platform, “LightStep [x]PM”. Sigelman explained that LightStep’s unconventional architecture (which utilises machine learning techniques within locally installed agents) allows the analysis of 100% of transaction data, rather than the 0.01% sampled with Dapper:
“What we built was (and is still) essential for long-term performance analysis, but in order to contend with the scale of the systems being monitored, Dapper only centrally recorded 0.01% of the performance data; this meant that it was challenging to apply to certain use cases, such as real-time incident response (i.e., “most firefighting”).”
LightStep have worked with a number of customers over the past 18 months -- including Lyft (utilising the Envoy proxy as an integration point), Twilio, GitHub, and DigitalOcean -- and have demonstrated that their solution is capable of handling high volumes of data:
“Lyft sends us a vast amount of data – LightStep analyzes 100,000,000,000 microservice calls every day. At first glance, that data is all noise and no signal: overwhelming and uncorrelated. Yet by considering the entirety of it, LightStep can measure how performance affects different aspects of Lyft’s business, then explain issues and anomalies using end-to-end traces that extend from their mobile apps to the bottom of their microservices stack.”
LightStep [x]PM is currently available as a SaaS platform, and Sigelman was keen to stress that although 100% of requests can be analysed, not all of this data is exfiltrated from the locally installed agents to the centralised platform. Sigelman sees this product as a “new age APM” tool, which will provide value to customers looking for performance monitoring and automated root cause analysis of complex distributed applications.
Conclusion
Response latency within distributed systems can have a significant commercial impact, but understanding the flow of a request through a complex system and identifying bottlenecks can be challenging. The use of distributed tracing -- in combination with other techniques such as logging and metrics collection -- can provide insight into distributed applications like those created using the microservices architecture pattern. Open standards and tooling are emerging within the distributed tracing space -- like the OpenTracing API and OpenZipkin -- and commercial tooling that potentially competes with existing APM offerings is also appearing. There are several challenges with implementing distributed tracing for modern Internet services, such as processing the high volume of trace data and generating meaningful insight, but both the open source ecosystem and vendors are rising to the challenge.
About the Author
Daniel Bryant is leading change within organisations and technology. His current work includes enabling agility within organisations by introducing better requirement gathering and planning techniques, focusing on the relevance of architecture within agile development, and facilitating continuous integration/delivery. Daniel’s current technical expertise focuses on ‘DevOps’ tooling, cloud/container platforms and microservice implementations. He is also a leader within the London Java Community (LJC), contributes to several open source projects, writes for well-known technical websites such as InfoQ, DZone and Voxxed, and regularly presents at international conferences such as QCon, JavaOne and Devoxx.