At KubeCon NA, held in Seattle, USA, in December 2018, Ben Sigelman presented "Three Pillars, Zero Answers: We Need to Rethink Observability" and argued that many organisations may need to rethink their approach to metrics, logging and distributed tracing. Setting business-level goals and requirements in relation to observability is vitally important, as is being clear about the strengths, weaknesses and tradeoffs of any tooling solution. Observability tooling should help detect issues that impact the user experience, and should also help refine the search space in which the associated problem or root cause lurks.
Sigelman, CEO at LightStep and ex-Googler, began the talk with a critique of conventional wisdom around observability. Although it is generally accepted that observing microservice-based systems is challenging, many people believe that Bay Area "unicorn" organisations, like Google and Facebook, have already solved this problem. From their publicly discussed approaches and solutions, a concept referred to as the "three pillars of observability" has emerged, consisting of metrics, logging and distributed tracing. Sigelman mused that the prevailing wisdom is that if the implementation of this concept worked for these organisations, then others will benefit from adopting the approach too. However, a common mistake engineers make is to look at each of the pillars in isolation, and then simply evaluate and implement individual solutions to address each one. Instead, a holistic approach must be taken.
In response to this challenge, a number of open source and commercial observability tools have emerged over the past few years. Each tool typically prioritises one approach over another, and potentially emphasises one or two of the pillars over the others. Accordingly, each solution has associated strengths and weaknesses, and we need to be aware of this when evaluating tools. Ideally we would have some kind of scorecard that enables us to effectively critique tools and choose the one most appropriate to our requirements. Sigelman suggested that creators and maintainers of tools are generally reluctant to discuss their weaknesses at conferences, in contrast to his time working at Google:
"No one is trying to sell anything else to anyone at Google, so you would willingly represent your flaws with a service. You would build something, and say "this is what's good about it" and "this is what's not good about it" [...] Every system has a tradeoff"
Next in the talk, each pillar's "fatal flaw" was explored. The big challenge associated with metrics is dealing with high cardinality (which Charity Majors, CEO of Honeycomb, has also discussed in multiple articles, including on InfoQ). Graphing metrics often provides visibility that allows humans to understand that something is going wrong, and the associated metric can then often be explored further by diving deeper into the data via an associated metadata tag, e.g. user ID, transaction ID or geolocation. However, many of these tags have high cardinality, which presents challenges for querying data efficiently.
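To make the cardinality problem concrete, the sketch below (illustrative rather than taken from the talk, assuming the Python prometheus_client library and a hypothetical checkout_requests_total metric) shows how tagging a metric with a per-user identifier multiplies the number of time series the backend must store and query:

```python
# Illustrative only: a Prometheus-style counter tagged with a high-cardinality
# label. Every distinct user_id value creates a new time series, so a service
# with millions of users produces millions of series for this single metric.
from prometheus_client import Counter

REQUESTS = Counter(
    "checkout_requests_total",   # hypothetical metric name
    "Checkout requests, by user",
    ["user_id"],                 # high-cardinality label
)

def handle_checkout(user_id: str) -> None:
    # Each previously unseen user_id adds another series to store and query.
    REQUESTS.labels(user_id=user_id).inc()
```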
The primary challenge with logging is the volume of data collected. Within a system based on a microservices architecture, the volume of logging data is typically the product of the number of services and the transaction rate. The total cost of maintaining the ability to query this data can be estimated by further multiplying that volume by the cost of networking and storage, and again by the number of weeks of retention required. For a large-scale system, this cost can be prohibitive.
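A back-of-the-envelope sketch of this multiplication, using entirely hypothetical numbers, shows how quickly the terms compound:

```python
# Rough log-retention cost estimate; all figures are hypothetical assumptions.
services = 200                 # microservices emitting logs
transactions_per_sec = 2_000   # system-wide transaction rate
bytes_per_log_line = 500       # average size of one log entry
cost_per_gb_per_week = 0.05    # assumed networking + storage cost, in USD
retention_weeks = 4

# ~1 log line per service per transaction
log_lines_per_sec = services * transactions_per_sec
gb_per_week = log_lines_per_sec * bytes_per_log_line * 60 * 60 * 24 * 7 / 1e9
total_cost = gb_per_week * cost_per_gb_per_week * retention_weeks

print(f"{gb_per_week:,.0f} GB ingested per week, "
      f"~${total_cost:,.0f} for {retention_weeks} weeks of retention")
```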
In regard to distributed tracing, a core challenge here is minimising the overhead during collection. Creating a trace, propagating this, and adding and storing additional metadata ("baggage") on the trace impacts the application being instrumented, as well as the associated networking and storage devices. The typical approach to mitigate this is to sample the traces, for example only instrumenting one in one thousand requests.
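A minimal sketch of what such head-based sampling might look like, assuming a hypothetical should_trace helper rather than any particular tracing library:

```python
import random

SAMPLE_RATE = 1 / 1000   # record roughly one in every thousand requests

def should_trace() -> bool:
    """Simple probabilistic, head-based sampling decision.

    Real tracers (e.g. OpenTelemetry SDKs) usually derive this decision
    deterministically from the trace ID so that every service in a request
    agrees on it; random sampling here just keeps the sketch short.
    """
    return random.random() < SAMPLE_RATE

def handle_request(request):
    if should_trace():
        # start a span, propagate context, attach baggage, record the trace ...
        pass
    # handle the request as normal either way
```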
Sigelman shifted gears at this point in the talk, and introduced part two of the presentation: "A New Scorecard for Observability". He began with several definitions. First, a service level indicator (SLI) is an indicator of some aspect of "health" that a service's consumers would care about. Second, the mental model we use to improve SLIs consists of goals and activities: goals describe how our services should perform in the eyes of their consumers, and activities are what we (as operators) actually do to further those goals.
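As a concrete (and purely illustrative) example of an SLI, the sketch below measures the fraction of requests a consumer would consider "good", i.e. successful and served within a hypothetical 300 ms latency threshold:

```python
# Illustrative SLI calculation over hypothetical request records.
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    ok: bool

def availability_latency_sli(requests: list[Request], threshold_ms: float = 300.0) -> float:
    # "Good" requests succeeded and stayed under the latency threshold.
    good = sum(1 for r in requests if r.ok and r.latency_ms <= threshold_ms)
    return good / len(requests) if requests else 1.0

requests = [Request(120, True), Request(450, True), Request(90, False), Request(210, True)]
print(f"SLI: {availability_latency_sli(requests):.2%}")   # 50.00% of requests were "good"
```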
There are two fundamental goals with observability: gradually improving an SLI (potentially optimising it over days, weeks or months), and rapidly restoring an SLI (reacting immediately in response to an incident). There are two fundamental activities with observability: detection, which is the ability to measure SLIs precisely; and refinement, which is the ability to reduce the search space for plausible explanations of an issue. Current tooling is generally good at detection, but challenges remain with refinement. This is partly because, as more complexity is added to a system (and with microservice-based systems this typically means adding more services), the search space grows geometrically.
Sigelman believes the scorecard for detection should consist of the properties of "specificity, fidelity and freshness". Specificity covers the cost of cardinality (the $ cost per tag value) and which parts of the stack can be measured effectively, e.g. metrics from mobile/web platforms, managed services, and "black-box OSS infra" like Kafka or Cassandra. Fidelity covers the ability to employ correct statistical approaches, e.g. calculating a global p95 or p99, and a high metrics collection frequency, e.g. sampling every few seconds, in order to detect intermittent issues. Freshness is the lag, in seconds, between real-time collection and the analysis of the metrics.
The scorecard for refinement includes the properties of "identifying variance and explaining variance". Identifying variance was further subdivided into: cardinality (the $ cost per tag value); robust statistics, i.e. the ability to use visualisations like histograms to "go beyond percentiles" and allow refinement of latency; and retention horizons (time periods) that enable plausible queries for identifying "normal behaviour" over time. Explaining variance consists of using correct statistics, e.g. do not calculate a series of p99 values and average across them, but instead calculate the p99 over the aggregated data; and the ability to "suppress the messengers" of microservice failures, i.e. locating the source of an issue rather than the locations of issues caused by the initial problem cascading through the system.
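The point about averaging percentiles can be shown with a small, synthetic numerical experiment (the latency distributions below are purely illustrative):

```python
# Demonstrates why averaging per-service p99 values differs from the true
# global p99; the latency numbers are synthetic and purely illustrative.
import numpy as np

rng = np.random.default_rng(42)
service_a = rng.exponential(scale=50, size=10_000)    # mostly fast, high-traffic service
service_b = rng.exponential(scale=400, size=1_000)    # slower, lower-traffic service

average_of_p99s = np.mean([np.percentile(service_a, 99), np.percentile(service_b, 99)])
global_p99 = np.percentile(np.concatenate([service_a, service_b]), 99)

print(f"average of per-service p99s: {average_of_p99s:.0f} ms")
print(f"p99 of the combined data:    {global_p99:.0f} ms")
# The two values diverge because percentiles are not linear: they must be
# computed over the aggregated data (or merged histograms), not averaged.
```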
Wrapping up the talk, Sigelman proposed that when designing your own observability system you will have to make tradeoffs, and stated that he believes you can only choose three of the following four properties: high throughput within the application under observation, high cardinality (for querying and refinement), lengthy metric/log data retention windows, and unsampled data (by the observability tooling).
Cindy Sridharan, author of several highly referenced observability and monitoring blog posts, noted on Twitter that this is "Like the CAP theorem but for your Observability system". Charity Majors replied in the tweet thread, stating that the original conjecture was "Not even close to true. You can absolutely have all four, you just can't choose 'cheap'". Sigelman replied, "A few of my colleagues made the same point... s/observability system/positive-ROI observability system/g", which in essence clarifies that his original three-out-of-four conjecture referred to an observability system with a positive return on investment (ROI).
Concluding, Sigelman stated that he believes tracing data is typically the best data source we have for addressing the issues discussed, as "it is the superset of logging data [...], which is super-valuable for the suppression of hypotheses".
InfoQ spoke to Sigelman before his talk, and discussed his views on the current adoption of microservice architecture and the state of observability tooling. He suggested that microservices, while not suitable for every application or organisation, "are absolutely a good approach for solving a certain subset of problems when operating at scale". He also echoed the caution from his talk that adopting this architectural style does bring new challenges, particularly in relation to observability. As the late majority become aware of the concept of microservices, more and more organisations are choosing to build microservice-based systems, and this is why we are seeing the emergence of the next generation of monitoring and observability tooling.
When asked his opinion on the value of service graph visualisations, he responded by stating that a depiction of an entire microservice graph is probably only useful for new employees when they first start, and that generating a graph of a large system may also be impractical. Instead, what is more valuable is visualising subsets of this graph which highlight areas of interest and superimpose information about potential problems.
He also cautioned that simply showing, for example, latencies on such a service graph can be problematic (and potentially meaningless); the context for the latencies should also be provided (e.g. is only a specific geographic region or subset of users affected?), and most importantly, the visualisation should indicate whether the latency is on the critical path. One of the associated pieces of work being undertaken at LightStep is to automatically detect critical issues, such as latency on the critical path, and to provide insight (and context) into such issues for operators to explore and act upon.
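As an illustration of the idea (not LightStep's actual implementation), the simplified sketch below walks a tree of spans and, at each level, follows the child that finishes last; real systems must account for asynchronous and overlapping work, clock skew, and more:

```python
# Simplified critical-path heuristic over a hypothetical set of trace spans.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    start: float  # seconds
    end: float    # seconds

def critical_path(spans: list[Span]) -> list[str]:
    # Group spans by parent, then repeatedly follow the child that ends last.
    by_parent: dict[Optional[str], list[Span]] = {}
    for s in spans:
        by_parent.setdefault(s.parent_id, []).append(s)
    current = by_parent[None][0]          # assume a single root span
    path = [current.span_id]
    while by_parent.get(current.span_id):
        current = max(by_parent[current.span_id], key=lambda s: s.end)
        path.append(current.span_id)
    return path

spans = [
    Span("api", None, 0.00, 0.90),
    Span("auth", "api", 0.00, 0.10),
    Span("db", "api", 0.10, 0.85),    # finishes last, so it is on the critical path
    Span("cache", "api", 0.10, 0.20),
]
print(critical_path(spans))  # ['api', 'db']
```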
The video for the talk "Three Pillars, Zero Answers: We Need to Rethink Observability" can be found on the CNCF's YouTube channel.