Cindy Sridharan summarizes her thoughts on observability and its relevance in monitoring cloud native applications in her recent article. Observability is a philosophy that encompasses monitoring, log aggregation, metrics and distributed tracing to gain deeper, ad-hoc insights into a system.
Sridharan’s article is based on her Velocity talk on the same topic. The way we build systems has changed with the advent of microservices, cloud and containerized architectures. Applications in containerized architectures are distributed and ephemeral. The underlying infrastructure and networking services, which are provided by cloud and system software vendors, have become increasingly robust, leaving the application layers to catch up. Most failures in the future will come from the application layer or from complex interactions between different applications.
This complexity leads to challenges in gaining visibility into the state of the system. There are newer tools emerging but they are still in the formative stages. Sridharan explores the concept of observability as well as how to choose the right tool for gaining insights into today’s systems.
Observability is a term that has recently gained popularity in the monitoring community. However, it is not new, and there seems to be some disagreement about what it actually means. As Sridharan says in a previous article, observability can be considered a superset of monitoring. The Twitter engineering team’s articles summarize observability as monitoring, alerting/visualization, distributed systems tracing, and log aggregation and analytics. Google focuses on simplifying instrumentation, making data aggregation cheap, and standardizing formats and frameworks for tracing across the stack. The top-level abstraction for Google is context propagation, which is either provided as a library per language or built on the language’s own features. The context is used to propagate key-value pairs called tags across the stack, which can later be used to filter for specific requests.
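The idea of propagating tags through a context can be illustrated with a minimal sketch. It is not Google's implementation; it assumes Python's standard `contextvars` module as the language's built-in feature, and the names `with_tag`, `current_tags`, `storage_layer` and `handle_request` are hypothetical.

```python
import contextvars

# Hypothetical context variable that carries the tags of the current request.
_tags = contextvars.ContextVar("request_tags", default=None)


def current_tags():
    """Return a copy of the key-value tags attached to the current context."""
    return dict(_tags.get() or {})


def with_tag(key, value):
    """Attach one tag to the current context and return a token to undo it."""
    tags = current_tags()
    tags[key] = value
    return _tags.set(tags)


def storage_layer(records):
    # Deep in the stack: the tags are read from the context, not from arguments.
    records.append({"event": "read", **current_tags()})


def handle_request(user_id, records):
    token = with_tag("user_id", user_id)
    try:
        storage_layer(records)
    finally:
        # Restore the previous context so tags do not leak between requests.
        _tags.reset(token)


records = []
handle_request("alice", records)
handle_request("bob", records)

# Later, the propagated tag is used to pick out one user's requests.
alice_only = [r for r in records if r.get("user_id") == "alice"]
```

The lower layer never receives the tag as a parameter, which is the point of a propagated context: instrumentation deep in the stack can still label its data per request.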
Alerting and single-pane-of-glass dashboards are part of monitoring. According to Sridharan, observability is all of these plus profiling (of applications), debugging and dependency analysis. Observability, in contrast to monitoring, is about data exposure, finding answers to questions and easy access to information. Monitoring is about detecting failures and has well-defined failure paths. The ability to determine the actual cause is lost as the number of failure modes increases, which is a consequence of the increasingly complex architectures that are becoming the norm. There are different ways to define observability. For example, whitebox monitoring is a source of data, whereas observability can be seen as the ability to mine relevant information from the data.
Logging, metrics and request tracing are the mainstay of observability. Logs provide additional context to data like metrics. However, logs are also costly in terms of performance. In contrast, metrics have a constant overhead and are good for alerts. Taken together, logs and metrics are good for insights into individual systems, but make it difficult to follow the lifetime of a request that has traversed multiple systems, which is common in distributed systems. Tracing offers the ability to track a request as it moves through various systems. It is hard to introduce later, one reason being that third-party libraries used by the application also need to be instrumented. Traces are sampled to reduce overhead and storage costs. Sampling here means reducing the amount of information collected. Some best practices for logging include enforcing quotas and dynamically adjusting the rate of log generation.
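One common way to sample traces can be sketched as follows. This is an illustration under stated assumptions, not something taken from the article: the decision is derived from a hash of the trace ID, so every system that sees the same ID reaches the same keep-or-drop decision. The function name `should_sample`, the ID format and the 10% rate are hypothetical.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Keep roughly `rate` of all traces, decided only by the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Hypothetical trace IDs; a real system would use the IDs issued by its tracer.
trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = [t for t in trace_ids if should_sample(t, 0.1)]
```

Because the decision depends only on the ID, a trace is either recorded in full or not at all, which avoids storing fragments of a request's path.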
Recent developments in tracing include Google’s Dapper paper and the open source implementation inspired by it called Open Zipkin, which led to the OpenTracing standard. Tracing is made easier if the applications use a service mesh like the Envoy project. A service mesh is a network infrastructure layer that sits above the TCP/IP layer and handles reliable delivery of requests. It is sometimes implemented as a set of network proxies. It simplifies service communication in dynamic environments like container clusters orchestrated by Kubernetes.
In a cloud-native environment, software development and delivery can benefit from adopting practices like pre-production testing, testing in production, effective monitoring, exploration of raw data like metrics and log events, and dynamic instrumentation.