Key Takeaways
- Treat argues that as systems become more distributed, more elastic, and more complex, a shift towards observability is required because dashboards and predefined questions are no longer sufficient.
- While monolithic applications can be as complex as microservice-based architectures, microservices tend to shift the complexity from code to operations and architecture.
- When introducing observability practices, Treat finds the two most common mistakes are chasing tooling and trying to implement a "single pane of glass".
- Observability has to start with culture. A culture of observability is promoted by treating instrumentation as a first-class concern, and teams must own and be responsible for the operation of their services.
- One potential evolution for traditional operations teams is to move into Developer Enablement: applying a product mindset to operations in order to provide tooling and services that improve the developer experience.
In his latest article on Microservice Observability, Tyler Treat, managing partner at Real Kinetic, attempts to disambiguate the concepts of Observability and Monitoring. In the article, he discusses how static systems tend to exist in one of two states: up or down. This makes monitoring easy, as tooling can simply report on that state. Complex systems, however, can exist in any number of states and therefore require a more discovery-based approach, one that no longer relies on predefined questions and dashboards.
He highlights that a core difference between monitoring and observability is one of post-hoc versus ad-hoc analysis. With monitoring, the tendency is to answer predefined questions; these tend to be the known-unknowns that we already know how to look for. In his definition, observability is required wherever we do not have enough data available to formulate predefined questions. This puts us into an unknown-unknown state in which discovery is the primary approach.
InfoQ recently sat down with Treat to discuss the topics of observability and monitoring.
InfoQ: Why do you think monitoring and observability are becoming conflated in our discussions on this topic?
Tyler Treat: There are a few factors. First, monitoring and observability are closely related. Both are important and necessary for operating large-scale systems. Dashboards and predefined questions are still a key part of that. What has changed is that our systems have become more distributed, more elastic, and more complex. This means that those dashboards and predefined questions are no longer sufficient - thus the rise of observability. So the second factor is simply the fact that observability is a relatively new concept that has emerged.
Finally, this is still an early and evolving space ("space" being used in a very broad sense here in reference to cloud-native systems, microservices, DevOps, and other related ideas). Unlike in many other engineering disciplines, there is nothing scientific about the concepts of monitoring and observability as they are typically applied to most software systems. We’re making it up as we go. There is theory we can lean on from other disciplines, but that hasn’t really happened. It’s no wonder ideas get conflated - they aren’t really rigorously defined in the first place!
InfoQ: You talk about Static Monolithic Architectures being a fairly well-understood problem with respect to monitoring, and how with Elastic Microservice Architectures traditional solutions are not sufficient. Is the monolith-versus-microservice distinction a necessary component? It appears instead that the shift towards observability is an emergent property of complex systems, one that tends to arise with more elastic, fault-tolerant architectures.
Treat: I think you hit the nail on the head. The shift isn’t so much about static-monolithic versus elastic-microservice as it is about differing levels of complexity. You can have complex monoliths just like you can have complex microservice architectures, so why has the latter brought about this shift toward observability? The reason is that while monoliths can have internal complexity, microservices bring that complexity to the surface. It shifts from code complexity to operations and architecture complexity, and that’s a whole different type of challenge.
InfoQ: What are common mistakes you see organizations making as they begin introducing observability practices into their systems? Can you recommend any approaches to avoid these?
Treat: A common misstep I see is companies chasing tooling in hopes that it will solve all of their problems. "If we get just one more tool, things will get better." Similarly, seeking a "single pane of glass" is usually a fool’s errand. In reality, what the tools do is provide different lenses through which to view things. The composite of these is what matters, and there isn’t a single tool that solves all problems. But while tools are valuable, they aren’t the end of the story.
As with most things, it starts with culture. You have to promote a culture of observability. If teams aren’t treating instrumentation as a first-class concern in their systems, no amount of tooling will help. Worse yet, if teams aren’t actually on-call for the systems they ship to production, there is no incentive for them to instrument at all. This leads to another common mistake, which is organizations simply renaming an Operations team to an Observability team. This is akin to renaming your Ops engineers to DevOps engineers and thinking it will flip some switch. There needs to be a culture of ownership and responsibility - that’s really all DevOps is about - but changing culture is hard.
However, that culture of ownership and responsibility often causes a pendulum swing too far in the other direction. I’ve seen teams given SSH access to production servers in the name of DevOps. After all, if the team is on the hook, they need free rein, right? This is a dangerous cultural norm for a number of reasons. For one, security- and compliance-minded folks would shudder at the notion, and rightfully so. Even SSH access to staging and demo environments can be dangerous as it relates to observability. This is because it gives developers a shortcut when debugging or identifying a problem.
If you can always SSH into the box, attach a debugger, or directly query the database to track down an issue, you’re going to be less incentivized to properly instrument your system or build support tooling for it. If you’re conditioned to solve problems this way in pre-production, what happens when something goes wrong in production where you can’t rely on the same techniques? Your operations instincts atrophy because you rely on a crutch. This is a phenomenon I call "pain-driven development," and it can be dangerous if left unchecked.
As a result, one practice I encourage is chaos testing or "gameday exercises." The value of these exercises isn’t just to see how your system behaves in bad weather, but also to identify gaps in your monitoring and observability and to develop your operations instincts in a safe environment.
InfoQ: In your Observability/Monitoring spectrum model, you call out that Monitoring is about Hypotheses and Observability is about Discoveries. Where do you see the other two categories you defined, Assumptions and Facts, fitting into this?
Treat: Assumptions and Facts are what inform our monitoring and observability practices. For example, let’s say we have a runtime with a memory limitation of 512MB. This is a known known or a "fact" using the mental model I described. This known known informs how we might monitor memory utilization for the system, such as alerting when we’re within 90% of our max memory. Similarly, I think the unknown knowns or "assumptions" are important to be cognizant of because they can bias our thinking, such as preventing us from exploring a certain avenue when debugging a problem or monitoring a particular behavior.
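For illustration, here is a minimal sketch in Go of how such a known known might translate into a check: the 512MB limit and 90% threshold come from Treat's example, while the function and thresholds are hypothetical. In practice this value would normally be exported as a metric and the alert evaluated by the monitoring system rather than in application code.

```go
package main

import (
	"fmt"
	"runtime"
)

// memoryLimitBytes is the known runtime limit from the example: 512MB.
const memoryLimitBytes = 512 * 1024 * 1024

// alertThreshold is the fraction of the limit at which we want to alert.
const alertThreshold = 0.9

// checkMemory reads current heap usage and reports whether it is within
// 90% of the known limit. A real system would export this as a metric
// and let the monitoring system evaluate the alert.
func checkMemory() (used uint64, alert bool) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	used = m.Alloc // bytes of allocated heap objects
	alert = float64(used) >= alertThreshold*float64(memoryLimitBytes)
	return used, alert
}

func main() {
	used, alert := checkMemory()
	fmt.Printf("heap in use: %d bytes (limit %d)\n", used, memoryLimitBytes)
	if alert {
		fmt.Println("ALERT: memory utilization above 90% of known limit")
	}
}
```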
InfoQ: The transition from monolithic to a more microservice architecture is an undertaking many companies are tackling. At what point during this transformation do you recommend organizations begin creating their observability pipeline?
Treat: From the start. That needs some qualifying, however. An observability pipeline should be an evolutionary or iterative process. You shouldn’t waste time building out a sophisticated pipeline early on; you should be focused on delivering value to your customers.
Instead, start small with items that add immediate value to the observability of your systems. Something you can begin doing today that adds a ton of value with minimal lift is adopting structured logging. Another high-leverage thing is passing a context object throughout your service calls to propagate request metadata, which can be logged and correlated. Next, move the log collection out of process using something like Fluentd or Logstash. If you’re not already, use a centralized logging system - Splunk, Elasticsearch, Sumo Logic, Graylog - there are a bunch of options here, both open source and commercial, SaaS or self-managed. With the out-of-process collector, you can then introduce a streaming data pipeline to decouple log producers from consumers. Again, there are managed options like Amazon Kinesis or Google Cloud Pub/Sub and self-managed ones like Apache Kafka. With this, you can now add a variety of consumers and log sinks. At this point, you can start to unify the collection of other instrumentation such as metrics and traces.
We’re starting to see the commercialization of this idea with products like Cribl, and I think this will only continue as people begin to hit the limitations of traditional APM tools in a microservice environment.
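To make the first two suggestions in that answer concrete, the following Go sketch combines structured (JSON) logging with a context object that propagates a request ID via an HTTP header. The X-Request-ID header and the field names are illustrative assumptions; an out-of-process collector such as Fluentd or Logstash would then ship these JSON lines to a centralized system.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"log/slog"
	"net/http"
	"os"
)

type ctxKey string

const requestIDKey ctxKey = "request_id"

// logger emits structured JSON log lines that a collector can ship
// to a centralized logging system without custom parsing.
var logger = slog.New(slog.NewJSONHandler(os.Stdout, nil))

// withRequestID is middleware that reads an incoming X-Request-ID header
// (or generates one), stores it on the request context, and echoes it back.
func withRequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Request-ID")
		if id == "" {
			b := make([]byte, 8)
			rand.Read(b)
			id = hex.EncodeToString(b)
		}
		ctx := context.WithValue(r.Context(), requestIDKey, id)
		w.Header().Set("X-Request-ID", id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

func handler(w http.ResponseWriter, r *http.Request) {
	id, _ := r.Context().Value(requestIDKey).(string)
	// Every log line carries the request ID so calls can be correlated
	// across services that forward the same header downstream.
	logger.Info("handling request",
		"request_id", id,
		"method", r.Method,
		"path", r.URL.Path,
	)
	w.Write([]byte("ok\n"))
}

func main() {
	http.Handle("/", withRequestID(http.HandlerFunc(handler)))
	logger.Info("listening", "addr", ":8080")
	http.ListenAndServe(":8080", nil)
}
```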
InfoQ: You mention that with an elastic microservice architecture, the system is potentially in "one of n-factorial states" and that "Integration testing can’t possibly account for all of these combinations." What role do you see for integration testing in this new world?
Treat: Testing strategies start to change. Integration tests can still have value in this new world, but in more limited capacities like smoke testing. I’ve worked on large, complex systems that had extensive integration tests that took hours to run and were aggravatingly flaky. The problem with tests is that they accumulate over time. The typical sequence of events is: something bad happens, a fix is introduced, and a new test is added. Rinse and repeat. These tests accumulate, but as time goes on, they actually become less and less relevant. Interestingly, the same thing happens with monitoring and dashboards. Dashboards are operational scar tissue. It’s important to periodically reevaluate the purpose and value of them - the same is true of tests (and organizational processes!).
With microservices, I think contract testing starts to become more important and, in particular, consumer-driven contract testing. This tends to be a much more scalable and reliable approach to testing large-scale, distributed systems.
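As a rough sketch of the consumer-driven idea, the following Go test encodes only what a hypothetical consumer relies on from a provider's response and verifies it against the provider's handler. In practice a tool such as Pact would generate and share these contracts between teams; the endpoint and field names here are illustrative assumptions.

```go
package contract_test

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"testing"
)

// ordersHandler stands in for the provider's real handler; in a real setup
// the provider team runs this verification in their own pipeline.
func ordersHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]any{
		"id":     "1234",
		"status": "shipped",
		"total":  42.5,
	})
}

// TestOrderContract captures what the consumer actually depends on:
// a 200 response containing the "id" and "status" fields. Anything else
// the provider returns is free to change without breaking this consumer.
func TestOrderContract(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(ordersHandler))
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/orders/1234")
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		t.Fatalf("expected 200, got %d", resp.StatusCode)
	}

	var body map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		t.Fatal(err)
	}
	for _, field := range []string{"id", "status"} {
		if _, ok := body[field]; !ok {
			t.Errorf("response missing field %q required by consumer", field)
		}
	}
}
```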
InfoQ: In a previous article you advise that while a single Ops team may not have enough context to troubleshoot distributed systems, these groups should be responsible for providing the tools and data teams need to operate their systems. Isn’t there a challenge when a centralized team lacks the context or specialization to build the right tooling or to know which data may be required?
Treat: This is a great question because it gets at an important notion: what is the future of Operations? Today, organizations adopting DevOps practices are faced with the challenge of balancing developer autonomy and empowerment with chaos and duplication of effort, among a number of other concerns. Centralized Operations provides benefits like specialization of roles and standard patterns around things like reliability, security, and disaster recovery. Now developers are being given the freedom to take matters into their own hands. But for many of them, these non-functional requirements are an afterthought - something Ops or security folks do - or worse, they aren’t even aware of them. At the same time, centralized Ops no doubt creates an innovation and delivery bottleneck in addition to misaligned incentives in terms of software operations. So how do we reconcile the two?
DevOps can be implemented in many different ways. For example, Google does SRE. But there is no "right" way to do it - every company is different. You have to take ideas, understand the context, and apply them to your own situation (or throw them out) as appropriate. One idea I have found consistently effective, however, is applying a product mindset to operations. This is what I’ve come to call Developer Enablement. As I hinted at in the post you mentioned, this is the idea of enabling teams to deliver business value by providing them with tools, automation, standards, and APIs that codify common patterns and best practices. How does this differ from traditional centralized Operations and how do we build the right tooling? This is where the product mindset comes into play.
First, building products is intended to empower customers - in this case, development teams - to deliver value with minimal external dependencies while allowing the Developer Enablement team to get out of the way. Traditional Operations, on the other hand, is normally directly in the critical path. For example, you file a ticket and some nameless Operations person carries out the work in a black box. This becomes a bottleneck and source of friction.
Second, customer experience is an important aspect of the product mindset. This means building products that developers will not only want to use, but would miss if they could no longer use them. This requires working closely with teams during the discovery and development of products, and receiving constant feedback from them to understand what’s working well and to identify areas that can be improved. This ultimately results in improvements to the overall developer experience. This should be an iterative process and is very similar to how normal customer-facing products are built.
Lastly, knowing what to build - or more importantly what not to build - is an essential part of the product mindset. Instead of trying to build products to solve what may be perceived as a general problem or trying to build a gold-plated solution, Developer Enablement teams should work with product teams that have a specific need and either build or evolve existing products to meet the team’s needs. By only building products that solve for a specific need and evolving them over time when new, common needs are identified, the Developer Enablement team is able to focus on what is essential and useful in order to provide immediate value to teams. Keep in mind that the solution might also be to buy rather than build.
InfoQ: What major trends in observability do you see in the next few years?
Treat: I think we will see the big monitoring players attempt to pivot their offerings towards observability. Initially, this has come in the form of rebranding and adjusting marketing language, but I think there are specific capabilities needed to fully implement observability, such as arbitrarily-wide structured events, high-cardinality dimensions without the need for indexes or schemas, and shared context propagated between services in the request path, to name a few. Honeycomb has been a true leader in enabling observability in this space. I think others will be playing catch up as people start to really understand what observability means, why it’s valuable, and how it’s different from monitoring.
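To illustrate what an arbitrarily-wide structured event might look like, here is a minimal Go sketch that accumulates context over the life of a request into a single event with high-cardinality fields (user and request IDs) and emits it once at the end. The field names and service are illustrative assumptions; an observability backend would ingest and query events of this shape without a fixed schema.

```go
package main

import (
	"encoding/json"
	"os"
	"time"
)

// Event is one wide, structured record per unit of work. Fields can be
// added freely, including high-cardinality ones like user and request IDs.
type Event map[string]any

// Emit writes the event as a single JSON line for a collector to pick up.
func (e Event) Emit() {
	json.NewEncoder(os.Stdout).Encode(e)
}

func handleRequest(userID, requestID string) {
	start := time.Now()
	ev := Event{
		"service":    "checkout",
		"request_id": requestID, // high-cardinality dimension
		"user_id":    userID,    // high-cardinality dimension
		"endpoint":   "/orders",
	}

	// ... do the actual work, enriching the event as context is learned ...
	ev["cart_items"] = 3
	ev["payment_provider"] = "stripe"

	ev["duration_ms"] = time.Since(start).Milliseconds()
	ev.Emit() // one wide event per request, rather than many narrow log lines
}

func main() {
	handleRequest("user-8675309", "req-c0ffee")
}
```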
InfoQ: When can we expect part two? Any spoilers on what to expect?
Treat: I’ve started on it but it’s been slow going due to consulting commitments with Real Kinetic. Hopefully before the end of the year. It will provide a more concrete look at observability as well as an observability pipeline in practice. Stay tuned!
About the Interviewee
Tyler Treat is a Managing Partner at Real Kinetic where he helps companies build cloud software and ship confidently. As an engineer, he’s interested in distributed systems, messaging infrastructure, and resilience engineering. As a technical leader, he’s interested in building effective teams and software organizations. Tyler is also a frequent open-source contributor and avid blogger at bravenewgeek.com.