Key Takeaways
- The term observability covers a spectrum of information emitted by a system, ranging from detailed logs of events or traces to traditional monitoring events or aggregated stats.
- Aggregating logs from diverse components or services that make up a running system provides an excellent way to monitor, debug and understand modern software systems.
- Cloud and container technology provide a lot of advantages, but at the cost of “understandability” -- which is maybe the best way to view the term “observability”.
- Traditional monitoring often discards metrics over time. This works well if the system being monitored is mature and well understood; but this is usually not the case when you are building a system in the first place.
- As engineers get to know their logs, they develop metrics -- aggregate queries -- that are important for the health of the system. These are the queries that make up dashboards and the input data for alerting.
- For debugging or incident response, you need a system that makes it easy to do ad-hoc queries; it is important to have a logging solution that does not impose a schema on what you log.
- Logs naturally evolve from a verbose level towards a more structured level with a better signal-to-noise ratio. Neglecting to cultivate this evolution is an anti-pattern.
- The future impact of Artificial Intelligence (AI) and Machine Learning (ML) in the logging space will likely be big. For now, focus on getting logs to the fingertips of developers, letting them interact with the data and apply human intelligence to interpret it.
- AI/ML generally requires a baseline to be able to identify outliers, and as a system that generates logs stabilises, the logging platform will be able to provide this baseline.
InfoQ recently sat down with Kresten Krab, CTO at Humio, and discussed the role of logging within the overall topic of system observability. Krab began by stating that cloud and container technology provide a lot of advantages, but at the cost of “understandability” -- which is potentially the best way to view the term “observability”. The discussion covered many topics, but a key theme is that aggregating logs from diverse components or services that make up a running system provides an excellent way to monitor, debug and understand modern software systems.
The full transcript of the interview can be read below:
InfoQ: Hi Kresten, many thanks for speaking to InfoQ today. Could you introduce yourself, and also say a little about your current work at Humio please?
For the last two years I’ve been CTO at Humio, a startup that we launched to provide a better way for DevOps teams to understand their systems. 20 years ago I co-founded Trifork, which is now a 400+ employee bespoke software solution provider, where our mission has always been to help other companies succeed with new technology.
I’ve been involved with teams implementing new technology, training, and building conferences to spread the knowledge. As part of this we’ve seen the ever-increasing complexity of software systems being built all the way from the first web-enabling projects in the late 90s to today's complex cloud solutions, and I've seen the struggle in teams trying to understand, debug, and monitor their systems in production.
So we observed that aggregating logs from diverse components or services that make up a running system provides an excellent way to monitor, debug and understand these systems. At Humio, we refer to this as the ability to “feel the hum of your system”. Logs are a great “lowest common denominator” point of integration for understanding a system because logs are already there. You don’t need to augment existing systems to make them generate logs: they are already generating logs, and you just need to gather them and put them on a shared timeline of events.
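The "shared timeline" idea can be sketched in a few lines: given log streams from several services, each already sorted by timestamp, you can interleave them into one chronological view. This is a minimal illustration, not Humio's implementation; the service names and log lines are made up.

```python
import heapq
from datetime import datetime

# Hypothetical raw lines gathered from three services, each line starting
# with an ISO-8601 timestamp and each stream already sorted by time.
web_log = ["2024-05-01T10:00:01 web GET /checkout 200",
           "2024-05-01T10:00:04 web GET /cart 200"]
db_log = ["2024-05-01T10:00:02 db slow query: 1200ms"]
queue_log = ["2024-05-01T10:00:03 queue depth=42"]

def parse_ts(line: str) -> datetime:
    # The first whitespace-separated field is the timestamp.
    return datetime.fromisoformat(line.split()[0])

# heapq.merge lazily interleaves the sorted streams by timestamp,
# producing one shared timeline of events across all components.
timeline = list(heapq.merge(web_log, db_log, queue_log, key=parse_ts))
for line in timeline:
    print(line)
```

Reading the merged output, the slow database query now appears in context between the two web requests, which is exactly the cross-component view that centralised logging provides.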
We have found that existing providers of log management tools require you to limit your logging — whether due to cost, quota limitations, complexity or performance — and we thought we could do better. So you can say we’re on a mission to democratise logging. Humio is the product we’re building from the ground up to let everyone share this insight.
InfoQ: There has been some discussion recently about monitoring versus observability, and so how does logging relate to this topic?
We welcome this discussion very much. The term “observability” fits well with our mantra of “feeling the hum of your system”. I don’t think there is a “versus” discussion here; the term observability covers a spectrum of information emitted by a system. The spectrum goes from detailed logs of events or traces (being the most verbose and rich in information) to traditional monitoring events or aggregated stats. You can derive the stats from the events, but not the other way around. Cindy Sridharan’s blog post “Monitoring in the Time of Cloud Native” is an excellent read on this spectrum of information.
So, if you’re only gathering metrics as in the traditional way of monitoring, then you’re throwing information away. This works well if the system being monitored is mature and well understood; but that is usually not the case when you’re building a system in the first place. In other words: you often don’t know what will be causing calamities, so having a richer base to search is a huge advantage. It enables you to go back in time and search for patterns that you only now realise are important.
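The asymmetry described above is easy to demonstrate: from raw request events you can compute any metric after the fact, but the events can never be reconstructed from the aggregate. A small sketch with hypothetical events:

```python
from collections import Counter

# Hypothetical request events as they might be parsed from access logs.
events = [
    {"path": "/login", "status": 200},
    {"path": "/login", "status": 500},
    {"path": "/cart",  "status": 200},
    {"path": "/cart",  "status": 200},
]

# A metric you only realised mattered later: per-status counts and the
# error rate, derived retroactively from the stored events.
status_counts = Counter(e["status"] for e in events)
error_rate = status_counts[500] / len(events)
print(status_counts)  # Counter({200: 3, 500: 1})
print(error_rate)     # 0.25
```

Had only the pre-aggregated error rate been stored, the question "which path produced the errors?" would be unanswerable; keeping the events keeps every future question open.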
Our vantage point is that all these sources of information fit well into a time-series text data store with a rich query capability, and that it is a huge advantage to be able to process both event-style information and metrics-style information in a shared tool, without having to limit your ability to debug and understand a system to what you thought was important up front. If you’ve ever had an “I wish I indexed that” moment, then you know what I mean.
InfoQ: Can you explain a little about how operational and infrastructure logging has evolved over the last five years? How have cloud, containers, and new language runtimes impacted monitoring and logging?
Software nowadays is no longer a single body of code you can build and test in isolation. Cloud, containers and all this tech obviously provide a lot of advantages, but at the cost of “understandability” (which is maybe the best way to view the term “observability”). The system components are increasingly scattered and remote, and less likely to be under your direct control. This evolution goes hand in hand with the DevOps movement, which has changed the way people think about software. There are a lot of teams now who have “a system they care about”, as opposed to just building a piece of software and “throwing it over the wall to ops”.
So “understanding” the behaviour of your software system is now largely only possible in the wild. Most software systems are a composition of other systems that are out of your control. Think of your software system as an autonomous car: it has to be put on the road to be tested and improved, but in many ways we’re still building software as if we could test it in the lab. This is the impact of cloud and containers on our software systems, and I think we have to come to terms with this to deal with it head-on.
InfoQ: How have new architectural styles, such as microservices and Function-as-a-Service (FaaS) -- which are in effect distributed systems -- impacted logging?
In terms of exposing the resource consumption of individual components, and being able to improve parts of your system individually, I think these are a win. But in terms of understanding your system as a whole — in particular if you don’t capture and centralise your logging — these mostly contribute adversely to the big picture, because information is scattered across diverse platforms and components.
It is daunting to speak for logging in general, but I can say what we do at Humio in this space. For platforms such as DC/OS, Mesos, Kubernetes, Heroku, CloudFoundry, AWS, etc. we provide integrations that make it simple to grab all the logs and put them in one place. On each of these platforms, logging is more or less done in a uniform way, and that lets a logging infrastructure capture a wide range of logs without a lot of configuration. So with these architectural styles, where you run your system on a shared infrastructure, you can now get the logs as a side effect of using the platform, which simplifies getting access to them.
InfoQ: What types of query are engineers typically making of logging systems, and how are modern logging platforms adapting to this?
We see our users going through an evolution: at first they make free-text searches, using the logging platform as a search engine for their logs. But the focus quickly changes to extracting information from the text and building aggregations over that extracted data.
As engineers get to know their logs, they develop metrics — aggregate queries — that are important for the health of the system. These are the queries that make up dashboards and the input data for alerting. For a majority of systems, you can live with these aggregate stats being computed from the logs, as opposed to being built into the system itself as a “monitoring metric”.
For debugging or incident response, you need a system that makes it easy to do ad-hoc queries, which makes it important to have a logging solution that does not impose a schema on what you log. In these situations, we generally see engineers asking questions about things that they did not “think about up front”.
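A schema-free store means the extraction can happen at query time instead of ingest time. As a hedged sketch (the log lines and field names are invented), this is what an incident-time ad-hoc query over unstructured lines looks like: pull out a field nobody planned to index, then filter on it.

```python
import re

# Free-form log lines stored as-is, with no schema imposed at ingest.
logs = [
    "2024-05-01T10:00:01 payment ok user=17 took=120ms",
    "2024-05-01T10:00:02 payment ok user=42 took=980ms",
    "2024-05-01T10:00:03 cache miss key=session:42",
]

# Ad-hoc query written during an incident: extract the 'took' durations
# at read time and find the unexpectedly slow requests.
durations = [int(m.group(1))
             for line in logs
             if (m := re.search(r"took=(\d+)ms", line))]
slow = [d for d in durations if d > 500]
print(slow)  # [980]
```

Because the field was extracted at query time, the same stored lines can answer tomorrow's entirely different question without re-ingesting anything.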
An interesting thing is the feedback loop that happens when developers realise that they can interact with the logs. You see logs evolve incrementally, becoming more structured as you try to debug or improve your system. The word “incremental” is important here, because you cannot build the perfect set of logs for your system from day one. So you end up refactoring your logs: new subsystems log more verbosely, and as a subsystem matures you tend to improve the signal-to-noise ratio of its logs. So, your logging platform should be able to cope with this diversity.
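One concrete shape this refactoring often takes is moving an event from free text to a structured representation. A minimal before-and-after sketch, with invented field names and values:

```python
import json

order_id, amount = "A-1001", 49.95  # hypothetical values

# Early, verbose form: free text, easy to write, hard to aggregate over.
verbose = f"finished processing order {order_id} for {amount:.2f} EUR"

# Refactored form: the same event as structured JSON, which a logging
# platform can parse and aggregate without regex gymnastics.
structured = json.dumps({"event": "order_processed",
                         "order_id": order_id,
                         "amount_eur": amount})

print(verbose)
print(structured)
```

Both lines describe the same event; the structured one simply makes the fields first-class, so dashboards and alerts can be built on `event` and `amount_eur` directly.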
InfoQ: What is the most common logging antipattern you see? Can you recommend an approach to avoid this?
Well, the thing that hurts my heart is to hear stories of someone unable to log (or, god forbid, unable to access their logs) because of quota, cost or company policy. We see customers who reduce their logging by sampling data at ingest. This can be necessary at scale, but should be avoided as far as possible.
I mentioned the refactoring of logs before. Logs naturally evolve from verbose output towards more structured logs with a better signal-to-noise ratio. This process is like weeding your garden, and I’d consider it an anti-pattern to neglect doing this.
InfoQ: What role do you think QA/Testers have in relation to the observability of a system, particularly in relation to logging?
Logs are super useful for testing and QA. As part of our own automated tests and CI setup, we capture logs from the builds and run queries over these as part of acceptance, as well as reporting and alerting for the builds. In this way, you can use log aggregation to construct integration tests as well as a means to improve performance and general quality of your tests.
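The idea of running queries over build logs as part of acceptance can be sketched very simply: aggregate the captured log lines and turn the result into a pass/fail gate. The log lines, severity format and zero-error threshold below are illustrative assumptions, not a description of Humio's own CI setup.

```python
# Hypothetical log lines captured from a CI build, each prefixed with a
# severity level.
build_logs = [
    "INFO  starting test suite",
    "WARN  retrying flaky connection",
    "ERROR timeout talking to stub service",
    "INFO  suite finished",
]

# The "query": count ERROR-level events in the build's logs.
error_count = sum(1 for line in build_logs if line.startswith("ERROR"))

# The acceptance rule: the build passes only if the run logged no errors.
build_passed = error_count == 0
print(error_count, build_passed)  # 1 False
```

In a real pipeline the same aggregate could also feed a dashboard of error counts per build, giving the reporting and alerting the interview describes.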
InfoQ: What will the impact of AI/ML be on logging, both in regards to implementing effective logging and also providing insight in issues (or potential issues)?
I think this will likely be big. For now, we focus on getting logs to the fingertips of developers, letting them interact with the data and apply human intelligence to interpret it. AI/ML generally requires a baseline to be able to identify outliers, and as a system that generates logs stabilises, the logging platform will be able to provide this baseline. The richness of logging and the high entropy of logs do pose a challenge for both AI and ML, as these techniques tend to do better in low-dimensionality settings.
I think it is impossible for a logging system to automagically detect outliers in arbitrary logs. You need some sort of interaction where users of the system extract and generalise the information in the logging flow that is deemed interesting for outlier detection. For particular kinds of logs this may be achieved more or less automatically, but in the general case you will need a data scientist’s capacity to choose what to look out for.
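To make the baseline idea concrete: once a user has chosen a low-dimensional signal to watch (here, an invented series of per-minute error counts), detecting outliers against a learned baseline is straightforward. This is a minimal z-score sketch under those assumptions, not a description of any product feature.

```python
import statistics

# Hypothetical per-minute error counts from a stabilised system; this
# history is the "baseline" the logging platform can provide.
baseline = [3, 4, 2, 5, 3, 4, 3, 2, 4, 3]
mean = statistics.mean(baseline)    # 3.3
stdev = statistics.stdev(baseline)  # ~0.95

def is_outlier(count: int, threshold: float = 3.0) -> bool:
    # Flag counts more than `threshold` standard deviations from the mean.
    return abs(count - mean) / stdev > threshold

print(is_outlier(4))   # False: a normal minute
print(is_outlier(40))  # True: a clearly anomalous spike
```

The hard part, as noted above, is not this arithmetic but choosing which extracted signal is worth baselining in the first place.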
InfoQ: Thanks once again for taking the time to sit down with us today. Is there anything else you would like to share with the InfoQ readers?
Thank you too. Feel free to swing by humio.com, and try our SaaS solution or ask us how to run Humio on your own gear. We are always keen to discuss these ideas about logging in more depth!
About the Interviewee
Kresten Krab provides technical leadership and vision at Humio. In his previous role as CTO of Trifork, Kresten was responsible for technical strategy and provided consulting advice to teams on a variety of technologies, including distributed systems and databases, Erlang, Java, and mobile application development. Kresten has been a contributor to several open source projects, including GCC, GNU Objective-C, GNU Compiled Java, Emacs, and Apache Geronimo/Yoko. Prior to Trifork, Kresten worked at NeXT Software (later acquired by Apple), where he was responsible for the development of the Objective-C tool chain, the debugger, and the runtime system. Kresten has a Ph.D. in computer science from the University of Aarhus.