
Why Change Intelligence is Necessary to Effectively Troubleshoot Modern Applications

Key Takeaways

  • The complexity of microservices systems, and the cost impact of outages, require greater investment in rapid remediation and root cause identification to decrease downtime.
  • Observability is more than just collecting metrics, logs, and traces. In order for that data to be useful, it must be connected to business requirements.
  • Monitoring is static and observability is dynamic. The old world of pre-defining thresholds for alerts is evolving to a world where real-time problems require real-time solutions that weren’t thought of in advance.
  • Change Intelligence is the missing component within most organizations today. The ability to understand not only when something changed -- but why it changed and who changed it -- requires engineers to be able to intelligently correlate monitoring and observability data to arrive at the root cause of an incident much more rapidly.
  • Telemetry provides the building blocks that enable change intelligence to identify and map the root cause, based on changes in the system and their broader impact.

 

Microservices and highly distributed systems are extremely complex. There are many moving parts including the applications themselves, the infrastructure, versions, and configurations. Often, this results in difficulties tracking what is actually in production or other development environments (QA, Development, Pre-Prod), which in turn becomes problematic when you need to troubleshoot your systems. 

In this article, I will provide some clarity around the different use cases for monitoring and observability, when each is relevant, and how to use them properly. I will then focus on Change Intelligence, a new way of making telemetry such as metrics, logs, and traces actionable in order to troubleshoot incidents effectively in real time. To date, observability has focused on aggregating relevant data about your systems, and monitoring has been the set of standardized checks that validate, based on this data, that everything is working properly. Change Intelligence augments this existing telemetry, providing more context in a world where everything is distributed and companies now have tens, if not hundreds, of services all communicating with each other. This makes it possible to correlate recent changes and ultimately understand their impact on your systems as a whole.

The Observability & Monitoring Puzzle

Let’s start by getting on the same page about monitoring and observability, and how they are often implemented today in engineering organizations running complex microservices operations.

You’ll often hear folks say that observability is the sum of metrics, logs and traces -- but the truth is this telemetry is just the prerequisite to doing observability properly. In order for the data to actually be usable, you need to ensure that it is connected to your business requirements and the way your application works. 

When thinking about monitoring, what comes to mind for most people are dashboards. They can be very pretty, but most developers don’t really want to spend their entire day looking at them. The checks behind them are most often static alerts that require developers to pre-determine what could possibly go wrong.

An interesting trend I am seeing is to leverage observability tools to find the root causes of problems, and then bake that learning into the monitoring tools to watch for the problem repeating in the future. If you have proper documentation and runbooks, the union between monitoring and observability can do wonders for your reliability program.

These checks can be anything from the simple example of making sure a certain system’s latency stays below a defined threshold, to a more complex example of checking that a complete business flow works as expected (adding an item to the cart and checking out successfully). You may believe that having both monitoring and observability in place already gives you the whole picture of what’s going on in your systems, but you’re still missing a critical piece of the puzzle.
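Before turning to that missing piece, here is a minimal sketch of what such checks might look like if written by hand in Python; the endpoint paths, the 500 ms threshold, and the cart/checkout flow are hypothetical placeholders, and in practice most teams express these as synthetic tests or alert rules inside their monitoring tool rather than as standalone scripts.

import time
import requests

BASE_URL = "https://shop.example.com"  # hypothetical service under test

def check_latency(path="/api/health", threshold_seconds=0.5):
    # Simple check: the endpoint must respond successfully within the threshold.
    start = time.monotonic()
    response = requests.get(BASE_URL + path, timeout=5)
    elapsed = time.monotonic() - start
    return response.ok and elapsed < threshold_seconds

def check_checkout_flow():
    # More complex check: a complete business flow must work end to end.
    session = requests.Session()
    added = session.post(BASE_URL + "/api/cart/items",
                         json={"sku": "demo-sku", "qty": 1}, timeout=5)
    checkout = session.post(BASE_URL + "/api/checkout", timeout=5)
    return added.ok and checkout.ok

if __name__ == "__main__":
    print("latency ok:", check_latency())
    print("checkout ok:", check_checkout_flow())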

The Missing Piece: Change Intelligence

To be able to truly gain the insights you require from your systems when problems arise, you need to add another piece to the puzzle, and that is Change Intelligence. Change Intelligence includes not only understanding when something has changed, but also why it has changed, who changed it, and what impact the change has had on your systems.

The existing onslaught of data is often overwhelming for operations engineers. Change Intelligence was therefore introduced to provide broader context around the telemetry and information you already have. For example, if you have three services talking to each other and one of them has an elevated error rate, your telemetry gives you a good indication that something is wrong.

This is an excellent basis for suspecting something is wrong in your system; however, the next and more critical step is always to start digging for the root cause behind this anomalous telemetry data. Without Change Intelligence, you are going to struggle to understand the root cause, even if you are following a step-by-step guide you have worked with before, because ultimately every incident is unique.

So what does change intelligence consist of?

To begin with, properly implemented change intelligence should include all the context related to:

  • which service is communicating with which service (as in a service map);
  • how changes made in Service A can impact other services, through both downstream and upstream dependencies;
  • configuration changes made not only at the application level, but also in the infrastructure layer or cloud environment;
  • versions that are not pinned, causing a snowball effect outside your control;
  • cloud environment outages, and anything else that can impact your business continuity.
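As a rough illustration of the kind of data this implies, here is a minimal sketch of how a change event and a service map might be modeled; the field names and services below are hypothetical, not taken from any particular tool.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class ChangeEvent:
    # What changed, where, when, and who made the change.
    timestamp: datetime
    service: str    # e.g. "checkout"
    kind: str       # "deploy", "config", "infrastructure", "cloud-provider"
    author: str     # attribution: who made the change
    summary: str    # e.g. "bumped payment-client from 2.3 to 3.0"

# A simple service map: which services each service depends on (hypothetical).
SERVICE_MAP = {
    "frontend": ["checkout", "catalog"],
    "checkout": ["payments", "inventory"],
    "payments": [],
    "inventory": [],
    "catalog": [],
}

def impacted_services(changed_service):
    # Walk the map upstream: anything that depends, directly or transitively,
    # on the changed service may be affected by the change.
    impacted, frontier = set(), {changed_service}
    while frontier:
        current = frontier.pop()
        for svc, deps in SERVICE_MAP.items():
            if current in deps and svc not in impacted:
                impacted.add(svc)
                frontier.add(svc)
    return impacted

# impacted_services("payments") -> {"checkout", "frontend"}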

So just to tie all three of these disciplines together:

  • Observability gives you the data in the format that aligns with your business needs;
  • Monitoring is a set of checks to ensure your business is running as it should;
  • Change intelligence correlates all of this information to enable you to get to the root cause of an incident much more rapidly, by understanding what/why/who/how things changed in your systems.

Troubleshooting Incidents in Practice

As someone who worked in a monolithic environment before making the move to microservices, I have firsthand experience with the tremendous difference between these types of environments. Whereas monitoring and observability are nice to have in monolithic systems, with microservices they are an absolute necessity.

Trying to troubleshoot multiple services that have different purposes and perform different tasks in your systems is extremely complex. These services are usually split into small chunks and run several operations at a time, requiring constant communication between them.

For example, when you get an alert (usually in the form of a page or a message on Slack) notifying you that part of the business is not working properly, in real large-scale distributed environments this can often be attributed to any of a number of different services. It’s never immediately apparent which service is failing without proper monitoring and observability. These help you understand where in this pipeline of microservices the issue lies, and which component specifically is failing.

When you are in the throes of an incident, you’re going to spend most of your time troubleshooting by trying to understand the root cause of the issue. You’ll start by trying to figure out where the issue has occurred among your hundreds of apps or servers, and then once you isolate the failing service or application, you’ll want to understand what exactly happened. This assumes a few prerequisites that aren’t always fulfilled:

  1. You have the required permissions to all of the relevant systems
  2. You understand the entire stack and all of the technologies within all these systems
  3. You have the experience required to understand the issue sufficiently to solve it

As a DevOps engineer (today at Komodor, formerly at Rookout), I have encountered these types of scenarios regularly, so here’s a brief story from the trenches. I remember an incident where my team and I started to receive a lot of errors coming from a key service in our system [spoiler: the bottom line was that we were receiving numeric values that did not match the column type when we tried to insert them into our database].

The only error information we had to work with was: invalid value. We then had to scour our system and recent changes to try to understand the data and errors we were dealing with, only to spend a whole day researching the error and finally discover that the culprit was a change implemented seven months prior. The column type in the database was integer, while we were trying to insert numbers that were bigger, hence requiring a biginteger column type. Without any platform or system to help us correlate these errors to the relevant changes, made over seven months earlier, even something as simple as an overly large number took out a whole experienced team for an entire day.
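To make that failure mode concrete, here is a minimal sketch of the kind of mismatch involved, assuming a PostgreSQL database accessed through psycopg2; the table and column names are hypothetical, and the exact error text will vary by database and driver.

import psycopg2

conn = psycopg2.connect("dbname=example")  # hypothetical connection string
cur = conn.cursor()

# The column was declared as a 32-bit INTEGER (maximum 2,147,483,647)...
cur.execute("CREATE TABLE IF NOT EXISTS readings (value INTEGER)")

try:
    # ...so inserting a larger value fails until the column is migrated to BIGINT.
    cur.execute("INSERT INTO readings (value) VALUES (%s)", (3_000_000_000,))
except psycopg2.errors.NumericValueOutOfRange as exc:
    print("insert failed:", exc)  # PostgreSQL reports "integer out of range"

# The fix in this sketch: ALTER TABLE readings ALTER COLUMN value TYPE BIGINT;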

So, with that example in mind, let’s now look at some additional examples showing what troubleshooting looks like with and without Change Intelligence involved.

Observability Example

In this example, you will find a dashboard that shows the number of requests, the number of errors, and the response latency.
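The dashboard itself is just a view over telemetry that the service emits. As a rough sketch of where those three signals might come from, here is how a service could expose request count, error count, and latency using the Prometheus Python client; the metric names and the simulated handler are hypothetical, and the same idea applies to Datadog or any other backend.

import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names; a real service would label these per endpoint.
REQUESTS = Counter("app_requests_total", "Total requests handled")
ERRORS = Counter("app_errors_total", "Total requests that failed")
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

def handle_request():
    with LATENCY.time():            # records response latency
        REQUESTS.inc()              # counts every request
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        if random.random() < 0.02:  # simulate an occasional failure
            ERRORS.inc()            # counts errors separately

if __name__ == "__main__":
    start_http_server(8000)         # metrics are scraped from :8000/metrics
    while True:
        handle_request()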

Monitoring Based on Observability

Next, monitoring comes in: it ingests this data and applies thresholds that check whether the current behavior is acceptable, both on its own and in the context of its history.

This is the result of a ‘query’ that is periodically evaluated against the historical and live data, to alert on any activity above a 2% error rate, e.g.:

avg(last_5m):sum:trace.authorization.worker.handle.errors{env:production,service:authorization,resource_name:web} by {resource_name} / sum:trace.authorization.worker.handle.hits{env:production,service:authorization,resource_name:web} by {resource_name} > 0.02
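For readers less familiar with Datadog’s monitor syntax, the logic this query encodes is simple: divide errors by hits over the last five minutes and compare against 2%. A rough Python equivalent, assuming you already have the two counts, would be:

def error_rate_exceeded(errors_last_5m, hits_last_5m, threshold=0.02):
    # Mirrors the monitor above: errors divided by hits over the last five
    # minutes, compared against a 2% threshold.
    if hits_last_5m == 0:
        return False
    return errors_last_5m / hits_last_5m > threshold

# error_rate_exceeded(errors_last_5m=30, hits_last_5m=1000) -> True (3% > 2%)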

What Happens Next without Change Intelligence:

Without a Change Intelligence solution, you’ll receive an alert from Datadog (based on the example above) indicating that the 2% error threshold was exceeded. You begin wondering “why did this happen?” and draw up some theories: the app code may have changed, there may be network issues, cloud provider or third-party tool issues, or the problem may even be related to another service that is having issues of its own. To find the actual answer, you’ll need to scour through many metrics and logs and try to piece together what happened, with little indication of what, where, and how it happened, since this kind of data is lacking in current monitoring and observability tools.

Monitoring & Observability with Change Intelligence:

You will still receive the alert from Datadog, but the difference here is that your next troubleshooting steps will be significantly easier, since a Change Intelligence solution already provides the necessary context about all of the theories above. Using a Change Intelligence solution as your single source of truth, you can immediately see what changed in the recent history, correlate those changes to what could have impacted the service (e.g. code changes, config changes, upstream resources, or a change in a related service), and quickly arrive at the root cause, instead of scouring through multiple solutions and their logs and metrics, trying to piece it together to find the needle in the haystack.

This Change Intelligence can be based on release notes, audit logs, version diffs, and attributions (who made the change). A correlation map of the change is then cross-referenced against the many connected services to find the most likely culprit for the failure, enabling much more rapid recovery.

By providing data in the form of a timeline and a service map, and not just a dashboard with thresholds and limits, you gain greater context about the overall system.

The screenshot above shows an example of a Change Intelligence solution for K8s (Komodor) surfacing an alert triggered by Datadog. Now you have a timeline showing all of the changes that occurred in a specific service prior to the issue, giving you the relevant context to get to the root cause faster.

As shown in the screenshot above, we can use this information to determine the root cause of an issue much more quickly by following the trail back from our starting point, the triggering of the Datadog monitor, and seeing what happened or changed. In this simple case, just before the Datadog alert was triggered, we can see that there was a health change event indicating that not enough ready replicas were available for this application. Before that, a new version of the app was deployed. It might be that availability wasn’t guaranteed during the deployment, or that code changes affected this app and introduced a bug or a major change. By zooming in on the details of that deploy, we would be able to see in seconds why this alert was triggered.
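Without such a tool, reconstructing even this short timeline usually means querying the cluster by hand. As a rough sketch of that manual digging, assuming the official Kubernetes Python client and a hypothetical namespace and service name, you might pull recent events and line them up against the alert time:

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

NAMESPACE = "production"     # hypothetical namespace
SERVICE = "authorization"    # hypothetical failing workload

# Pull recent cluster events for objects belonging to the service and order
# them by time, so deploys and health changes can be compared to the alert.
events = core.list_namespaced_event(NAMESPACE).items
relevant = [
    e for e in events
    if e.last_timestamp and e.involved_object.name.startswith(SERVICE)
]
for event in sorted(relevant, key=lambda e: e.last_timestamp):
    print(event.last_timestamp, event.reason, "-", event.message)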

When the Data + Automation Just Isn’t Enough

Systems are becoming increasingly complex, with many different services, applications, servers, infrastructure components, versions, and much more, all at a scale that was previously unheard of. The tools that got organizations here may not be enough to power tomorrow’s systems and stacks.

Once upon a time there were logs; then came tracing, and after that metrics, all brought together into dashboards constructed to provide visual indications of operational health. Over time, more and more tools have been added to the chain to help manage the influx of enormous amounts of data, alerts, and information.

Change intelligence will be a critical piece in enabling future stacks, providing an added layer of actionable insight on top of existing monitoring and observability tools. This approach will ultimately make the difference between rapid recovery that upholds today’s ironclad SLAs, and costly, painful, and potentially lengthy downtime.


 
