Over the years we have discussed many of the challenges of implementing and deploying microservices, and one common thread throughout has been how to monitor what is happening within a distributed application built from microservices, particularly as complexity rises and the notion of choreographies becomes ever more important. For example, early in 2017, when we held our Microservices In Practice Virtual Panel, several of the participants had something to say about monitoring when asked about their top five dos and don'ts, with Martijn Verburg stating:
Build three prototype services that communicate with each other and figure out how to do all of the non-functional requirements like security, service discovery, health monitoring, back pressure, failover etc., *before* you go and build the rest.
And when asked about particular languages or technologies to recommend when building microservices, Adam Bien said:
Java is 20 years old, mature, and comes with unbeatable tooling and monitoring capabilities. At the very beginning, Java already incorporated microservice concepts with the Jini / JXTA frameworks mixed with no-SQL databases like JavaSpaces. As often -- Java was just 15 years too early. The market was not ready for the technology back then. However, all the design principles from 1999 still apply today. We don't have to reinvent the wheel.
Over the last year or so, Linux containers and microservices have become almost synonymous, and this has also affected how people think about monitoring. Recently, in the vein of predictions for the coming year, Péter Márton, CTO of RisingStack, had a few things to say on the subject. He starts by covering some of the basics:
Current APM (Application Performance Monitoring) solutions on the market, like NewRelic and Dynatrace, rely heavily on different levels of instrumentation, which is why you have to install vendor-specific agents to collect metrics into these products. Agents can instrument your application at various places. They can extract low-level language-specific metrics like Garbage Collector behavior, or library-specific things like RPC and database latencies as well.
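To make the idea of instrumentation a little more concrete, the following sketch shows roughly what an agent does when it wraps a library call so that its latency is captured as a metric; the names `recordMetric` and `queryDatabase` are hypothetical stand-ins, and real agents apply such wrapping automatically at load time rather than asking the developer to do it.

```typescript
// Minimal sketch of agent-style instrumentation: wrap a function so
// every call records its latency. All names here are illustrative.

type Metric = { name: string; value: number; timestamp: number };

const metrics: Metric[] = []; // stand-in for an agent's metric buffer

function recordMetric(name: string, value: number): void {
  metrics.push({ name, value, timestamp: Date.now() });
}

// A stand-in for a database client call the agent wants to observe.
async function queryDatabase(sql: string): Promise<unknown[]> {
  return []; // imagine a real driver call here
}

// Agents typically monkey-patch functions like this at load time, so
// application code keeps calling the library unchanged.
function instrument<T extends unknown[], R>(
  name: string,
  fn: (...args: T) => Promise<R>
): (...args: T) => Promise<R> {
  return async (...args: T): Promise<R> => {
    const start = Date.now();
    try {
      return await fn(...args);
    } finally {
      recordMetric(`${name}.latency_ms`, Date.now() - start);
    }
  };
}

const tracedQuery = instrument('db.query', queryDatabase);
```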
He then cautions against going down the APM route too far or too quickly, ending with a prediction on this:
Having vendor-specific agents can also lead to a situation where developers start to use multiple monitoring solutions and agents together because they miss some features from their current APM solution. Multiple agents usually mean multiple instrumentations on the same piece of code, which can lead to unnecessary performance overhead, false metrics or even bugs. I think that the trend of using vendor-specific agents will change in the future, and APM providers will join their efforts to create an open standard for instrumenting code. The future could lead to an era where agents are vendor-neutral and all of the value comes from different backend and UI features.
Márton then jumps to the related topic of distributed tracing because, in his view, the emergence of containers and microservices has driven the need for developers to enhance the art of observability in order to monitor and debug. We have also touched on distributed tracing technologies in the past, such as Zipkin, and most recently heard from Cindy Sridharan on observability:
Logging, metrics and request tracing are the mainstays of observability. Logs provide additional context to data like metrics. However, logs are also costly in terms of performance. In contrast, metrics have a constant overhead and are good for alerts. Taken together, logs and metrics are good for insights into individual systems, but make it difficult to see into the lifetime of a request that has traversed multiple systems. This is pretty common in distributed systems. Tracing offers the ability to track a request as it moves through various systems.
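A minimal sketch helps illustrate that last point: as long as each hop reuses the trace ID it received and records its own span against it, the lifetime of a request can be reconstructed across systems. The header names and span shape below are purely illustrative, not the wire format of any particular tracer.

```typescript
// Sketch of trace context propagation: one trace ID shared across hops,
// one span per hop. Header names and the in-memory "backend" are
// illustrative only.
import * as http from 'http';
import { randomUUID } from 'crypto';

interface Span {
  traceId: string;   // shared by every span in one request's lifetime
  spanId: string;    // unique to this hop
  parentId?: string; // the span that caused this one
  operation: string;
  startMs: number;
  durationMs?: number;
}

const collected: Span[] = []; // stand-in for a tracing backend

// Receiving side: join the trace described by the incoming headers, or
// start a new trace if this service is the entry point. A calling
// service would set these same headers on its outbound request.
const server = http.createServer((req, res) => {
  const span: Span = {
    traceId: (req.headers['x-trace-id'] as string) ?? randomUUID(),
    spanId: randomUUID(),
    parentId: req.headers['x-span-id'] as string | undefined,
    operation: 'service-b.handle',
    startMs: Date.now(),
  };
  res.end('ok');
  span.durationMs = Date.now() - span.startMs;
  collected.push(span);
});
server.listen(8081);
```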
And Márton agrees, covering some examples in his own article before moving on to the OpenTracing effort and its importance, explaining how the members want to provide a:
[...] standard, vendor-neutral interface for distributed tracing instrumentation. OpenTracing provides a standard API to instrument your code and connects it with different tracing backends. It also makes it possible to instrument your code once and change the tracing backend at any time without trouble.
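The pattern is straightforward to show in code. In the following sketch, which uses the `opentracing` JavaScript package, application code talks only to the vendor-neutral API, while the concrete backend is chosen in a single place at startup; `createConfiguredTracer` is a hypothetical factory standing in for whichever Zipkin, Jaeger or other tracer implementation is wired up.

```typescript
// Instrument once against the vendor-neutral OpenTracing API; the
// backend is decided in a single place and can be swapped freely.
import * as opentracing from 'opentracing';

// One line picks the backend; the rest of the code is untouched.
// opentracing.initGlobalTracer(createConfiguredTracer()); // hypothetical factory

async function handleCheckout(orderId: string): Promise<void> {
  const tracer = opentracing.globalTracer(); // no-op tracer until one is set
  const span = tracer.startSpan('checkout');
  span.setTag('order.id', orderId);
  try {
    // ... business logic, instrumented once regardless of backend
    span.log({ event: 'order_processed' });
  } finally {
    span.finish();
  }
}
```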
The original entry includes some examples of using OpenTracing, particularly from the perspective of Node.js development. He concludes with an emphasised statement and perhaps a request: "I expect more and more standardized solutions for instrumentation in the future, and I hope one day all of the APM providers will work together to provide the best vendor-neutral agent."
However, he has more to say on OpenTracing and how it can work with Elasticsearch and Prometheus, including some examples and illustrative graphs showing the power and possibilities of infrastructure topology visualisation, which, as Márton points out, helps in understanding correlations during incidents. Furthermore, he references a Node.js Metrics Tracer project from RisingStack, which he says they can use to:
[...] reverse engineer the whole infrastructure topology based on this information and visualize the dependencies between services. From these metrics, we can also gain information about throughput and latencies between applications and databases in our microservices architecture.
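The underlying idea is simple enough to sketch: if every span records which service emitted it and which service initiated the call, then the set of caller-to-callee pairs, aggregated with latency figures, is the dependency graph. The span shape below is illustrative rather than the actual format used by the RisingStack tracer.

```typescript
// Derive a service dependency graph from span data: each parent/child
// service pair becomes an edge, with call counts and latency totals.

interface ServiceSpan {
  service: string;        // service that emitted the span
  parentService?: string; // service that initiated the call, if any
  durationMs: number;
}

interface Edge { from: string; to: string; calls: number; totalMs: number }

function buildTopology(spans: ServiceSpan[]): Edge[] {
  const edges = new Map<string, Edge>();
  for (const s of spans) {
    if (!s.parentService) continue; // root spans have no incoming edge
    const key = `${s.parentService}->${s.service}`;
    const edge =
      edges.get(key) ?? { from: s.parentService, to: s.service, calls: 0, totalMs: 0 };
    edge.calls += 1;
    edge.totalMs += s.durationMs;
    edges.set(key, edge);
  }
  return [...edges.values()];
}

// e.g. buildTopology(spans).map(e => `${e.from} -> ${e.to} (avg ${e.totalMs / e.calls} ms)`)
```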
Back in early 2016, we interviewed a number of people for The Challenge of Monitoring Containers at Scale article, and at the time Dynatrace chief technical strategist Alois Reitbauer had this to say on the related aspect of understanding and using the data collected from tracing:
[...] everybody must be able to understand monitoring data. This is why we invested a lot of time into building self-explanatory infographics everybody can understand. Another key requirement is anomaly detection. Due to the massive scale, nobody can look at all these numbers manually. So monitoring systems have to learn normal behaviour and indicate when system behaviour is not normal any more. The last aspect is contextual semantic information. An example is that a monitoring system needs to "understand" what a metric means and how it is related to other metrics. We need to learn all the dependencies in an application landscape, and this information can then be used in problem analysis.
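As a toy illustration of the "learn normal behaviour" idea, the sketch below keeps a history of a metric and flags the latest value when it deviates from the learned baseline by more than a few standard deviations; production systems use far more sophisticated models, and the threshold of three here is an arbitrary assumption.

```typescript
// Flag a metric value as anomalous if it sits more than `zThreshold`
// standard deviations away from the mean of its recent history.
function isAnomalous(history: number[], latest: number, zThreshold = 3): boolean {
  const mean = history.reduce((a, b) => a + b, 0) / history.length;
  const variance =
    history.reduce((a, b) => a + (b - mean) ** 2, 0) / history.length;
  const stdDev = Math.sqrt(variance);
  if (stdDev === 0) return latest !== mean; // flat baseline: any change is anomalous
  return Math.abs(latest - mean) / stdDev > zThreshold;
}

// e.g. isAnomalous([120, 115, 130, 118, 125], 310) === true for a latency spike
```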
Márton concludes his article with the following prediction:
To take microservices monitoring and observability to the next level and bring about the era of the next APM tools, an open, vendor-neutral instrumentation standard is needed, like OpenTracing. This new standard needs to be adopted by APM vendors, service providers, and open-source library maintainers alike.