Key Takeaways
- Traditionally, monitoring software has relied heavily on agent-based approaches for extracting telemetry data from servers, network appliances, and processes.
- These agents have become highly commoditized, and as software and system design have rapidly advanced, they have failed to keep up with the needs of modern cloud-native software. They encourage vendor lock-in and make it difficult to control monitoring costs.
- Observability requires better telemetry than agents currently provide, and that telemetry data should be a built-in feature of libraries, cloud services, and frameworks.
- OpenTelemetry is driving advances in this area by defining a standard format and APIs for creating, transmitting, and storing telemetry data. This unlocks new opportunities and advances in monitoring, observability, and performance optimization.
I like to joke that any sufficiently advanced software can be reduced to “fancy ETL (extract-transform-load).” Admittedly, this is a blithe oversimplification that gets me booed out of fancy cocktail parties, but the best oversimplifications have at least one foot in the truth.
Inspector Gadgets
This adage holds especially true in the realm of monitoring and observability; the first step on your journey to measure performance is often a shell script or installer, plus an awful lot of configuration files, that extracts performance data, transforms it into a proprietary format, and loads it into a database.
Commercial and open source monitoring systems alike generally require you to install and maintain these agent processes - running on virtual machines, injected into your process, or deployed as container sidecars - perhaps even all three.
These agents require care and feeding, such as security and configuration updates, along with attention to business concerns such as cost management and resource tuning. They also become a load-bearing dependency for our tech stacks, wedging themselves so deeply into our systems that we can’t imagine life without them.
For the practices of monitoring and observability to truly move forward, however, we must seek to pry these agents out of our systems. I believe that the future of monitoring is going to be agentless in order to keep up with the increasing complexity of our systems.
To understand why this is the case, though, we should first step back and understand why agents are so popular and widespread today. After that, I’ll discuss why agent-based monitoring approaches negatively impact our systems and organizations. Finally, I’ll touch on why the future of monitoring is agentless, and what we can look forward to over the next several years.
The proliferation of agent-based monitoring is perhaps a natural consequence of several trends in application development and system design.
- Advances in monitoring and observability have outpaced the replacement rate of our software systems.
- Standardization of monitoring technology has focused on vertical, rather than horizontal, integration.
- Commercial products have relied on agents as a differentiation point, disincentivizing efforts to reduce their use.
Let’s explore each of these in a little more detail and discuss how it contributes to the problem.
Innovating our way into a corner
The pace of innovation is one of the biggest contributing factors to agent sprawl. The past decade has seen the rise of virtualization become eclipsed by the rise of infrastructure-as-a-service and the public cloud become eclipsed by the rise of containers become eclipsed by the rise of container orchestration and Kubernetes become eclipsed by the rise of serverless and edge computing and… well, you see my point.
Each of these technologies offers new opportunities and challenges in terms of monitoring, which has resulted in something of a monitoring arms race between incumbent vendors striving to keep up with the newest platforms and new insurgent vendors who focus on providing solutions to whatever the newest and hottest tech is. Both parties often arrive at the same logical destination, however - to incorporate data from these changing platforms and runtimes, they need to deploy something to collect telemetry.
The reality of building an observability platform is that the data format you use isn’t necessarily going to be something you can ship to your clients or build into their software, or the platforms they’re using - you need a translation layer. Agents, traditionally, have provided that layer and integration point.
It works great, just don’t touch anything.
These myriad platforms have also led to myriad applications written in myriad languages, communicating with each other over increasingly standardized protocols. This has stymied efforts to standardize the language of performance monitoring; that is, the way that applications and platforms emit telemetry data, the query language and structure for that data, and how the data is visualized and parsed.
While we’ve seen ‘vertical’ integration of these technologies in specific platforms and runtimes (such as .NET or Spring), there’s been less success in creating broad ‘horizontal’ integration across a variety of them; standardized telemetry from Go, JavaScript, C#, various network appliances, container runtimes, and so on is much harder to come by, especially when you try to integrate multiple versions of any of these - a common occurrence in most companies.
Agents obviously serve as a point solution here, but an imperfect one. An agent can not only consume telemetry from multiple systems, it can process that telemetry into a desired format - and, quite often, agents can hook into systems that don’t emit certain types of telemetry and produce it for you. For example, an agent can generate trace data from a web application by wrapping its connections in order to perform APM (application performance monitoring) or RUM (real user monitoring).
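To make the “wrapping” idea a bit more concrete, here’s a toy sketch in Python of the trick most agents rely on: monkey-patching a function so that every call emits a timing record as a side effect. The names here (handle_request, instrument) are invented for illustration and aren’t taken from any particular product.

```python
import functools
import time

# A hypothetical application function the agent knows nothing about ahead of time.
def handle_request(path):
    time.sleep(0.01)  # pretend to do some work
    return f"200 OK for {path}"

# What agent-style auto-instrumentation does, in miniature: wrap the original
# function so that every call produces a timing record (a crude "span").
def instrument(fn, report):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return fn(*args, **kwargs)
        finally:
            report({"name": fn.__name__, "duration_ms": (time.time() - start) * 1000})
    return wrapper

# The "agent" injects itself by replacing the original symbol at runtime.
handle_request = instrument(handle_request, report=print)

handle_request("/checkout")  # prints a span-like record as a side effect
```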
It’s a lockdown, baby.
With this in mind, it’s easy to see how commercial products have embraced agents. They’re easy to reason about, their value can be explained clearly, and they dangle the promise of immediate results for little effort on your part. You can certainly see how they made sense in a world where server boxes and virtual machines were the primary unit of computing. What is odd, however, is how commoditized these agents have become. It’s true that there are distinctions between agents in terms of configuration, performance, and other features -- but there isn’t a lot of daylight between them in terms of the actual telemetry data they generate.
This isn’t terribly surprising if you consider that the agents are scraping telemetry data that’s made available by whatever commodity database, web server, or cache you’re using. We could ask ourselves why these commodity services don’t all adopt a lingua franca - a single format for expressing metrics, logs, and traces. Perhaps a more interesting question is why the agents that monitor them don’t share a common output format.
Your personal answer to these questions has a lot to do with how cynical you are. A more virtuous-minded individual may surmise that these agents are a necessary and valuable component of a monitoring system. The most cynical might suggest that vendors have explicitly relied on these agents being difficult to migrate away from, and have resisted standardization efforts in order to encourage lock-in to their products.
Personally, I land somewhere in the middle. Obviously, vendors prefer a system that’s easy to set up and hard to leave, but agents have made it easy for engineers and operators to quickly derive value from monitoring products. However, the impedance mismatch between what’s good for quick onboarding and what’s good for long-term utility can be staggering, and it contributes to flaws in observability practice.
Bad Booking
Agent-based monitoring systems tend to discourage good observability development practices in applications. They encourage a reliance on “black box” metrics from other components in the stack, or on automatic instrumentation of existing libraries with little extension into business logic. This isn’t meant to downplay the usefulness of these metrics or instrumentations, but to point out that they aren’t enough.
Think about it this way -- what would you say if the only logs you had access to for your application were the ones that your web framework or database provided? You could muddle through, but it’d be awfully hard to track down errors and faults caused by bugs in the code you wrote. We don’t bat an eyelash, though, when it’s suggested that our metrics and traces be generated entirely through agents!
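By way of contrast, here’s a small sketch (using the OpenTelemetry Python API, with a hypothetical process_order function) of what extending instrumentation into business logic can look like: the framework or agent may still create the inbound request span, but your code enriches it with the things only your code knows.

```python
from opentelemetry import trace

tracer = trace.get_tracer("shop.orders")  # hypothetical instrumentation name

def process_order(order):
    # Auto-instrumentation (or an agent) has likely already started a span
    # for the inbound request; add business-level context it can't infer.
    span = trace.get_current_span()
    span.set_attribute("order.item_count", len(order["items"]))
    span.set_attribute("order.payment_method", order["payment_method"])

    # Wrap the business-critical step in its own span, too.
    with tracer.start_as_current_span("charge_payment") as charge:
        charge.set_attribute("order.total", order["total"])
        # ... call out to the payment provider here ...
```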
Let me be clear, I’m not advocating that we throw our agents into the trash (at least, not immediately), or that the telemetry we’re getting from them is somehow valueless. What I am saying, however, is that our reliance on these agents is a stumbling block that makes it harder to do the right thing overall. The fact that they’re so ubiquitous means that we rarely get the chance to think about the telemetry our entire application generates in a holistic fashion.
More often than not, we’re being reactive rather than proactive. This can manifest in many different ways -- suddenly dealing with cardinality explosions caused by scaling insufficiently pruned database metrics, an inability to monitor all of our environments due to agent cost or overhead, or poor adoption due to dashboards that tell us about everything except what’s actually happening. New telemetry points are added in response to failures, but are often disconnected from the whole, leading to even more complexity. Our dashboards proliferate for each new feature or bug, but failures tend to manifest in new and exciting ways over time, leaving us with a dense sprawl of underutilized and unloved metrics, logs, and status pages.
Agents have placed us in an untenable position - not just us as the engineers tasked with maintaining complex systems, but our colleagues and organizations as well. Observability is a crucial aspect of scaling systems and adopting DevOps practices such as CI/CD, chaos engineering, and feature flags. Agents don’t make a lot of sense in this world! They artificially limit your view of a system to what the agent can provide.
The insights delivered by agents can be useful if you know what you’re looking at, but the custom dashboards and metrics delivered for a piece of software such as, say, Kafka can be inscrutable to people who aren’t already experts. Spurious correlations born of misinterpreting that data can lead to longer downtime, on-call fatigue, and more. And while the engineering impact of agents is important to consider, the business impact is crucial as well.
Datadog’s default Kafka dashboard. If I don’t know a lot about Kafka, how can I interpret this?
What, then, is to be done?
If it isn’t obvious by now, I don’t think that agents are the future. In fact, I think that agents will become less and less important over time, and that developers will begin to incorporate observability planning into their code designs. There are a few reasons I believe this to be the case.
- Projects such as OpenTelemetry, OpenMetrics, and OpenSLO are creating broadly-accepted standards for telemetry data and how that data should be used to measure performance.
- Standardization on OpenTelemetry will lead to instrumentation becoming a built-in feature of not only RPC libraries and web frameworks, but cloud services and external APIs.
- The glut of telemetry that’s being generated by our systems will drive a focus on cost controls and observability ROI, necessitating a reevaluation of traditional agent-based approaches.
It’s striking to see how OpenTelemetry has changed the conversation about monitoring and observability. Perhaps it’s unsurprising, given its popularity and the level of commitment that the open source and vendor communities have already given it. Kubernetes has integrated OpenTelemetry for API server tracing, and multiple companies have started to accept native OpenTelemetry format data.
OpenTelemetry is the second-most active CNCF project, after Kubernetes.
OpenTelemetry enables developers to spend less time relying on agents for ‘table stakes’ telemetry data, and more time designing metrics, logs, and traces that give actionable insights into the business logic of their applications. Standard attributes for compute, database, serverless, containers, and other resources can lead to a new generation of performance and cost optimization technology. This latter step is crucial, as it’s not enough to simply log everything and age it out; we need to measure what matters in the moment, while preserving historical trend data for future use.
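As a rough illustration of what those standard attributes look like in practice, here’s a minimal OpenTelemetry Python SDK setup that attaches resource attributes (following semantic conventions such as service.name) to every span it exports; the service name and version here are placeholders.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Standard, vendor-neutral resource attributes describe where this telemetry
# comes from; a compatible backend can key analysis and cost reporting off them.
resource = Resource.create({
    "service.name": "checkout",            # semantic-convention keys,
    "service.version": "1.4.2",            # placeholder values
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
```

Swap the console exporter for an OTLP exporter and the same code ships data to any backend that speaks the protocol - no proprietary agent required.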
The world that gave us agent-based, per-host monitoring isn’t the world we live in today. Systems aren’t a collection of virtual machines or servers in racks any more; they’re a multi-faceted mesh of clients and servers scattered across public and private networks. The agents that exist in this world will be smarter, more lightweight, and more efficient. Instead of doing all the work themselves, they’ll exist more as stream processors, intelligently filtering and sampling data to reduce overhead, save on network bandwidth, and control storage and processing costs.
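That stream-processing role is often played by something like the OpenTelemetry Collector sitting between your services and your backend, but the same idea shows up in-process as well. As a small sketch, assuming a 10% sampling rate is acceptable for your traffic, head sampling with the OpenTelemetry Python SDK looks roughly like this:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10% of new traces, but honor the caller's sampling decision
# so distributed traces stay intact end to end.
sampler = ParentBased(root=TraceIdRatioBased(0.1))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```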
I’ll Be Watching You
One thing I try to keep in mind when I look to the future is that there are millions of professional developers in the world who don’t even use today’s state of the art. The amount of digital ink spilled over endless arguments about observability - what it is and isn’t, and the best way to use tracing to monitor a million microservices - pales in comparison to the number of people who are stuck with immature logging frameworks and a Nagios instance. I don’t say this to demean or diminish those developers or their work - in fact, I think they exemplify the importance of getting this next generation right.
If observability becomes a built-in feature, rather than an add-on, then it becomes possible to embrace and extend that observability and incorporate it into how we build, test, deploy, and run software. We’ll be able to better understand these complex systems, reduce downtime, and create more reliable applications. At the end of the day, that gives us more time doing things we want to do, and less time tracking down needles in haystacks. Who could ask for anything more?
About the Author
Austin Parker is the Principal Developer Advocate at Lightstep, and has been creating problems with computers for most of his life. He’s a maintainer of the OpenTelemetry project, the host of several podcasts, organizer of Deserted Island DevOps, infrequent Twitch streamer, conference speaker, and more. When he’s not working, you can find him posting on Twitter, cooking, and parenting. His most recent book is Distributed Tracing in Practice, published by O’Reilly Media.