Transcript
Yen: Welcome to Observability and the Development Process. Today, we're going to be talking about how it's not just for ops anymore. Before I begin, I'm Christine [Yen]. I also work for a company called Honeycomb. While we are a vendor, and there will be some screenshots from our product in this talk because it is the case study that I understand well, I like to say up front that the ideas and techniques I'll be talking about should be applicable to a wider variety of tools than simply our own. Consider yourself disclaimed.
Dev & Ops
As I mentioned, I'm Christine [Yen], and I'm a developer. When I was young, many years ago, I was bright-eyed and bushy-tailed, and one of the things I took a lot of pride in is that I could move fast. I could write some code, I would do the responsible testing pieces, and I would commit, merge to master, and move on to the next thing. I could do it over and over and over again. I somehow made it a couple of years before interacting with my first real ops person.
I was working on a new component. This is in 2012, back when dev and ops were still two very distinct things. Because I was working on this new component and she was kind of my counterpart, the little development cycle I'd gotten so proud of churning through quickly started to look a little bit more like this. I started to learn why certain unflattering stereotypes around certain roles in engineering exist. She would say things like, "No code is good. The only code that's good is code you delete." Which would, of course, make me get defensive and say things like, "It worked when I ran the code. It must be your machines, not my problem." Once I got over my indignation, and she got over her frustration, it started to become clear to me that maybe these small, safe changes that I thought did work on my machine are not so small and safe when they hit production.
There's a great Medium post published at the beginning of this year by an engineering leader at Expedia where he analyzed several hundred production incidents. He combed through all the reports to identify triggers, not root causes, but triggers, the things that caused the incident to surface. Change is the largest trigger here – sometimes a change in configuration – but what he found is that it was largely changes to code. It makes sense. On a macro scale, the world is also changing. We are moving away from the simpler world of LAMP stacks: a single database, an app server. You've got your pets. We're moving towards a world where there are more moving parts, more surface area, more ways for all these individual pieces to interact in fun and exciting and unexpected ways. Even if we wanted to cling to the way that we, humans, worked around software, the thing that we had in 2012 with the very clear ops and dev split, we just wouldn't be able to. There's just too much surface area. Instead, we should be thinking about our systems and looking at our systems in a new way.
You're all here at QCon. A lot of you know this already. Hopefully, a lot of you are feeling this already and living this already. Maybe some of you in the audience are already trying to bridge this gap. Maybe lots of you, I imagine, are already practicing DevOps. Maybe some of you are putting your developers on call so that they have to be responsible for their code when it hits production rather than just saying, "It works on my machine." If the first wave of DevOps was all about getting ops folks to automate their work to enable infra as code and config as code, and containers, and microservices, and orchestration, and some of the new flexibility and complexity, then the second wave has to be pulling back a little bit in the other direction, where we enable and empower developers to really understand what's happening in production, because we're the ones writing the code and causing the change in the first place.
Underneath all of this has to be a shared sense of software ownership. It's about caring about the outcome, from development all the way through the end-user experience. Where that begins is visibility. Whether you're fighting fires or being proactive, it's about understanding what is actually happening and why, and building that awareness of what happens when you take code that worked on your machine and set it free in production.
Terminology
Now, a little bit of terminology. There are many folks out there who say, "Observability, it's not anything new. It's three different types of data. You just need these three types and then you're set." Conveniently, they're usually there to sell you those three types. I like to think of observability instead as, when you look at the word, there's an "ability" in there, an ability to do something. Wikipedia has a very official definition that mechanical engineers have been using for a long time. In the software world, I like to think of it instead as, "What is my software doing and why?", being able to ask this question of, "I can see this, let's look deeper, let's understand." I think it's a valuable term to distinguish it from monitoring.
Monitoring, in a way, is the world as it was. "LAMP stack, we can start to predict the ways that it will misbehave. We can start to predict what good behavior looks like, what bad behavior looks like, the signals." You have known unknowns. Put it in the box, watch it. If you put [inaudible 00:06:27] outside the box, you fire an alert and wake up your on-call engineer.
Observability, on the other hand, is recognizing that the system is no longer understandable, certainly not by an individual, certainly not something that you can page into your head and define by strict thresholds. It's, instead, something where you have to be able to dig and investigate and look at the whole picture and then zoom in from a different angle. This is the last digression.
What if, instead of this pattern, knowing what we know about ownership and visibility, we take what we learned from testing, where we are able to incorporate a feedback loop into our development process and observe each step along the way? By observing production earlier and earlier in the development process, devs will build up that ops sensibility and will be able to make sure that the code they are shipping is incrementally more ready for prod. That's how we bridge this gap. Because observability is more than just three types of data, more than just a toolchain. It's about the impact that those tools have on your humans, how we understand our systems, how we build our software, and the processes that result. What does this all mean? Let's get into some examples.
Make Haunted Graveyards Less Scary
This is a phrase that I like, that I stole from our developer advocate, Liz Fong-Jones, where she describes observability and this visibility into your systems as something that allows you to make the haunted graveyards, the code that no one understands, less scary. CJ Silverio, who is now at Eaze, sent out some tweets the other day. She'd recently gotten her team to prioritize observability. They did a lot of work instrumenting a legacy monolith. Suddenly, she realized what it was like to be able to see. Once they were finally able to see into this legacy monolith that no one understood, especially because they were trying to refactor it, update it, and minimize downtime, all the things we can relate to, the main impact was that it became less terrifying to touch. By embracing observability, we can quickly understand what real-world production workloads look like and get a whole lot more comfortable in production.
Why am I talking about this from the perspective of a developer? Because I don't want this to be just about knowledge transfer. I don't want this to be just about, "Share the responsibility, set up some workshops, uplevel your developers." Developers should care about this because understanding production, understanding how to think those two steps ahead, will make them better at writing code.
When you look at a modern software development process, we have all these steps. We are so responsible these days. We have so much curiosity where, at each step, we ask, "Ok, is this code I'm writing going to work?" We have all these different stages that test in progressively more complex environments. If you think about it, this last step is testing too. That's trying to figure out, on the ops side of the wall, whether what we expect is actually happening.
As a developer writes good preproduction tests, they get really good at thinking through various scenarios, describing how the code should behave, and investigating when it deviates from these assumptions. The more practice we get at this, the better our code gets. We know this. It's been proven. I'm sure there are many science papers about this. It's valuable because you begin to think ahead and identify and then validate these assumptions. It's oddly similar to the sort of thing that ops folks do in prod: "I expected it to look like that, but there's a weird arm in my graph. Why?"
Dev & Prod
Ultimately, there doesn't have to be this tension between the folks who are writing the code and the people who are responsible for it, even if that's the same person at different points in the day. Both sides are just trying to do the same thing in their respective domains. As we start to embrace software ownership, this idea of owning your code all the way through, the idea that there should be separate domains starts to break down as well. Instead of developers hanging out in development and ops hanging out in prod, we need to build the muscles to make it so that developers start to feel like prod is part of the development process, like it's an extension, just the next step after continuous integration. That bridge is still going to be observability. It's still going to be that way of checking: expected, actual.
What does this look like? We saw this flow before. It's all about feedback loops. That's how we learn. That's how we get that expected versus actual. That's how we figure out when there's a difference and go back and tweak the code. Where can we see some feedback loops where we learn from production and have that information feed into the decisions we're making at that point in time? When we're deciding what code to write in the first place. That has to come from somewhere: there's a problem we need to fix that's happening in production, there's a thing we need to improve that has impacted people in production. When we're deciding even how to write that code: what cases we need to cover, what the edge cases are, what the baseline is, what things we can just completely ignore. You can use what's normal in production for that as well. And this last bit: whether the code actually works. The ops person from 2012 would be laughing the hardest at this point, at the idea that we could figure out whether our code actually works without actually running it in prod, without actually testing it, without actually learning from what we've seen.
What does it look like to use data from production to decide what code to write? When we develop locally, we're relying on signals from our code, log lines, our IDE, all sorts of things that say, "Your code is actually doing this." Because you're ultimately trying to answer this question of, "Why is my code deviating? Why is it doing this thing? It's supposed to spit out a picture of a monkey. It's spitting out a picture of a panda. That's not what I wanted." The value of being able to quickly get those answers from production data is that we don't have to guess at what might be most valuable, what might be most impactful.
I mentioned CJ at Eaze earlier. They have a great story. Their legacy monolith had caused them lots of pain: lots of downtime, very slow, a bad customer experience. Once they got this visibility, they were able to very quickly figure out which endpoints were the most painful for their customers. Sounds like a pretty standard APM case, actually: "Ok, latency by endpoint, I get it." What they needed, because this was a very custom beast, was a level of flexibility around the things that mattered in their application that they weren't able to get from traditional APM; they needed customizability.
If anyone caught Ben Sigelman's talk yesterday about how we're all moving towards deep systems, this is a perfect example of that. They needed to understand what was going on at a level beyond surface-level APM visibility. Once they had this ranked list of user pain, they were able to say, "Ok, we have a limited number of engineers, we're going to start from the top and be confident that this is actually the highest-impact thing we can be working on."
What about when we're deciding how to write the code? We're doing something, and we need to understand the potential impact of the change we're making, especially if it should have a direct, obvious impact on users. By making sure that we know what is really happening out there, we can figure out whether this code that we're pushing out, which should be a fix, is actually a fix. I love this section because it turns the events that our software emits essentially into debug statements in prod. You can use this even when you've got a local branch and a solution, something not fully baked that you're working on, where you want assumptions validated.
Here's a story from very early on in our lifecycle, when there was this ticket. At Honeycomb, we ingest events, we unfurl the JSON payload, and ship it into our [inaudible 00:16:47]. This ticket just said, "For any nested JSON objects, unroll them." We were excited, and 2012 me wanted to be like, "Yes, I'm just going to do this. I'm going to ship it out. It'll be great." Someone else said, "Hold on. Let's actually make sure that this is a good thing for everyone. Let's find out who it impacts and also maybe see what the production impact on our service would be. Is this going to impact every payload? Is it going to impact very few of them, only the person who's asking?"
So we threw some instrumentation in place while we were actually working on the code and logic to do that unrolling: we measured who was sending nested JSON, true or false. We were able to see that only a handful of datasets were sending us nested JSON, but they were a huge portion of our traffic, basically about half of it. Arbitrarily changing how nested JSON payloads were handled might have resulted in a nasty surprise for them and an unexpected load on our systems as well. By having a flow where it's lightweight and natural for software engineers to think, "Let me just add this bit, it'll go out in the next deploy, and I can learn while I go through code review or while I talk this through," we can make more informed decisions and deliver better experiences to our customers.
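As a concrete illustration, a minimal sketch of that kind of "measure before you build" instrumentation might look like this in Go; the field name and the plain map standing in for an event are invented for the example, not Honeycomb's actual code.

```go
// A minimal sketch of "measure before you build" instrumentation, using only
// the standard library. The event map stands in for whatever structured-event
// or wide-log helper your service already has; the field name
// "payload.has_nested_json" is made up for illustration.
package main

import (
	"encoding/json"
	"fmt"
)

// hasNestedJSON reports whether any top-level value in the payload is itself
// an object -- the question we wanted production to answer before writing the
// unrolling logic.
func hasNestedJSON(payload map[string]interface{}) bool {
	for _, v := range payload {
		if _, ok := v.(map[string]interface{}); ok {
			return true
		}
	}
	return false
}

func handleIngest(raw []byte, event map[string]interface{}) error {
	var payload map[string]interface{}
	if err := json.Unmarshal(raw, &payload); err != nil {
		return err
	}
	// The only "feature" shipped in this deploy: a boolean on every ingest
	// event, so we can graph what fraction of traffic would be affected.
	event["payload.has_nested_json"] = hasNestedJSON(payload)
	// ... the existing ingest path continues unchanged ...
	return nil
}

func main() {
	event := map[string]interface{}{"dataset": "example"}
	_ = handleIngest([]byte(`{"a": 1, "b": {"c": 2}}`), event)
	fmt.Println(event["payload.has_nested_json"]) // true
}
```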
What about this last piece? What about deciding whether the code works or not? We saw in the beginning we are now in a world where there are so many moving parts. We have lots of different customers, heterogeneous traffic. We can't predict anymore how everything might go wrong. We can't encode in our preproduction tests everything that we might want to test. Ultimately, we have to think of production as a separate testbed to be able to run these experiments, check hypotheses, but do so in a safe way that lets us immediately understand the output of that experiment. We love feature flags, and they pair really well with tools that are able to get very fine-grained visibility into what happens when you turn your feature flag on or off.
I've got another quick story to illustrate this. Our storage engine is probably the most performance-sensitive part of our system, and so we are very careful about rolling out changes to it. Even after we've done all of the careful design and careful testing, some things you simply can't simulate. This is a very old screenshot, a very old graph, but back in the old days, we had this one change that we knew was going to impact some datasets, some customers, more than others. We'd done all the responsible things, and we shipped it out to production behind a flag. Using this feature flag let us very carefully define what segment of traffic we wanted to send to this new code, which should not have had a performance impact. It did.
We got that feedback loop immediately: we could see, the moment we changed that flag, what the impact was, and start to dig into why. If you're using feature flags but don't have the ability to say, "I'm going to actively split this graph by whether the feature flag is on or off, whether the feature flag is at value A, B, or C," you're essentially flying blind. You're just flipping switches and you're not able to close that loop. To enable this grouping by feature flag, internally, in our datasets, we send basically the values of all the feature flags so that someone can come in after the fact and slice and dice however they want.
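A sketch of what that looks like in practice: copy the current feature flag values onto every event you emit, so any graph can later be grouped or filtered by flag state. The flag map and the "flag." field prefix here are assumptions for illustration, not a specific flag client's API.

```go
// Sketch: record the state of every feature flag on every event we emit, so
// any graph can later be grouped or filtered by flag value. The flags map is
// a stand-in for whatever flag client and telemetry library you actually use.
package main

import "fmt"

type Event map[string]interface{}

// annotateWithFlags copies the current flag evaluations onto the event under
// a "flag." prefix, e.g. "flag.new_storage_path" = true.
func annotateWithFlags(ev Event, flags map[string]interface{}) {
	for name, value := range flags {
		ev["flag."+name] = value
	}
}

func main() {
	ev := Event{"endpoint": "/1/events", "duration_ms": 12.4}
	flags := map[string]interface{}{
		"new_storage_path": true, // the risky change, behind a flag
		"compression":      "zstd",
	}
	annotateWithFlags(ev, flags)
	fmt.Printf("%v\n", ev)
	// In the UI you can now break down latency by flag.new_storage_path and
	// see the before/after the moment the flag changes.
}
```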
Usually, at this point, someone is sitting in the audience, going, "All this is old news. Are there really dev teams out there that write code without thinking about what will happen in production? I don't believe that." Are there really people that don't validate their assumptions? Yes, there are. If you are one of those people who are skeptical, you have made excellent choices in life and somehow have avoided many mistakes that befall modern dev teams. Even the teams who are doing this, or who think they are, are they experimenting in a lightweight way? Do they feel free to run these experiments in production? Because this is hard. Baking these changes into your processes or tools when you've been accustomed to dev-only workflows for years and years, these are new muscles to gain. Production doesn't feel like the dev environments we're used to. Sometimes trying to interpret those graphs on the ops side of the slide can feel like learning a new language.
Make Prod Feel More Like Dev
One trick that works wonders for helping one language feel like another, or making production feel more like dev, is to disguise it. I want to talk a little bit about what I mean when I say make prod feel more like dev. One of the things that was very clear to me is that monitoring tools make assumptions. Also, monitoring tools spend a lot of time with nouns like these. Back in 2012, one of the most stressful interactions with this ops person happened when she came over and was like, "Christine, CPU is way up on half of the Cassandra cluster. What did you do?" "God, I don't know. I'm sorry, I was just writing my code and shipping it." In this case, I got very good at looking at their graphs and scrolling through and pretending like I could understand what it meant when one graph did this and then the other graph did that, and how that mapped back to my code, my tests, and something that was within my control.
One of the things that helped is that when we started to introduce terms that I was more comfortable with, terms that showed up in my tests, it became a lot clearer how I could play a part and how I could fix the situation. Compare that initial outburst with, "Christine, API latency is way up on our most expensive endpoint for our biggest customer since our latest build, what did you do?" Ok, now I know when it happened, how important it is, who it's probably for, and maybe some characteristics of that customer's workload, because these are things that I need to know for the business. When you challenge your tools to allow you to customize the nouns that matter to you, your business, your tests, you create that connection layer that enables devs to feel like these tools are theirs too and that there's some action that can be taken instead of just scrolling through, hoping the ops folks will figure it out and leave you alone.
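To ground this, here is a rough Go sketch of a request handler that records those kinds of nouns, endpoint, customer ID, build ID, on a per-request event; the field names, the header, and the log-line "send" are all stand-ins for whatever your own telemetry setup looks like, not any particular vendor's API.

```go
// Sketch: a per-request event enriched with the nouns a developer actually
// thinks in -- endpoint, customer, build -- rather than only host-level
// metrics. Field names and the way the event is "sent" (a log line here) are
// illustrative, not a specific product's schema.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

var buildID = "2019-11-07-abc123" // typically injected at build time, e.g. via -ldflags

func instrument(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next(w, r)
		event := map[string]interface{}{
			"endpoint":    r.URL.Path,
			"customer_id": r.Header.Get("X-Customer-ID"), // high cardinality, and that's fine
			"build_id":    buildID,
			"duration_ms": float64(time.Since(start).Microseconds()) / 1000.0,
		}
		line, _ := json.Marshal(event)
		log.Println(string(line)) // stand-in for sending to your event store
	}
}

func main() {
	http.HandleFunc("/1/events", instrument(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusAccepted)
	}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```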
Tools also need to be able to work with the shape of the data that I need. AWS availability zone: unless you're unlucky, you have a small handful of possible values there. If you're looking at customer ID, on the other hand, ideally the more values you have, the better you're doing. Customer ID is a high cardinality value; there are many possible values for that particular field. The thing is, as a developer, at any given time, I might care about one of these, or I might not even know which one I care about. I just need to know which customer is the most affected or which customer is taking my MySQL cluster down.
One of my favorite stealth high cardinality values, one that has incredible value to developers, is build ID. This is a graph that shows volume of requests over time broken down by build. Pause: it is nice to think that deploys are instantaneous. We all know that they are not. This graph makes it very obvious which services are serving one build and which are serving another. For the purposes of this, there are even markers that show the duration of the deploy, spanning that crossover.
Another example, and I talked a lot about proof: this is another case where we rolled out a build that should not have had an impact on performance, and it did. Once we broke that down by build ID, we could pinpoint, "Ok, it's this one." None of this trying to align timestamps, none of this trying to figure out, "It happened around 2:30, but then a build went out and got delivered around 2:15. Was it that one? I don't know." Here, there are no questions. We can pinpoint it, go revert that commit, and move on.
Again, this dog photo is the closest thing I could come up with for iteration. These tools should make it easy for me to add instrumentation as an evolution of my software development process. Instrumentation should end up feeling the same way documentation and tests do. It should evolve alongside the code; it should not have to be some enormous one-time effort; it should be incremental. This means flexible schemas and no gatekeepers. We've talked to folks where, for a developer to generate a new graph, they had to file a ticket with their ops team. Very sad. It's a giant barrier to making it feel like a natural part of my workflow.
Sharing patterns where possible. Here's some nonsense code on the left and an imaginary trace on the right. You can imagine how traces may be a lot more exciting to map to: "I understand this. I understand how to use this to understand how my code is actually behaving," especially when you're perhaps not as familiar with looking at a high-level aggregate view of what your system is doing. This is just visualizing a heatmap of durations, what things are taking how long. We can say, "Show me an example of a request that is a lot slower than normal," and a developer can click through and immediately see, "This is what my code looked like. This is what ran; this is what I executed a whole bunch of times in a row, in series. Maybe it should, maybe it shouldn't."
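To give a sense of how little machinery is behind that kind of view, here is a toy, hand-rolled span sketch in Go; a real service would use an actual tracing library, and the trace field names here just follow common conventions rather than a specific product's schema.

```go
// Sketch: a hand-rolled "span" emitted as a structured event, enough to make
// the serial, repeated work in a request jump out of a trace waterfall.
// The emit function and parent naming are toys for illustration only.
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

func emit(fields map[string]interface{}) {
	b, _ := json.Marshal(fields)
	fmt.Println(string(b))
}

// withSpan times fn and emits one structured event describing it.
func withSpan(traceID, parentName, name string, fn func()) {
	start := time.Now()
	fn()
	emit(map[string]interface{}{
		"trace.trace_id": traceID,
		"trace.parent":   parentName, // a real tracer would use span IDs here
		"name":           name,
		"duration_ms":    float64(time.Since(start).Microseconds()) / 1000.0,
	})
}

func main() {
	traceID := "req-42"
	withSpan(traceID, "", "handle_request", func() {
		// The "executed a whole bunch of times in series" pattern that is
		// obvious in a trace view and invisible in an aggregate graph:
		for i := 0; i < 5; i++ {
			withSpan(traceID, "handle_request", "fetch_row", func() {
				time.Sleep(10 * time.Millisecond)
			})
		}
	})
}
```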
I showed this to a developer once, a couple of years ago, and they said, "Tracing feels like Chrome developer tools but for non-browser software." In a way, that's the perfect description of this whole exercise, which is to make these things that feel like ops tools, or that we think of as ops tools, feel like things that developers deal with every day. The nicest part is building these bridges, finding ways to jump from one mindset to the other and back until it doesn't feel like you're in ops world or dev world anymore. Understanding, "Ok, I'm looking at this one execution. I understand how to map it back to my code," and also having a way to understand its impact on the larger system. Because we can generate a better understanding of the behavior of our system if we can relate our piece to the whole.
Instrumentation doesn't have to be some giant effort. You can go from unstructured text logs, add a little bit of structure, then go fully structured, and each of these steps allows you to think critically about what data in production might be useful to developers, data they're already using in dev. Extract the business identifiers, the entities that you already throw into your logs without thinking, and ask, "Can we throw these into our structured logs in production, also without thinking?" Because this is what lets us start to think analytically about our machine data, leveraging graphs to pick out patterns and identify trends, rather than trying to scroll through pages of text. If you think about it, tracing is what happens when logs grow up. You add a little bit of structure, you group things together, and suddenly you have a much more powerful experience that feels native to developers but gets you the high-level analytical insight that you need from an ops perspective.
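Here is a small sketch of that progression, using invented field names: the same piece of information as an unstructured log line, as a structured event, and as a structured event with trace identifiers attached.

```go
// Sketch of the progression described above: unstructured line, structured
// event, and a structured event that has "grown up" into a span. All field
// names and values are illustrative.
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Stage 1: unstructured -- grep-able, not graph-able.
	fmt.Println("charged customer 4711 $12.50 for order 9001 in 231ms")

	// Stage 2: structured -- now we can sum, group, and break down.
	structured := map[string]interface{}{
		"msg": "charged customer", "customer_id": 4711,
		"order_id": 9001, "amount_usd": 12.50, "duration_ms": 231,
	}

	// Stage 3: add trace identifiers and it's a span -- the same data, plus
	// enough structure to see it in the context of the whole request.
	span := map[string]interface{}{
		"trace.trace_id": "req-42", "trace.parent": "handle_checkout",
		"name": "charge_customer", "customer_id": 4711,
		"order_id": 9001, "amount_usd": 12.50, "duration_ms": 231,
	}

	for _, ev := range []map[string]interface{}{structured, span} {
		b, _ := json.Marshal(ev)
		fmt.Println(string(b))
	}
}
```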
Prod, Part of the Dev Process
We talked a little bit about what observability in the development process might look like. That was a little bit of a red herring. I want to talk about what those stages and feedback loops might look like when we embrace the high-cardinality, high-dimensionality nouns that we know matter. In this way, devs can start to feel comfortable in production and iterate on the experience.
This initial stage, "Why is my code deviating from expectations?", often happens during incident response. I'm going to tell a story from an outage in May. This is an excerpt from a blog post. While the incident initially manifested in traditional ops signals, health, system metrics, we ended up tracing it back to the logical layer, where a key insight was finding out how heavily a single cache key contributed to the problem. This is the sort of visual, analytical insight that is only possible when you allow tools to speak dev and to have those high cardinality values that map closely to the code, which in turn is only made possible by everything we've just talked about. This is what we've benefited from the most.
In theory, product managers play a big part here, thinking through use cases, weighing customers, figuring out what is important to do. The fact that you are in this room likely means you're interested in keeping control of that. Honestly, I think of observability as a way to enable that conversation between product and engineering in a way that wasn't really possible before. Here, this is a heatmap of event ingest delay: how long it was between when a customer sent us data and when it hit our API.
This yellow section made me curious: are people sending us data from the future? This is a delay, and it's negative, so something weird is going on. I was able to find out, down here, that this yellow bar tells me one customer is sending us 84% of our future timestamps. Weird. Why? What are they using it for? Does it matter? Do I need to worry about this? What is the impact on our storage engine? All of these things start to come into play when I would never have expected lots of future timestamps. This is, again, the sort of immediately actionable insight that is only possible when our tools can keep up and when we bake this into the habit of, "Man, maybe I should check. Maybe I should look at this before I write code that makes assumptions about which cases I need to care about."
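For illustration, the field behind a heatmap like this could be computed with something as simple as the following sketch; the function and field names are made up, and the real ingest path is obviously more involved.

```go
// Sketch: the "event ingest delay" behind a heatmap like this one -- the gap
// between the timestamp a customer put on their event and the moment it
// reached the API. A negative value is the "data from the future" case.
package main

import (
	"fmt"
	"time"
)

func ingestDelayMS(eventTimestamp, receivedAt time.Time) float64 {
	return float64(receivedAt.Sub(eventTimestamp).Milliseconds())
}

func main() {
	now := time.Now()
	fromThePast := now.Add(-3 * time.Second)
	fromTheFuture := now.Add(45 * time.Second) // clock skew? batching bug? worth asking

	fmt.Println(ingestDelayMS(fromThePast, now))   // ~3000: normal
	fmt.Println(ingestDelayMS(fromTheFuture, now)) // ~-45000: negative -- investigate
}
```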
Testing in production. Production is a great testbed for hypotheses. This is a story about our friends at Geckoboard, who were working on a new feature that essentially reduces to a bin packing problem. They paused and said, "We know some of our engineers would love to take this opportunity to spend a week coming up with the perfect design for what we think will work." Instead, they said, "Hang on, we can measure this. We can test this and see." They were way ahead of their public release, so they wrote up three different solutions and pushed them all to prod. Production traffic ran through them, they measured the results, and then they threw the code away. No impact on the actual users, nothing visible, but they were able to see how well each of these three experiments ran in production, picked the one that performed best, and moved on in a day.
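A rough sketch of that pattern, as I understand it, might look like the following; the candidate functions, field names, and emit hook are all invented for illustration, and a real experiment would plug into the live request path rather than a toy main.

```go
// Sketch: run several candidate implementations against real production
// inputs, measure each one, and return nothing new to the user. The
// candidates here are placeholders, not real bin packing algorithms.
package main

import (
	"fmt"
	"time"
)

type packer func(items []int) int // returns e.g. the number of bins used

func shadowCompare(items []int, candidates map[string]packer, emit func(map[string]interface{})) {
	for name, pack := range candidates {
		start := time.Now()
		bins := pack(items)
		emit(map[string]interface{}{
			"experiment":  "bin_packing",
			"candidate":   name,
			"bins_used":   bins,
			"duration_ms": float64(time.Since(start).Microseconds()) / 1000.0,
		})
	}
	// Nothing is returned: the user-visible response is produced by the
	// existing code path, untouched.
}

func main() {
	naive := func(items []int) int { return len(items) }          // one bin per item
	firstFit := func(items []int) int { return (len(items)+1)/2 } // placeholder heuristic

	shadowCompare([]int{3, 5, 2, 7}, map[string]packer{
		"naive":     naive,
		"first_fit": firstFit,
	}, func(ev map[string]interface{}) { fmt.Println(ev) })
}
```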
This is the sort of thing that can happen when developers feel comfortable experimenting. They feel comfortable adding that instrumentation, watching it deploy, running the experiment, and pulling the instrumentation back out. These are cultural changes, and they open up what's possible, all without an ops person having to yell.
The whole thing here is we're trying to go from this to this. The things that we're dealing with day to day, depending on which end of the spectrum you lean toward, may differ, but in the end, the globe is shrinking. We need to figure out how to speak each other's language.
Takeaways
Instrumentation and observability should not just be checkboxes at the end of a dev cycle. They should be embedded into each stage and continually checked to keep us grounded, because that's what builds up the awareness that allows us to be responsible for our code in production. When capturing this data is lightweight, it can feel like testing does in development: we make better choices, and we become better developers as we go. We need to be able to ask these new questions of our systems so that we can ship more confidently, knowing, "Great, I've thought this thing through. I know what this looks like. I know what my code will face out there."
On the ops side, the assumptions that we've made about production don't have to apply anymore. Can we bridge that gap? We have a shared goal now of enabling our developers to share the load. As it turns out, making observability more accessible, more flexible, caring about the human factors, whether you're paging people too much, all of these things make devs more inclined to stick their heads in and take some ownership to help out with instrumentation.
Questions and Answers
Participant 1: My question is on the observability part, where you mentioned that, for developers, it's ok to have debugging and testing in production itself. Don't you think it'll slow things down, because you are putting in a lot of unwanted code just for the sake of testing it in production? Are there implications? The second part is, when you are testing in production, you are literally tossing your [inaudible 00:38:49] implications, because there will be implications. Is there any advice you have with respect to timing, or with respect to creating a dataset in production, or something which, practically, as a developer, I can think of implementing?
Yen: The first part, I understand, is how we minimize test code piling up as cruft in production. The second part I didn't quite catch.
Participant 1: The second part is, when you are trying to test in production, there are various implications [inaudible 00:39:24].
Yen: Are there any hardware implications to testing in production?
Participant 1: Yes, because you are testing [inaudible 00:39:31].
Yen: I think this is true with feature flags as well. There's some hygiene involved in code that is delivered to test a hypothesis versus to provide value to a user. I think testing in production is something that happens on the way to delivering for customers. Ideally, you are cleaning up as you go. From our experience, there does not tend to be a ton of additional code written for the purposes of testing. It's more like, this is something that is in progress, or this is a little bit of instrumentation that ends up being potentially useful in the future. Hardware implications of testing in production: that implies that something is generating additional load, which is not my intent. When we say testing in production, there's already a whole lot of traffic, a whole lot of use cases, flowing through your production system. Take advantage of that. It's not saying generate more. Folks are already using your system; see if there's a way that you can measure what they're doing, or measure something, so that you can answer a question about what that production environment is really like.
Participant 2: I'm a little bit confused. You showed how you were a few weeks before the public release and you were able to test three different scenarios in production, but I guess you didn't really have production traffic at the time.
Yen: We're talking about the Geckoboard story with the bin packing.
Participant 2: Yes.
Yen: There was production traffic. It was a feature that had not been released, and so they were able to, again, look at what their current customers were sending and use that as a proxy for what they'll see when they turn this feature on.
Participant 2: You just briefly put it in production, look at how it performs, and...
Yen: Yes, but it doesn't even have to be the whole feature. This is like the nested JSON example. We didn't actually have to write the code to unfurl nested JSON. We just had to add a thing that said, "Does this payload contain nested JSON, yes or no?" In that way, we're utilizing the same flow of traffic that's already going through the system. We're measuring something that can be useful for deciding whether to go through with actually submitting this PR or not.
Participant 2: I have another quick question. I usually work in heavily regulated industries where the devs cannot even look at customer data. I mean, they can, but it involves tons of paperwork. Any advice for this type of environment?
Yen: Ultimately, this is contingent on having some flexibility in how you're creating these feedback loops between production and development. If your developers are walled off from being able to create that feedback loop, then those policies are preventing the developers from making sure that the code that they write is the highest quality. There are ways to improve this without just saying those are bad policies, but getting as much visibility into what's really happening is key. Maybe it's better test data. Maybe testing in production is just off the table, but what a shame.
Participant 3: I'm just wondering, in terms of the future, do you have a vision as to these roles, more of a role takeover, like a developer or an SRE actually becoming more full-cycle, where these positions actually start to merge into a single entity?
Yen: I suspect some people might say that it already has. I think that, with any organization, as the organization grows larger, we'll also have to specialize. I think that's very dependent on how those engineering teams want to run their organizations. Certainly, we're seeing a trend of folks who have an SRE rotation and volunteer devs joining in, and organizations thinking really hard about how to make that accessible, how to share knowledge, build up an institutional understanding of how to deal with certain situations because they see the benefit of spreading the skillset across the org. I wouldn't be surprised if someday there were just these different engineers focused around a single service or focused around some subset of the system.