
Panel: Observability and Understandability


Summary

Jason Yee, John Egan, and Ben Sigelman discuss their approaches and preferred methods to get impactful results in incident management, distributed tracing, and chaos engineering.

Bio

Jason Yee is director of advocacy at Gremlin. John Egan is CEO and co-founder at Kintaba. Ben Sigelman is a co-founder & CEO at Lightstep.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Butow: This panel brings together innovative leaders from the worlds of incident management, OpenTelemetry, and large-scale distributed systems to discuss and share their current thoughts on the state of observability and understandability.

You'll be hearing from technical experts who've actually been in the trenches. They've achieved incredible results, and they've learned a ton along the way. The speakers on this panel didn't just do the work, they also created entire movements to bring thousands of people along with them to achieve and celebrate success in these areas, too.

Introduction

Yee: I'm Jason Yee. I work at Gremlin as the Director of Advocacy, which means I get the amazing pleasure of working with Tammy on a day-to-day basis. I've given talks on everything from monitoring and observability, to chaos engineering, to innovation, which is my passion: how do we help engineers be better, create the next big thing, or just improve their lives?

Sigelman: My name is Ben Sigelman. I'm one of the co-founders and the CEO at Lightstep.

Egan: I'm John Egan. I'm CEO and co-founder at Kintaba. We're an incident management platform that tries to help companies implement best practices for incident management across the entire organization, as opposed to just within their SRE teams. I gave a talk about why it's important for companies to increase the overall number of incidents they track internally as a top-level metric, as compared to some of the more standard MTT* metrics.

How Different Companies Manage Incidents

Butow: Previously, when I worked at Dropbox, I worked with a lot of folks who were from Google and YouTube, and also a lot of folks who were from Facebook. It was really interesting to hear their different viewpoints on how they did things. Ben, you came from Google. John, you came from Facebook. Have you noticed that yourselves, that there's a different view at Google versus Facebook on how to, for example, manage incidents? You both talked about SLOs and SLIs. What are your thoughts there?

Egan: Having not worked at Google, I won't try to speak directly to how things operate there. What I thought worked really well at Facebook was its ability to embed the incident process across the entire organization, such that if you're an employee at Facebook and something's happening deep within the engineering team, deep within the infrastructure team or otherwise, you always have immediate and open visibility into everything from the top-level information on that incident all the way down to the specific details of exactly what's being done. It had this really powerful effect: everyone in the company knew what a SEV1 was. Everyone knew how to respect that and stay off people's backs when things were being worked on as high priority. The cultural impact of that was really strong. It meant everyone, from the sales team to the PR team and beyond, responded really well to the occurrence of these incidents. As a product manager there, where I primarily operated, that was really impactful for me as a non-engineer, in that the way we responded to incidents started to look similar to the way that engineers and SREs, groups of people who'd been doing this for much longer than us, would approach things, and we would match that. I thought it was a really powerful indicator across the company of shared cultural values flowing up from engineering. One of the reasons I pushed back on things like MTTR is that those top-level cultural impacts, I thought, had a more visible long-term resilience effect on the organization than the deep, in-incident response metrics themselves.

Butow: Yes, taking the bigger picture.

Sigelman: I never worked at Facebook, so I'm not going to speak for the Facebook side of it. I would agree, Tammy, there are some cultural differences between the way that SRE is practiced in those two organizations. For Google, it's funny, they did a lot of things right. I think it was really cool and progressive of them to say, we're not going to call this Ops, we're going to call this site reliability engineering. I think they coined that term; I believe they did, anyway. It seemed very new when I heard it, this idea of treating it as a real practice. It was, I think, really amazing that they did that. They went so far as to create their own organization for Site Reliability Engineering, parallel to the product and engineering organization. I'm not sure that was such a good idea. They ended up creating their own org chart. When there's 10 people, it's fine. When you have 1,000 people in the organization, it does inevitably become somewhat political. There just wasn't that much political incentive to be reasonable.

I do remember a meeting I had once where I was talking to a very senior leader in the SRE organization about Monarch, which was our distributed, multi-tenant time-series database, like a metrics system. We were trying to figure out what the SLA should be for Monarch. I just asked him, "What availability requirement do you need?" He said, "I need 100% availability." I was like, "That's not reasonable. Can you tell me how many nines? Can you just give me something, anything not 100?" He said, "No, I need 100% reliability." It wasn't reasonable, but he had no reason to be reasonable politically. I think having SRE as a completely segregated organization allowed him to be completely unreasonable. He was.

As a result, you can go look at the Monarch paper: it's awfully reliable. It also consumes, steady state, 250,000 VMs dedicated to Monarch. Just think about how expensive that is. It is extremely expensive. That was not, in my mind, the right decision. Instead of shooting for four nines, we probably should have shot for three nines, and then had some other thing for the super-high-availability use cases. It's just the political situation: it wasn't his P&L, and he didn't care. He just asked for an unreasonable amount of reliability, and here we are. I think there is a lesson there. SRE is an amazing thing, but there needs to be a check and a balance on reliability as a goal. It needs to be balanced against other concerns like product velocity and cost. I don't think that Google set things up to succeed in that regard. Not to say that everyone is political, of course, but it was an issue.
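For rough context on the gap between those targets (a worked example, not a figure from the talk): each additional nine cuts the allowed downtime by a factor of ten. For a 30-day month (43,200 minutes), the error budgets work out to:

```latex
\begin{align*}
\text{three nines } (99.9\%)  &: (1 - 0.999)  \times 43200 \ \text{min} = 43.2 \ \text{min of downtime per month} \\
\text{four nines }  (99.99\%) &: (1 - 0.9999) \times 43200 \ \text{min} \approx 4.3 \ \text{min of downtime per month}
\end{align*}
```

Chasing that last factor of ten is typically where much of the extra cost comes from.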

Butow: Yes, it's definitely an interesting conversation. This came up also in your talk, Ben, about cost: the cost associated with observability and with being able to understand systems and what's happening. Amazon is famous for being very frugal; they're always trying to reduce costs and keep costs down. It's one of the cultural items they really focus on. It's just interesting that different companies do things different ways and have different goals they strive towards. I think that's great, to be able to learn from how people do things.

How Hobbies outside Work Help with Observing and Understanding Systems

Something else that I'm really interested in is how hobbies outside of work have helped you observe and understand systems. I always think this is really interesting, because our day-to-day job isn't the only thing we learn from. What else do you learn from?

Yee: That's one of the things that I love about being an engineer. The question was about learning, and I think that's a big reason why there's such a movement around this idea of bringing your whole self to work: we learn from all these other things that we do. As an example of what we deal with, we always talk about complex systems. You can't look for single root causes, because we're in a complex system. Everybody has this implicit understanding: yes, it's a complex system. Then, oftentimes, we treat it like it's not. One of the pandemic hobbies that I picked up was chocolate making, bean-to-bar chocolate making. I buy raw cacao beans from an importer who goes around the world getting cacao. I roast them in my oven, shell them, and then make the chocolate. It is simultaneously one of the most fun, enjoyable things that I've done and learned how to do, and also extremely frustrating, because when you make chocolate, there's a process of tempering. The chocolate that you buy in the store doesn't melt in your hands because it's tempered, and it has that nice snap when you break it. That's a process of crystallization, and it's extremely sensitive to things like the smallest changes in humidity in your room. You can do almost everything exactly the same as yesterday, when it came out perfect, and because the humidity is slightly off, it completely fails.

I think that's something I've started to realize as we deal with the systems we work on: we inherently understand that they're complex, but I don't think we realize how complex until we're talking about things like automated deployments. We did two dozen deployments yesterday, and what worked in the morning may not work an hour from now, because you just had 10 different deployments going on. I love bringing in that stuff, because I think it helps us make better sense of the things we work on.

Butow: Yes, definitely. It also helps explain things to folks who don't work in our world. If they ask how come things just don't work perfectly every single day when they worked great yesterday, you can take those different ideas and share them to explain it.

Sigelman: Honestly, the only thing that's really coming to mind, but it is coming to mind, is parenting. This has been an ultra-intense parenting year, and it is a learning thing: how do you help your kids learn and experience the world when their world is so much narrower than it usually is? I think we've tried to do what we can. It's interesting, when things finally started opening up, we took my 3-year-old to an aquarium, and we thought it would be so amazing. He used to love the aquarium in the before times. He had his mask on and everything, and he was initially excited to get there. Then after about 20 minutes, he said, I want to go home. It was too much learning, too much stimulus. It was very interesting. It has me thinking a lot about learning in terms of the amount and type of stimulation that we receive, and how, if you suddenly make some giant change on that front, it can be very overwhelming for people. I don't know if that makes any sense. It's certainly not related to observability. It's something I've been thinking about a lot this year, as everyone's been so narrowed in on whatever it is that we do. It's much more focused than normal.

Butow: I think it definitely makes sense in the world of understandability as well, because sometimes just being overloaded with stimulation can be really confusing. It's nice to have a simple way to understand things, especially when you're working in very complicated distributed environments. With not only systems but also people involved, it becomes very hard.

Egan: I think work just permeates everything that I do these days. In some way or another, I feel like I look at the entire world through the lens of incident response and incident management. The big thing that usually comes out for me is immediate prioritization, in ways that my wife and my son definitely don't like. When we're dealing with an issue, you go, is this really a SEV1? Let's be careful here. I have a 3-year-old, and so it's similar in the parenting world: we're trying to deal with something, we're trying to go out the door, and we really don't want to put our mask on, or we really don't want to put our shoes on. There's just screaming that you could hear down the street and around the corner. It's like, let's take a moment, and let's look at the data. Let's figure out whether the impact here actually justifies the degree of response we're putting into it. It's really funny, because 5 or 10 years ago, a lot of people who were really big junkies in the task management space had a very similar approach to the world, where it was like, task management is really just to-do lists, and to-do lists work everywhere: I'm using Asana at home to run my family. I think a lot of people tried to go out and do that for a little while.

I've ended up doing the same thing with incident management, where mentally, in my head, I have this whole incident graph, and I want to put something up on the refrigerator that's like, ok, how many SEV1s did we have last week? Let's talk about them. Are we going to write anything up? Are we going to work on it? The reason I think it permeates similarly to the task management space is that there's some truth to it. You can try to apply a lot of the blameless approaches to dealing with major incidents in the workplace, the ones we've absorbed into products and best practices for things like incident management, but when you try to absorb those into your day-to-day life, you really realize how aggressively human nature fights against a lot of our learnings. We really don't want to 'be blameless' in our day-to-day lives. It's very difficult to step back and look at the things that went wrong, whatever they are, from not having our shoes on when we're going outside, all the way out to something bigger, like forgetting to pay a bill. The very natural and easy reaction is always to just say, that's your fault, or that's my fault, or that's this person's fault.

I think the big learning that I've taken away from a lot of this, and from working in this space more directly over the last couple of years, has really been this idea of looking for the systemic cause. If you address that systemic cause, you don't alleviate yourself of accountability, you're still accountable for the thing that happened, but you're actually fixing the thing that caused you to act in that way. It's very much like what Sidney Dekker says in one of his books: most people don't come to work to do a bad job every day. If you go into every conversation about incidents with that in mind, you'll come out stronger. I think it's very much the same in personal life and parenting. I think it's the same in cycling. I think it's the same in all of these different hobbies and things that I do: nobody is doing what they're doing because they want to do a bad job. Everyone's here to try and do well and have a good time and live well. That's the way it permeated inwards, and it helps me stay cognizant of focusing on that aspect of day-to-day life and hobbies.

Butow: It makes me think about my own life as well. My husband is a game designer. I come from being on-call for many years, looking after distributed systems and making sure they're up and running. He's more about making sure that things are fun and enjoyable, a really awesome experience. That's what he does, and he's been doing it for a long time as well. It's funny, because I'll focus on what we need to prioritize and what we need to get done, and then he'll balance it out and make sure that even if it is super high priority and we need to get it done within the next hour, it's still fun and enjoyable. I think it's cool how you can balance the two things together.

Systems Mirror the Organizations That Build Them

We do have a question about games as well. In the board game space, there is an ongoing debate about balancing the setting of a game against the systems or mechanics implemented. Often, game systems set in a colonial period bring up conversations about conflicts between the observations of a system and its mechanics, and the themes abstracted from the system's setting. Are there problematic themes or assumptions birthed from the abstractions built with modern-day tooling for services?

Yee: I don't know that I would say there are necessarily problems. I think that the way we think about breaking things into services largely affects what we build. You can think of Conway's law, which Fred Brooks popularized: the systems that we build mirror the organizations that build them. If you decide that you've got a giant organization and you're going to break it into different teams, those teams suddenly become your services. The way that those services connect mirrors the way that your teams connect. That can be problematic. I think the problems come when we presuppose that we're going to be in a certain framework, and that things have to be a certain way. In my own experience, this often shows up as the database team: your team doesn't own the data, it always has to go talk to the database team, because we've naturally set it up so there's a team that manages the database, and they're the keepers of all the data. Naturally, then, you start to have this abstraction of, I don't control the data that my service needs, so now I have to think about an abstraction layer for how I talk with that database team. Yes, the way that we set things up naturally impacts the way we think about things, and it can create some friction.

Sigelman: This is in some ways exactly what I'm getting at with this idea of resources versus transactions. We want to be able to talk about the health of individual transactions; that's the point. That's what our customers care about. You can almost see these movements occurring in our technology ecosystem where we're trying to make the resources align better to the transactions. In some ways, that's what things like Kubernetes are all about, like the idea of a service. It can get more abstract: with serverless, it's literally just a function call that we're abstracting. The trouble is that you can't get away from the fact that, at the end of the day, there are physical resources being consumed, and those crop up in reality. It turns out that serverless is a wonderful abstraction, but the cost of making a network call is still way higher, probably a thousand to a million times higher depending on what network you're on, than accessing something on local disk.

You run up against the reality that as you try to make the schema you're using to understand and abstract your system as friendly as possible, you drift further from the physical reality of your resources. Those, unfortunately, dictate the performance and cost profile of your system. I think this push and pull has been happening for decades and will continue for decades. It's a great way to think about the talk I was giving around resources and how they relate to transactions in the first place.

Egan: I think with a lot of the abstractions we see today, when we build abstractions out, we're really trying to compartmentalize complexity. We've got complex systems, then we build another system above them that's simpler, that houses that complex system, and another system above that, and so on. There's the law of stretched systems, which Rasmussen was talking about back in the '80s, where every time you do that, you're just going to stretch the new system until it breaks. Then you build another abstraction, you stretch that system until it breaks, and you keep building up. I've always found that the biggest risk around those abstractions is that you stop working to optimize the system you abstracted, in many cases, because now you have to start focusing on the breaking points of the new system you've built on top of it. I think this translates really cleanly to tools. In my field, incident management, it translates directly to the increase in this butterfly effect, where one small thing done now propagates across these abstractions and down, and actually increases by an order of magnitude each time it works its way down.

I think it was in the '50s or so, when some of this was being researched, that people talked about the difference between the mistake of pushing the wrong key on a typewriter and pushing the wrong key on a computer. The impact of a mistake at the typewriter key is very limited and very small, because you're in that base system. Now that you've abstracted so many levels of software and tooling all the way up in the computer world, when you've clicked a key, you might have accidentally just brought down an entire data center. I was saying the other day, we need child locks on keyboards now that everyone's working at home, because the risk is so high. I think there's a huge risk there as we build tooling and continue to abstract out, where trying to understand the software we build today is now equally as complex as the infrastructure we've gone out and built up. Maybe that's one of the reasons observability is such a big deal now. I just think it's fascinating that that problem is going to continue to propagate. I don't think there's an end to it.

How to Interview for Observability Skills

Butow: How do you think that we should interview for observability skills? What are your thoughts there?

Sigelman: As a practitioner, you mean, so interviewing someone who would lead like an observability team or something like that? Is that the idea?

Butow: Yes.

Sigelman: Of course, we interview people all the time at Lightstep. That's a type of observability interview, but not the same type of observability interview. I think what I'd really want to see would be a focus on the outcomes of observability and the ability to articulate how those connect to the practice of it. In my head, there's a hierarchy of observability. At the very bottom layer, you have code instrumentation. Then above that, you have telemetry egress, then telemetry storage, then data integration, then maybe UI/UX, and then above that, automation. The conversation is way at the bottom of that stack right now, and that's not where the value is. Observability is only valuable if it accelerates your velocity or improves your performance against an SLO. Those are the only things that matter. If you can't actually describe how all that stuff at the bottom affects the stuff at the top, I'm not that excited.

Of course, there's no one true path, and there are a lot of open questions. I'd want to see someone be able to navigate that, not just talk about the goals (yes, we want faster velocity) but explain how you actually achieve that with observability, describing it in practical terms and going all the way up and down that stack. That's what I would really be looking for. You have people who talk in hyperbole about the top of the stack, and people who get lost in the weeds at the bottom of the stack, but it's rare to find someone who can actually connect those dots in a really clear and coherent way.

Butow: Those are some great tips for folks who might be doing interviews around observability in the future. It makes me think that if you're able to draw a diagram of that pyramid and talk through each of the items as you're doing your interview, it's a really great way to explain your thinking and the frameworks you use to the person you're interviewing with.

Interviewing for Incident Management Understandability

I'd love to ask the same question, but about understandability with regard to incident management in general. What would you interview for there? What are you looking for?

Egan: I think when we're talking to people about understandability with respect to this space, we're really thinking a lot about translating active and emotional cases into practical, systemic solutions. There's a little bit of EQ involved when you're interviewing around that space, because it's the ability to walk the line between the technical root cause analysis that produces the system-level task changes the organization needs in order to improve its state, and the cultural and process changes that need to be implemented. The airline industry has gotten really wonderful at this over the years: how do you train people in incident management regarding the response, but more importantly, regarding the learnings and understandability on the other end of it, in such a way that the learnings and reasoning propagate out into the industry and the organization in a practical manner? That's the best way I can describe it. You're really looking for an ability to translate things into consumable and actionable results, not just to go and write novels as the outcome. I think we look really closely at that when we talk to people about how to hire into this space, when you're looking for your incident commanders, or the IMOCs we had at Facebook. There's an interesting difference there between Facebook and Google that I think you can dig really deeply into and have pretty deep conversations with folks about. If you don't have both, you won't actually solve the long-term resiliency challenges that you're trying to work through.

Butow: I know a lot of folks have probably written up postmortems. The next big step is, what are we going to change after this incident happened and we've written up this postmortem? What are our action items? How are we going to make sure those get done? What did we actually do? Then afterwards, did those changes actually help us improve? It becomes much more of a complete circle instead of just ending at the postmortem. That's a great thing for folks to think about. I'm going to circle back as well to talk more about IMOCs and that incident manager role. I think that's a really interesting conversation for us to have as well.

Interviewing for Someone Who Understands How to Observe and Understand Failure

You have a lot of experience in chaos engineering. What would you interview for when you're looking for someone who understands how to observe and understand failure?

Yee: I think it's a combo of what both Ben and John said. Previously, when I was working at Datadog, I saw that observability in and of itself isn't the end goal. One of the challenges when I'm looking for people to hire is that understanding, having a clear idea of how you fit into the system. What's the ultimate end goal? What's your contribution to it? Because if you can't define that, then you have no direction in what you're doing. Having that overall understanding means that I can trust you, given whatever incident happens, to know what to prioritize and how to act on it. I think that's a big thing, even when it comes to chaos engineering. We talk about reliability, but it's a pretty big blanket. What do we make reliable? How do we focus in on the things we need to be improving, especially with chaos engineering? Because it's a fun thing, but you can go around breaking anything, have no impact on the system, and improve nothing, and you'll still have a lot of fun doing it. It comes down to, where do I fit in the system? How do we look at a system and get the most benefit from what we're doing? The rest of it, the chaos engineering and the tools, that's pretty straightforward.

Butow: It actually makes me think of a saying that I think is from Facebook. I used to always hear, focus on impact. Is that a Facebook thing? Yes.

Egan: That was definitely a red letter poster, one of the big ones on the walls. Yes.

Butow: I used to always hear that at Dropbox when we were talking about incidents, postmortems, and doing chaos engineering. It was always, focus on impact. Make sure that what you're doing is actually having a positive impact and is going to help move the needle, we always would say. I really like that idea.

Incident Manager On-Call (IMOC)

Now let's circle back to the idea of an Incident Manager On-Call, an IMOC. That's what we called it at Dropbox. I know at Amazon they've also called it a call leader. I didn't actually know that IMOC originally comes from Facebook. What's it called at Google? Is it the same or is it different?

Sigelman: At Google, despite there being many SREs, there were different terms in use on different teams. I also left the company in 2012. The term incident commander I definitely heard in my time at Google, but I'm not sure it was totally standardized. The Search SRE team behaved and operated in a totally different way than the Gmail SRE team, and I can go into the whys of that; I think it was all for good reasons. I don't think it was as standardized as it was at Facebook, in my experience anyway.

Butow: Would you like to tell us about that, what's it like at Facebook, having an IMOC?

Egan: Just to set some context for folks listening, there are a couple of different terms for what could arguably be called the leader during an incident. One of the more popular ones, which you see on a lot of organizations' sites, is the incident commander, which Google defined pretty directly in their SRE book in 2016. I'm not sure if that evolved over time, or if maybe that's just the public version they wanted to put out. At Facebook, we practiced a slightly different role called the IMOC, the Incident Manager On-Call, who we internally joked was the responsible adult in the room. The core difference between those roles, at least as I saw it, was that the incident commander role was a very hierarchical, almost militaristic role. If you go look at the definitions, and these vary on every company's site, but for example, PagerDuty has a set of definitions of how an incident commander should operate, they're very aggressive. It's very much: incident commander asks engineer responder one, is the problem solved? If engineer responder one does not respond with a yes or no, remove engineer number one from the call, insert engineer number two. It's a very aggressive, almost micromanaging approach to dealing with an incident.

The IMOC, I always felt, lent itself a little bit better to multiple leadership roles, where you had an IMOC who was very well trained in incident management processes, who can help make sure things like the leveling of the SEV are correct, and who can help reach into and across the organization to folks who maybe aren't otherwise contactable. The advantage of that is it permits the incident itself to also have a separate role that you might call the owner. That owner doesn't necessarily have to be someone who's perfectly trained; it's simply the person who's currently responsible for the incident as it's being resolved. It's a little bit more of a team effort. In an ideal world, the IMOC might do nothing other than be present and aware of the incident happening, while the owner is actually going out and doing the work. In a more complex world, the IMOC might be coming in and helping reach across the org, if it's a PR-impacting issue, or if it crosses teams and the owner is a new engineer.

I've actually found, running Kintaba, that the incident commander role is far more widely adopted across organizations today; if you were to take the entirety of organizations in the Valley and compare what they call that role, I think incident commander has taken root much more aggressively. I wish it would start to swing the other way, because I find the IMOC model to be more about distributing accountability across the company to the various owners, versus just saying, ok, everything comes back onto the incident commander's shoulders. Not to phrase that as if that's what happens at Google; that's just what I see it having translated into in a lot of other parts of the industry.

Butow: I've definitely seen lots of companies taking up this role. I myself was an IMOC at Dropbox for a long time as well. Yes, you're responsible for every single system. You have to know about everything. It's a pretty big job. Also, it's a 24/7 rotation, which is interesting too.

Fair Retention Time for Tracing

Butow: What are your thoughts on a fair retention time for tracing? Do you have a quick answer there?

Sigelman: The question itself said that the value of tracing data falls over time, since older traces are less useful for identifying problems or possible improvements. Yes, that is true. I really want trace sampling, which is what we're talking about here, to be based less on properties of the request itself, which is how things are done right now, and more on hypotheses that are being examined or other things that are of interest, like an SLO, for instance. If a trace passes through an SLO, we need to be able to hydrate that SLO with enough traces and data to be able to do real comparisons between an incident that's affecting the SLO and a baseline time when things are healthy. Even in healthy times, having a large volume of tracing data is quite useful for those sorts of comparisons. I want the sampling to be based on things like SLOs or CI/CD: when we do a new release, we want to up the trace sampling rate. It's not just about time, it's about whether or not those traces are involved in some change, planned or unplanned, that's of interest.
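As a rough sketch of that idea, the decision of whether to keep a trace could be driven by whether it touches a service whose SLO is currently burning error budget, or a service that was just deployed, rather than by properties of the request alone. The snippet below is only an illustration of that policy; the names (SLOState, Deployment, decide_sample_rate) are hypothetical and not from any particular tracing library.

```python
import random
import time
from dataclasses import dataclass

# Hypothetical signals that would come from SLO tooling and the CI/CD pipeline.
@dataclass
class SLOState:
    service: str
    error_budget_burn_rate: float  # > 1.0 means the budget is burning too fast


@dataclass
class Deployment:
    service: str
    deployed_at: float  # unix timestamp of the most recent release


BASE_RATE = 0.01              # steady-state sampling probability
BOOSTED_RATE = 0.5            # probability when a trace looks "interesting"
DEPLOY_WINDOW_SECS = 30 * 60  # keep the boost for 30 minutes after a release


def decide_sample_rate(services_on_trace, slo_states, deployments, now=None):
    """Pick a sampling probability based on what the trace touched, not on the request itself."""
    now = now if now is not None else time.time()
    for svc in services_on_trace:
        slo = slo_states.get(svc)
        if slo is not None and slo.error_budget_burn_rate > 1.0:
            return BOOSTED_RATE  # trace passes through an SLO that is in trouble
        dep = deployments.get(svc)
        if dep is not None and now - dep.deployed_at < DEPLOY_WINDOW_SECS:
            return BOOSTED_RATE  # trace touches a freshly deployed service
    return BASE_RATE


def should_sample(services_on_trace, slo_states, deployments):
    return random.random() < decide_sample_rate(services_on_trace, slo_states, deployments)


if __name__ == "__main__":
    # checkout is burning error budget, so traces through it are kept at the boosted rate
    slos = {"checkout": SLOState("checkout", error_budget_burn_rate=2.5)}
    deploys = {"search": Deployment("search", deployed_at=time.time() - 60)}
    print(should_sample(["frontend", "checkout"], slos, deploys))
```

In practice, a policy like this would usually be applied as tail-based sampling, so the decision can be made after the trace is complete and the set of services it touched is known.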

Something Everyone at Every Company Should Do

Butow: What's something that you wish everyone did at all companies?

I actually wish every company paid people extra for on-call over the holidays, if they work that shift. Like if you work New Year's Eve, you should get paid extra for doing that on-call shift.

Egan: I'll be selfish in the incident space. I think companies should directly reward people who file the most incidents. I don't care if it's gamed, I still think it should be done.

Butow: I think that's great. Reward and recognition is important.

Yee: Companies should spend more time and give employees time to actually explore. We think of chaos engineering largely as reliability, but I like to think of it as just exploring and learning. I think that that's something that engineers need more of.

Sigelman: One of our values internally is being a multiplier, and it's a difficult thing to reward people for, though it can be measured in certain ways. I think if a service is written well and thoughtfully, and its SLOs are determined well and thoughtfully, the service will contribute a smaller portion of the ultimate user-visible outages, and that's a multiplier for the product and for the rest of the organization. I would love to see people rewarded for doing more than their fair share for the overall reliability of the system, because it's a hard thing to notice when it's happening. It's very important, I think, overall.

 


 

Recorded at:

Jan 27, 2022
