
An Incident Story: Tips for How Staff+ Engineers Can Impact Incidents


Summary

Erin Doyle discusses her experience with a critical 3-day-long incident, how she missed a key opportunity to help prevent it, and how Staff+ Engineers can influence a culture that prevents similar situations.

Bio

Erin Doyle is a Staff Platform Engineer @Lob. For the 20+ years prior, she worked as a Full Stack Engineer with a focus on the Front-End. She is also an instructor for Egghead.io and has given talks and workshops focused on building the best and most accessible experiences for users and developers.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Doyle: Welcome to tips for how staff-plus engineers can impact incidents: an incident story. I'm Erin Doyle. I'm currently a staff engineer on our platform team at Lob. Prior to that, I was a full stack web developer with more of a focus on the frontend. I'm also an instructor for egghead, with courses on web accessibility. I'm here to tell you a story of one of the worst incidents I've been a part of in my career, and the lessons I learned from that experience. I'm going to talk about how I believe you, as a staff-plus engineer or technical leader, can improve many aspects of the incident process: from helping prevent many of the root causes of incidents in the first place, to managing the incident response better, to improving the quality of the post-mortem process and determining action items to prevent similar issues going forward. We can't prevent all incidents. No matter what we do, incidents are going to happen. We can prevent some of them. For those we can't prevent, we can reduce the time to resolve and learn from the experience.

A Real-Life Incident

A long time ago, in a galaxy far away, an incident occurred that took three days to resolve. Let me briefly provide some context around the cause of the incident without divulging all of the gory details. At the company where this incident occurred, the infrastructure was managed via code using Terraform. The platform team had paved the road for product development teams to be responsible for writing the Terraform to manage the infrastructure for their applications. They could ask the platform team for assistance with this if they needed it. Otherwise, they were responsible for knowing what infrastructure they needed and how to write the Terraform code to manage those resources. The platform team would then be responsible for reviewing those changes through code reviews, or PRs. Once the changes to the Terraform were approved in the PR, the author was responsible for applying those changes from their computer. At that time, we didn't have an automated or centralized system for running Terraform that would create, update, or destroy resources in our cloud environments. This reduced the transparency and traceability around infrastructure changes. We couldn't really see exactly what changes were applied by whom, and when. That really complicated things. The nature of the incident was a change to this Terraform code around a critically important piece of our infrastructure.

The change enabled automation that would mark certain data objects as expired and transition those data objects to a soft-deleted state. After 24 hours, it would soft delete these data objects. If we hadn't caught it and made the necessary fixes in time, the automation would then start to hard delete those data objects after another 24 hours. It was pretty critical, since it took a day before we started to see any problems, and unfortunately, the issue initially presented itself in a way that was customer facing. Then, during that initial stage of triaging, the fix that we put in to stop the bleeding of that customer impact ended up inadvertently creating a secondary incident. Over the course of three days, it took senior and staff level engineers across almost all of our teams working together to restore almost every part of our platform. All of this stemmed from one fairly small code change. It was really a confluence of issues or misses that, small and likely inconsequential on their own, when combined allowed this change to slip through.
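To make the shape of that change a little more concrete, here is a minimal, purely hypothetical Terraform sketch of that kind of lifecycle automation, assuming AWS S3 with versioning enabled; the talk never names the cloud provider or the actual resource, so the bucket, rule ID, and day counts below are illustrative only.

# Hypothetical illustration only -- not the actual change from the incident.
# On a versioned bucket, "expiration" places a delete marker on current objects
# (a soft delete), and "noncurrent_version_expiration" later removes the old
# versions permanently (a hard delete).
resource "aws_s3_bucket_lifecycle_configuration" "data_objects" {
  bucket = aws_s3_bucket.data_objects.id  # assumed bucket, defined elsewhere

  rule {
    id     = "expire-data-objects"
    status = "Enabled"

    filter {}  # applies the rule to every object in the bucket

    expiration {
      days = 1  # current versions get delete markers after roughly 24 hours
    }

    noncurrent_version_expiration {
      noncurrent_days = 1  # old versions are hard deleted about 24 hours after that
    }
  }
}

Read quickly, a small rule like this can look like routine housekeeping, which is exactly why the layered defenses described next matter so much.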

The Swiss Cheese Model

This scenario could be described by the Swiss cheese model. I first learned about the Swiss cheese model when I was working at the Kennedy Space Center for the shuttle program, where safety was paramount in everything we did. The Swiss cheese model is a metaphor often used in risk analysis or root cause analysis to demonstrate the need for a multi-layered system approach to security or safety. When applying this model to software incidents, each slice of cheese represents a layer of our defenses against causing an incident. Some of these layers could be testing: does the change do what was intended? Does the change cause any regressions? Code review: is there anything the author might have missed? Do others understand the change? Does the change meet our coding standards? Do others agree and approve of the change? Or deployment verification: once the change has been deployed, is it achieving the desired result without regressions? As can happen with human processes, our defenses aren't always perfect. There can be holes. There can be things that get missed or aren't perfectly executed, reducing the effectiveness of that particular layer of defense. Hence the Swiss cheese in the metaphor, with the holes in the cheese representing holes in our defense layers. This is why a multi-layer defense is so important. If one of the layers has a hole, then a subsequent layer will likely still be able to prevent the potential incident. The scenario this model warns about, though, is when the randomly sized and placed holes of each layer of Swiss cheese happen to line up, allowing the root cause to pass through every defense layer without being caught and resulting in an incident.

Contributing Factors

When we performed a root cause analysis of the incident, we identified a number of holes in our defense layers. When reflecting on what we could have done differently, or what could be done going forward to prevent similar misses, we found there's really a lot of opportunity for technical leaders to have an impact on improving the environment that gave rise to these holes. Using our Swiss cheese model, let's look at the holes we had in our defense layers. For testing, the change wasn't tested in a pre-production environment first to verify it worked as intended. The change was approved in the code review without any questions or discussion. The change wasn't verified after it had been deployed to production to make sure, again, it was working as expected and there weren't any unintended consequences. If we look at the root causes of the secondary incident that we created while troubleshooting the first: we did test the change, but with a very narrow focus. We verified that it did what we were expecting, but we didn't perform a thorough regression test to make sure there were no unintended results. We did perform a code review, but the reviewers were from a narrow scope of teams. We missed the impact that the change would have on a downstream part of our platform. Then we did verify the change after it was deployed, but again with that narrow scope, and without additional eyes from other teams involved. We missed the negative impact the change had until later in the day when customers started to see it.

Incident Management Process

I want to briefly go over the process we were using for incident management at this company. Once an incident is discovered, someone uses our incident automation tool to start the incident response. The automation tool creates a channel on Slack for the incident and a Zoom link for the war room, and posts a notification on our general engineering channel letting everyone know what is going on. The incident is assigned a severity level between 1 and 4, with 1 being the most severe. Then someone will volunteer to take on the role of the incident commander. If warranted, a Statuspage is kept updated for the duration of the incident. Once the incident is resolved, we hold a post-mortem debrief meeting to discuss the incident, share lessons learned, and identify any action items for improvement. The output of that debrief meeting is a document recording all those details, the timeline of the incident, and then any takeaways, learnings, action items, and follow-ups from the debrief meeting.

Incident Response

As I mentioned briefly before, the problematic change was merged and deployed. Then 24 hours later, we discovered the incident. Throughout the duration of this incident, we continued to make missteps that ended up impeding our progress such that it took those three days to fully resolve. We were alerted to this issue by one of our monitors, which alerts us when 500 errors returned from our API exceed a certain threshold. At first all we knew was that we were encountering an elevated rate of unhandled errors and that we were returning those errors to our customers. It was early morning for the Eastern Time Zone, and being a remote company distributed across time zones, we didn't have a lot of people online yet at that point in the day. The few of us who were online were from different teams, and we self-volunteered to jump in and help triage the issue. We were disorganized, each following up on different leads, trying to both assess the impact and find the root cause of the errors. Through the rapid messaging on Slack, there was enough evidence on the surface that probably a lot of customers were getting these errors. We hadn't fully identified what percentage of customers taking which actions were impacted, but we had a perception that the rate was high, and that immediately raised the pressure and stress level of the effort.
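As a rough illustration of the kind of monitor mentioned above, here is a minimal Terraform sketch of a threshold alert on API 5xx responses, assuming AWS CloudWatch with an Application Load Balancer in front of the API; the talk doesn't say which monitoring stack or threshold was actually used, so every name, dimension, and number here is hypothetical.

# Hypothetical sketch of a "too many 5xx responses" alert -- not the actual monitor.
resource "aws_cloudwatch_metric_alarm" "api_5xx_errors" {
  alarm_name          = "api-elevated-5xx-errors"   # illustrative name
  alarm_description   = "Alert when 5xx responses returned by the API exceed a threshold"
  namespace           = "AWS/ApplicationELB"        # assumes an ALB fronting the API
  metric_name         = "HTTPCode_Target_5XX_Count"
  dimensions = {
    LoadBalancer = aws_lb.api.arn_suffix            # assumed load balancer resource
  }
  statistic           = "Sum"
  period              = 300                         # evaluate 5-minute windows
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 100                         # arbitrary example threshold
  alarm_actions       = [aws_sns_topic.oncall.arn]  # assumed paging/notification topic
}

An alert like this tells you that something is wrong, but, as the rest of the story shows, not what is wrong or why.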

As soon as we declared the incident to be a severity 1 and updated our Statuspage to communicate the outage, those of us troubleshooting the incident immediately put on blinders. We pivoted to rushing to come up with a solution to stop the bleeding rather than continuing to focus on discovering the root of the issue and understanding its full impacts. This is where we introduced the secondary incident. The fix we put into place to stop the bleeding ended up having downstream impacts that those of us who were online at the time didn't anticipate. Only after deploying this stop-the-bleeding fix did we really put a concerted focus into identifying the cause of the errors we were seeing. While we were all busy looking into that, we missed the impact of the fix, which was now affecting customers in a new and different way from the original 500 errors. As the day went on, we ended up with three different groups, all looking into three different issues stemming from that same root cause. We were also heads-down. Without anyone looking across the big picture, it took longer than it should have to correlate and tie it all together. On this day, we had the unfortunate coincidence of our observability platform provider having an all-day outage as well. That dramatically reduced our ability to research the issue.

By the end of the first day, fatigue had started to wear on the incident team. The second incident had been identified, and there were now multiple workstreams running in parallel, split up to resolve both the initial and secondary incident root causes and to figure out a solution to restore the massive scale of data in the fastest and safest way possible. People were still all working voluntarily, self-organizing into workstreams and efforts, and as burnout and exhaustion started to ramp up, people started to drop out. Some would take breaks without providing clear expectations as to when they'd be back online. As each time zone reached end of day, different people would decide to either call it a day and log off, or stay on and keep working extra hours. Now it was totally unclear which deliverables were being worked by whom, when they could be expected to be done, when the next check-in might be, or when someone whose work was blocking someone else would be back online. There were times when a particular person, skill, or piece of domain knowledge was needed, but no one was coordinating or pulling in those resources. People were working out of their comfort zone with little sleep and a lot of stress. Everything took longer, and tensions got higher as the incident stretched on. The happy ending to the story is that at the end of that third day we did finally get all the critical issues impacting customers resolved. We were able to restore all of the data with no irretrievable losses.

Post-Mortem

We very quickly pivoted to focusing on the post-mortem for the incident, to make sure we thoroughly captured all of the feedback and lessons learned from everyone involved. We were seriously motivated to make the right improvements to prevent similar incidents from happening in the future, but just as much to improve our incident management process itself. Incidents will always happen, but if we can improve our process around managing them, we can improve our efficiency and effectiveness in resolving them. There's a lot of opportunity for technical leaders to have an impact on the post-mortem process and on improving the quality of the outcomes. We use an incident management automation tool, and we have a template to autogenerate the post-mortem document. We've added to the template over time. We've got guidelines and a format for conducting the post-mortem debrief meeting. We even start each post-mortem debrief meeting by reading the prime directive from Agile retrospectives to remind everyone of our blame-free focus. We run a blameless incident process, meaning that we do not search for or assign blame, or even attribute causes to individuals. Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.

Even though we have this template to follow, we noticed a little while after rolling out the incident automation tool that people running these debrief meetings were just reading the template and going through the motions. They were checking the box that they'd held the debrief meeting and written the post-mortem doc, but the quality was low. There was little effort put into the root cause analysis, or into retrospecting over the incident management process. There were few action items, and those that were identified were often forgotten about or not followed through on.

Process vs. Culture

This is an example of where culture can make the difference over process. We had added prompts to the template to try to encourage meaningful conversation and the identification of action items for improvement. Just because something is written down doesn't mean people will be influenced or motivated to do it. When we saw people going through the motions in these meetings, it wasn't because they were being lazy. It was most often because the meetings were being run by engineers who were new to managing and retrospecting over incidents, and they didn't know how to drive those conversations. This is where staff-plus engineers can have influence. Even if you're not running the meeting yourself, you can ask probing questions. You can show curiosity without judgment. You can model and direct that root cause analysis. You can help determine pragmatic solutions for action items. Even if you weren't directly involved in the incident, I strongly encourage you to attend post-mortem debriefs and read the documents so that you can influence the process and help raise the quality of the output. An incident really is a critical opportunity to learn lessons and identify improvements that could be made. Your involvement can help make that happen.

Root Cause Analysis

In the post-mortem debrief meeting, after summarizing the incident details and timeline, it's critical to perform that thorough root cause analysis. Part of your charge is to help identify actions that can be taken to prevent similar incidents from occurring in the future. Just as important as identifying potential solutions is picking the right-fit solution for the problem. We could look at the contributing factors and just throw more process and gates at the problem. To some extent, that might be appropriate. Sometimes, though, more process and gates can just slow things down unnecessarily, especially if they're just band-aiding the symptoms rather than the root of the problem. When we're performing a root cause analysis, we shouldn't stop at that first contributing factor. We should really retrospect on each contributing factor and iterate over asking why until there are no more answers. That's when we've hit the root cause. When you perform this level of analysis, what you often find is that there's something about the environment or culture that contributed to a given hole in our defense layer. Here's the big opportunity for you as a technical leader: if you can help identify what that hole in your culture may be, and determine ways you can influence and improve it, that may be the ideal solution. You can make the environment healthier for everyone and avoid adding process and gates that may just impede progress.

Asking Why

Let's go back to our Swiss cheese model of the incident with that first set of holes in our defenses and dig a little deeper by asking why. I'm going to be identifying areas of our culture where we had room for improvement. I want to be clear that we already had an amazing, supportive, inclusive culture. No environment is perfect or even static. A really great culture can still have blind spots, or areas that are exacerbated by specific circumstances, which was very much the case for us here. This process shows that there's always room to continue to put care and feeding into your culture. It's never perfect and it's never done. It has to be constantly attended to. As technical leaders, you can be the champions for attending to that culture. If we now ask why: why was the change not tested in a pre-production environment first? We could certainly answer this question by saying there was no gate requiring proof that testing had been done. But shouldn't testing be part of the development cycle? Shouldn't that just have been part of the process that the author should have known to follow? We could go down a path here and ask questions about the culture. Why did the author not know that they were expected to test their changes? Maybe we could uncover something missing in our environment that we could improve the understanding of: maybe the importance of testing, why we test, how to test effectively, or the expectations around the quality and quantity of testing that developers should be performing on their changes.

In this case, though, the author knew the importance of testing as a best practice. They didn't test the change, because they didn't know how to put together an appropriate test plan for this change. If we ask why they didn't know how to test this change, it was because it was in an area of the system they were unfamiliar with. We could then ask, why didn't the author ask for help if they were unfamiliar and didn't know how to go about testing this? Unfortunately, we didn't ask the author this question, and therefore we don't have any answer to work with. As you can imagine, this is a tricky question to ask people and requires a lot of trust and vulnerability for someone to answer honestly, which is likely in and of itself a contributing factor. From this, we can speculate that the author may not have been comfortable asking anyone for help.

Trying to dig further into the contributing factor that the author didn't feel comfortable asking for help, without knowing why they felt that way, can be challenging. This is an important aspect of your organization's culture that requires constant care and feeding. As staff engineers, you have a lot of influence here. Any time spent asking how we can make our peers feel more comfortable asking for help, and any effort made towards improving that aspect, is well worth your time. Here we can see that culture could have been another line of defense in our model. We can surmise that if the author had known they should test their changes as a development best practice, and had felt comfortable reaching out to get help to make sure the change was properly tested, the incident possibly could have been prevented.

Moving on to the code review layer of defense, we can ask, why was the change approved without any questions or discussion? Here's another example of where process or gates aren't always sufficient to prevent certain outcomes. We can require a code review, and even a certain number of approvals on that review, but we can't force people to speak up and ask questions in those reviews. This was a bit of a perfect storm of assumptions between the author and the reviewers. First, the author assumed that if there was anything wrong with their code changes, one of the reviewers would point it out. As we've already established, the author didn't test their changes. With that, they were heavily and silently relying on someone else being more familiar with the code and its context to spot anything that may not have been correct. We could dig into this contributing factor and ask, why was the author not explicit about not having tested their changes, and about not being 100% confident in them? Probably the answer is going to be similar to whatever circumstances led them to feel uncomfortable asking for help in the first place. If we go back to the problem and ask the question why again, we'll get additional contributing factors. The reviewer was somewhat new to the company at the time. They understood the code changes, but they didn't understand the context. The changes were in an area of the platform that the reviewer wasn't yet familiar with. They didn't understand the significance. They gave the author the benefit of the doubt and assumed that the author thoroughly knew what their code changes would do and had already tested them to make sure they worked as intended.

Finally, I had reviewed the PR myself. In fact, I may have even been one of the first people to look at it. It's with some vulnerability that I admit to you all that I also didn't ask any questions or speak up at all on that PR. If we iterate over asking why here: why didn't I ask any questions on the PR? I didn't fully understand the code that had been changed, but it felt wrong to me. I knew it was a change to a pretty critical, important piece of our platform, but I didn't fully understand what the change was going to do, or more specifically, what the side effects or second-order downstream changes would be. I had just moved to a new team recently. Even though I was concerned about this code change, my insecurity around admitting to my new team that I didn't understand it prevented me from saying anything. If I had asked the question, if I had had the courage to be vulnerable and show publicly that I didn't know something, maybe that conversation could have helped us prevent this incident. We can now take each of these contributing factors and keep iterating over asking why, and we'll uncover opportunities for improvement all along the way. Again, we could slide another layer of defense, or cheese, into our model that could have prevented this incident before it got through the code review layer. As I touched on earlier, though, the problem of people not feeling comfortable asking for help, or asking questions, or speaking up with concerns, or admitting when they don't know something, is not simple or cut and dried to solve. This is a human psychological issue. You can create the most supportive, inclusive environment, and people may still feel uncomfortable or insecure at times. As leaders, we should always be working to make the environment as inclusive for everyone as we can.

Finally, let's look at that last layer of defense, deployment verification. This depiction is a bit misleading. Really, by the time the change was deployed and we could perform verification on it, the incident had already started. Regardless, if we had been monitoring that change to verify it was working as intended, we would have detected and been able to resolve this incident much faster. We can ask the question, why did the author not verify their changes after they were deployed? I don't have the actual answer to this question either. My guess is that it was a lack of clear expectations, compounded by the author's lack of understanding of how to verify their change in the first place. Here's another opportunity to look at improving our culture to buffer our defenses. It could be baked into the culture that development and testing best practices are frequently shared, and expectations are made clear as to what tasks are the responsibility of the developer or author in order to meet the definition of done. Staff engineers can play a big part in establishing this. They can model this behavior as well as oversee and coach their teammates to follow these best practices.

Retrospective - What Could We Have Done Better?

It would be a major missed opportunity if the post-mortem process didn't also include retrospecting over the incident management process, or response, itself. This is the best time to reflect on what went well and what could be improved upon. Staff-plus engineers can play a major role in improving the incident response. With your experience and your ability to delegate, plan, and look at the big picture, you are the perfect candidate for stepping up and helping drive and direct the incident response to be more efficient and effective. When we retrospected on our response to this incident, here are the big items we identified that we could have done better, and each one had an opportunity for a technical leader to have effected a better outcome. We know pressure causes us to rush and not think clearly or thoroughly. We didn't slow down to finish answering important questions and get the full picture, or wait to get more eyes and perspectives before rolling out more changes to attempt to stop the bleeding. This is both an acute, in-the-moment issue and an overall culture issue. In the moment during the incident response, especially during that initial triage stage, when stress and worry may be at their highest, a technical leader can really step in and set the tone. Communicating with an even-keeled, calm demeanor can help lower the pressure and stress felt by others. It can project a sense of the situation being under control, reduce the fear, and inspire confidence. The words you use and your tone are very important here. You can either help calm people down or you can spin them up. Even if you're nervous, which you're allowed to be, the more you can model a collected, cool-headed attitude and help slow things down, the more you allow people to think clearly.

Even before getting to an incident, there are frequent opportunities for staff-plus engineers to model handling stressful or high-stakes situations with steady assuredness. We can demonstrate that we may not have an answer or know something yet, but we can still slow down and work systematically through the problem. We can approach the situation with confidence that we will figure out the issue and solve the problem. You don't need to be disingenuous about it. You can be open about not knowing something, and that research and triaging will need to take place. That demonstrates even better that it's ok not to know everything all the time. It's ok if you don't know what's wrong, or how to fix it yet. It shows others that we can still approach unknowns, even when stakes are high, with a level head rather than fear. Another change you as a technical leader can influence is to foster a culture of learning and working in public. When you model and encourage others to work and learn in public, it breaks down a lot of the intimidation around not knowing something. It normalizes being open about what you don't know, what you're still learning and figuring out, what you're trying and maybe sometimes failing at, and what your process is. You can deflate the fear of judgment by going first and showing your work.

When stress is high, and people may feel pressured to rush or are fatigued from too many hours of high cognitive strain, that's when we can so easily miss things and make mistakes. The more experience people have with working in public, the better they can work through these situations. This is the time when people need to be showing their work: explaining out loud their steps, their thought processes, and what steps they plan to take next, and getting other people's eyes on all of that. We need more sanity checks and safety nets than ever during these times. If this is a practice we just follow all of the time, because we've been modeling it or mentoring others to follow it, then it becomes second nature, without fear of judgment. We just calmly and humbly start working out loud with statements such as, "Here's what I've done. Here's what I know. Here's what I don't know. Here's what I think I should do next. Is there anything I've missed? Does anyone have any other ideas, or suggestions, or thoughts?" This practice of working out loud provides space for people to speak up, ask questions, and propose ideas. When you ask people for their input, you lower the barrier a bit, and make it feel safer and less intimidating for them to speak up and say something.

Another lesson we learned from the incident response was that either we didn't have an explicitly identified person serving as the incident commander, or, when we did, much of the time that person was also deep into triaging, troubleshooting, and fixing. They were too focused on their own work. This made it very difficult for them to adequately look at the big picture, manage updates and communication with stakeholders, or pull in the right help when it was needed. We also didn't have anyone coordinating schedules, expectations, or deliverables. People were just self-volunteering for things or not, logging on and off whenever they chose to, without explicitly handing off work or communicating their planned schedules. A staff engineer is a great candidate for volunteering to assume the incident commander role. You don't even need to be on the impacted team. In fact, it might even be better if you're not.

An incident needs someone who can keep their head above the tree line and on the big picture. An incident commander shouldn't be someone who's in the weeds, troubleshooting and solutioning. Instead, you can be the one gathering status from those who are working heads-down, and handling the communication with the stakeholders, thus allowing the hands-on folks to stay focused on their work. You can be the one who coordinates and removes blockers. If the hands-on people need something, you can help get them what they need so they can stay on track. You can see if there's a need for a subject matter expert or someone with certain skills to help out, and you can coordinate resources and pull them in. You can make sure there's clear communication and expectations as to who's working on what and what their schedule and plans are, and if there needs to be a handoff, you can make sure that's done thoroughly.

A final note about effective incident command is that the incident commander can also advocate for themselves. If you're the incident commander and you need a break, or you're realizing that you're no longer the best fit for the role at this time, maybe you need to go heads-down, whatever the reason. You can ask for someone else to relieve you and take over command. No one should be hesitant to assume command thinking they'll be stuck with it for the duration. Really, any of the roles and responsibilities throughout the incident should be able to be fluid as the situation changes. We just need to be explicit when those things change. These are all skills and capabilities that fall into your wheelhouse as a staff-plus engineer. Having an effective incident commander can really make or break the success of the incident response.

Finally, even though we were seeing that various things weren't going well all along the way during this incident, no one spoke up. No one wanted to hurt anyone's feelings, or imply that anyone wasn't doing a good job, so we weren't able to make any needed course corrections sooner. I certainly had an opportunity to do something here, and I admittedly fell prey to this myself. I knew everyone was doing their best. We were all working hard under a lot of stress and fatigue. I didn't want anyone to feel that, by my suggesting we do something differently, I was criticizing the work they were doing. I didn't say anything until we got to the post-mortem. What I, or others, could have done is demonstrate that we can humbly and respectfully make observations without blame or judgment, and that we can propose solutions without emotion. Or, if we don't necessarily have a solution to propose, or would rather make solutioning more collaborative, we can ask non-leading questions with detached curiosity to drive discussion amongst the team, to see if any course corrections should be made in real time.

Some examples of questions we could have asked before making any decisions include: are we headed in the right direction? What other resources do we need? What are the downsides of these changes? What areas will this change impact? Or even, explain this to me like I'm 6. Questions we could have asked about the incident response as it was ongoing include: what could we do to improve our incident response right now? What, if anything, could we do to improve communications here? What teams, groups, or subject matter experts are we missing that we need represented here? Are we delegating enough? Are we rotating people in and out enough? Or, what resources can leadership help us with now? If we can foster an environment where people can both give and receive feedback without taking it personally, we can lower the barrier for people to speak up and offer suggestions for improvement. Then we can pivot sooner when we see something isn't working and try something different.

Follow-up and Action Items

When it's time to identify action items and follow-ups, this is an important point where a technical leader can help raise the quality of the post-mortem. It can often happen that, after all the energy of the meeting has gone into the retrospection and the discussion over the incident, people are ready to conclude without spending more energy trying to solution all the findings. This is a time where your leadership can help drive improvement based on the lessons learned from this experience. If we skimp on this last step, or don't follow through, then we miss out on a golden opportunity to improve our environment. This is where you can help improve the productivity of the meeting. Drive the conversation towards the items that were identified that could have been done better. Ask probing questions. Support the discussion. Help keep statements productive, moving towards ideas and solutions rather than judgments, complaints, or tangents. Help keep the meeting moving, on time, and on point. If there isn't enough time to thoroughly determine your action items, then maybe a follow-up meeting needs to be scheduled. Or maybe some high-level action items can be identified that include researching and proposing ideas for discussion and consideration later, rather than having to develop all solutions 100% in this meeting. Then make sure tickets are created in whatever work tracking system you have, so that there's accountability and traceability: the action items are documented and the status of the work is transparent. If you're on the impacted team, you can take ownership of overseeing that these action items are prioritized and worked in a reasonable amount of time.

Finally, a very important part of this process that you as a staff-plus engineer can play a part in is helping drive the solutioning: determining the action items that are the most practical and the right fit. This is where you can analyze the root cause and contributing factors, and brainstorm solutions. Then you can assess those proposed solutions to determine which may be the appropriate fit. If the solution is a process change or the addition of a gate, you can ask questions like: is this solution just addressing a symptom of the root problem? Is this a band-aid on something that is happening as a result of our culture? Will this solution impede us in a way that's a tradeoff we don't feel is acceptable, or is there another option that, while also imperfect, is maybe more acceptable? Can we depend on this solution, or is it something that could be missed, mistaken, or even subverted? If a solution would prevent a given undesired scenario by improving our culture, but has a long runway, maybe we can come up with something else to do in the short term that will buy us enough time to work on that long-term culture change.

Key Takeaways

You can't actually prevent all incidents from occurring. With improvements to your engineering culture, though, you can reduce the number of incidents overall, and dramatically improve the time to resolve the incidents you can't prevent. You can model, foster, and encourage that culture. You can serve as an incident commander and have many opportunities to improve the incident response. When incidents do happen, you can help get them fixed effectively and efficiently. You can improve the quality of the post-mortem process. You can drive the solutioning of action items to best fit the improvements needed in your environment and prevent similar incidents in the future.

Questions and Answers

Participant 1: You mentioned the idea of having someone who may not be involved with your team actually be part of it. I do like that a lot. I've been in that role before. Even with the incident commander role, I find there's still a bit of that execution mindset, focused on, what do I do now. Where I've seen some benefit is having more of a facilitator also there to say, remember to do this and that, almost like a coach on the side. I've found that's been really big. Because not only are you not involved in the domain, you're also not involved in the actual coordination of who gets assigned what and how we do things, so you can sit back and really see that high level. I don't know if you've experimented with that a bit, where there's a coach separate from the commander a little bit.

Doyle: I haven't seen that. I think that's an awesome idea. Part of what inspired me to include that point is that I've seen, when an incident pops up, a lot of people go, "That's not my team. That's not my app. I'm going to go back to what I was doing." Really, the more we can have people pitching in, whether they are on the team or not, and also having maybe somewhat of a coach, the better this is going to go. It doesn't go well if everyone is just like, not me, not my problem. If people, leaders across the platform, can jump in, they can point things out, keep people on track, or even see something that the people on the impacted team don't know about or might have missed. I strongly recommend that we don't do the, "I'm busy right now, and it's not my thing." Take a minute to jump into that incident channel and see if you can help out, to really improve the process.

Participant 2: What I've seen recently is like a rush to resolution, and everybody hopping on it and trying to get it buttoned up in like an hour, when really, we should have gone into it with a long ball strategy. Like, let's be deliberate, let's make sure we get it. Let's make sure we don't make a second mistake. Had we done that we maybe would have got it in 6 hours. Instead, because we were rushing and swinging for the fence on everything, it ended up taking 36 hours. Curious your experience, perspective on that one as well?

Doyle: You have just described this exact scenario. That's exactly what we did. We went in with such panic and pressure of, "This is bad, customers are impacted, must fix ASAP." We did skip over like, let's figure out why this is actually happening first, and then decide what to do about it. Let's get the full picture, how many people are being impacted and in what ways? We absolutely would have fixed this in maybe a day versus three. That's why I talked about, as staff engineers, when we can really show that we can slow down, let's not rush. Let's not jump to conclusions. Let's ask those really important questions and let's get answers first. Then come up with a game plan. Let's just try to be calm about it, because, really, in the end, we're probably going to be better off if we slow down now, and come up with a good game plan than just rushing and flailing, and then making things worse.

Participant 2: I've been burned recently by multi-day issues that did involve that. Curious if you've had experience with this: I've been trying to coach people, don't make any assumptions. Often, these are complex and compound problems. People tend to say, this worked yesterday, ergo, it cannot be related to this incident, when in fact it is one of three bad things that happened all along. Have you seen that? Have you had any luck there?

Doyle: I think we have this instinct of like, narrow it down as fast as we can. What can we take out of the picture and not look at, and not get distracted by? I think we need to do that in a way that's like, I think maybe it's not this because we have this evidence, but we need to really prove it out. Let's not throw it out immediately. Let's put it on a list of, it might not be this, but let's not look at that too definitively because then, again, we put on the blinders and really the problem's right here, but we're not seeing it because we thought we had taken it off the list.

 


 

Recorded at:

Jul 17, 2024
