In this podcast, Shane Hastie, Lead Editor for Culture & Methods, spoke to Vanessa Huerta Granda, Manager of Resiliency Engineering at Enova, about resilience and incident management.
Key Takeaways
- It is vitally important to consider the sociotechnical aspects in tech systems
- Resilience is the ability to sustain challenges and recover from failures
- A culture of resilience means that people in an organization can handle whatever is thrown at them and work together to find solutions
- Good incident management involves allowing engineers to do their best work and focusing on communication and coordination
- After an incident, it is important to have a retrospective to learn from the experience and make improvements for the future
Transcript
Good day, folks. This is Shane Hastie for the InfoQ Engineering Culture podcast. Today I'm sitting down with Vanessa Huerta Granda. Welcome. Thanks for taking the time to talk to us today.
Vanessa Huerta Granda: Thank you for having me.
Shane Hastie: So Vanessa, you were the track host for the Resilience Engineering, and I love the second half of that, Culture As a System Requirement track at QCon San Francisco. Let's start delving into the track. What was the message you were trying to convey when putting that track together?
Resiliency in sociotechnical systems [00:52]
Vanessa Huerta Granda: I, and a lot of folks from the Learning from Incidents community, when we're thinking about our tech systems, we don't just think of them as technological systems. We like to think about the sociotechnical system. And so socio, that people part is part of our systems. Everything that we do, it depends on people, it's running because people are making decisions at some point or another. Maybe when something is getting first developed, when something's first getting architected, but also when you're maintaining something, when you're handling issues, incidents, when you're learning from them or anything like that. So the idea here is that culture is part of the system, is part of that sociotechnical system. And resiliency, making sure that your organization is a resilient culture is a huge part of that.
Shane Hastie: Taking a step backwards, what brought you to focusing on this? What's your background?
Introducing Vanessa [01:43]
Vanessa Huerta Granda: I am an industrial engineer, which, weirdly, actually fits in perfectly with resiliency work, with understanding how it is that people work, how it is that people create the software, write it and maintain it. So I have worked in operations as a Site Reliability Engineer and leader for the past decade. I am currently the Manager of Resiliency Engineering at Enova, and previously, I was at a startup called jeli.io, focusing on products to help people handle their incidents, learn from them, the entire lifecycle.
Honestly, I got into this kind of work because I just really enjoy solving problems and I really enjoy talking to people. And at some point, a boss of mine realized that was actually a good fit for this kind of work, and so I've been doing it ever since.
Shane Hastie: So going right down to first principles, what do you mean by resilience?
Defining resilience [02:34]
Vanessa Huerta Granda: A mentor of mine once said that resilience is something your system does rather than what your system is. Resiliency is our ability to sustain challenges, to sustain fractures, failures, whatever it is. I think we often think that having a resilient system means that our system is never going to break. That's just not true at all. It means that your system, including the sociotechnical system, sometimes something bad is going to happen, something is going to break, and how do we recover from that?
Shane Hastie: And what does a culture of resilience look like and feel like?
Vanessa Huerta Granda: Well, I can tell you I've experienced that all week. So a culture of resilience really means that the folks working at your organization are able to handle whatever is happening, whatever is thrown at them. So we have this idea that we have these plans for the quarter, we're going to get so many projects done because we have so many T-shirt size things that we're going to do. But at some point, something is going to happen, someone's going to go on vacation, someone's going to get sick, or maybe the code that we thought we understood actually doesn't work the way that we thought it did.
And so that's when you have your socio part of the system figuring out a way to make it work. And that can be through automatically having failovers in your system or just having a process where people talk to each other and figure out like, "Oh, hey, can you help me with this? Can you help me with that? Let's prioritize this. Let's prioritize that."
Shane Hastie: Incidents and resilience, of course, go hand in hand, but incident management in my experience is something that many organizations do haphazardly at best and often very badly.
Vanessa Huerta Granda: I hate that.
Shane Hastie: So what does good incident management look like?
Good incident management [04:11]
Vanessa Huerta Granda: Oh my gosh, how much time do you have? When I think about incident management, I think about allowing your engineers to do their best work. So as an incident manager, I am not here to yell at anyone, and I hate this idea. We often think of the person leading an incident, we call them incident commander. I can tell you that when I first started this job, I was 25, the only woman in the room, the only Latina woman in the room, and I definitely did not feel like a commander. What I did feel like was somebody who could talk to people and get them to actually discuss what was happening and figure out the best way forward. Incident management that actually works is understanding that we're all working for the same team, we're all working for the same goal, which is to just get our systems back to normal and that we need to work together to do that.
The other side of the coin is that when you're in an incident, if your company's making money, that means that during the incident you're not making money or something bad is happening. And clearly people are going to care, clearly your stakeholders, your leaders, they need to be aware of what's happening. And so I like to tell the responders to, "Worry about responding, worry about doing the engineering things. Use your brain towards that. I'm going to focus on the communication. I'm going to focus on the coordination and the collaboration, and I'm going to make sure that you're not answering a million things from your CTO, that you're not worried that your CMO is going to be upset because the website is down. I'm going to take that away so we can work towards resolution."
Shane Hastie: And then what happens afterwards?
Vanessa Huerta Granda: Oh, my favorite part, you gossip about it. So we have an incident that's over, and I work from home a lot more nowadays than I did back then, but you're outside of the office and you're outside of the war room, whatever you want to call it. And you're talking about it, right? You're discussing, "This is something that happened, that is something that happened." Maybe you go to lunch, maybe you go to happy hour. People are always going to talk about it. In cultures where there's not that culture of resiliency, you have a postmortem, and that postmortem is usually some sort of document, that incident report that tells you, "This incident started at 5:00 AM and it was over by 7:00 AM and it was Shane's fault, and Shane really sucks."
Shane Hastie: Yes, the who-can-we-blame session?
Vanessa Huerta Granda: Right?
Shane Hastie: That's a really, really important part.
Vanessa Huerta Granda: Oh my gosh. And that to me is really, really sad because an incident is just this big red arrow pointing you towards something that is happening at your organization that you can learn from. And so, I like to think of incidents as learning opportunities. So a retrospective can be that. People are talking about your incidents either way, people are talking after the incident, I'm certainly Slacking my work bestie to be like, "Hey, can you believe that happened?" Or during lunch afterwards we're going to talk about our incident, and so we might as well talk about it together and learn from it.
So usually what I like to do during a retrospective is make sure that people are sharing what happened from their own point of view because at my previous role, I was doing a lot of consultant work. I was not an engineer, and for the first time when I was in an incident, I wasn't seeing what was happening in the code. I was seeing what our customer was experiencing. And so that is a different point of view than the engineer, and that can certainly make a difference in how we move forward. The idea is that you have a learning review, a postmortem, retrospective, whatever it is that you want to call it. You're learning from that. And then you're coming up with action items that are helping move the needle in the future.
And it can be as easy as like, "You know what? Maybe we need to have better post-release checks." Or maybe it's something like, "You know what? This process that we've been working, it probably doesn't make sense anymore. It made sense back a year or two years ago. It doesn't make sense anymore. Maybe we need to do some more training," et cetera, et cetera. There's many things that you can learn from an incident.
Shane Hastie: How do we avoid it being that blamestorming activity?
Avoiding blamestorming – turn postmortems into learning opportunities [07:59]
Vanessa Huerta Granda: Well, that's the part of the culture, right? That's where you have to, as an engineering leader and as individual contributors, make sure that when you're leading a retrospective, when you're leading a postmortem, that you are not just filling out a document, that you are speaking up and you're letting other voices be heard. So there's a lot of good information out there. There's Etsy's Debriefing Guide. There is the Howie Guide from jeli.io. I co-authored it a few years ago, and it helps people understand how to best position themselves so they can turn their postmortems into learning opportunities.
From my standpoint, the best advice that I can give you is... If you're leading a retrospective, never be the only person that's speaking. Let other people speak up, let other people share from their points of view what happened, and be forceful about that, "This is not a blaming game. We're all on the same team."
Shane Hastie: And communicating the outcomes. What I want to explore there is getting off the hamster wheel of just incident, incident, incident response and breaking what feels to me at times like a never-ending spiral for folks.
Vanessa Huerta Granda: I think I've mentioned this earlier, I like to think of incidents as an incident lifecycle where you have your incident, something breaks, then you have your retrospective where you're learning, out of the retrospective, there's action items. And so that's feeding into the process and there's things that you can do throughout that entire lifecycle to make things better. This is the part where I can give you my example that actually, let me understand... Not let me understand, but really highlighted how the incident lifecycle can be applied to anything.
I currently have 2-year-old twins and they're my only children. When they were first born and we took them home from the hospital, it was the craziest incident I had ever had in my life. And I had had a lot of incidents, but it was kind of bananas. I was like, "Shoot, I can't sleep at all." And my husband couldn't sleep either because if one wasn't crying, the other one was. And so we felt like we were in this hamster wheel fighting this incident over and over again, and we just did not have the brain power to do anything about it. And so what we did was like, "Let's try to make this process easier for us." And that's what I recommend a lot of organizations start doing. Make the process easier for your responders. A lot of organizations start introducing tools, introducing Slack bots or Teams bots, whatever sort of chat platform you're using, start communicating, automating some of the things.
And what we did is we hired a night nanny. So once you have the bandwidth, you're out of constantly, constantly fighting incidents, then you're able to start putting those productive retrospectives, productive postmortems and start figuring out what it is that you can do at a higher level. Right? At that point, that's when my husband and I were like, "You know what? We don't have to drag our two infants in their car seats to Costco. We can just have the diapers delivered." So we're thinking of a change to the process that's going to make the entire lifecycle easier, not just one specific incident. And I think I gave that example earlier, right? After an incident, maybe once we are able to have a retrospective, we realize, "You know what? We need better controls for this process. Or we need to have a testing suite that makes more sense or blue-green deployments," whatever it is that you want to call it. And so those action items make it easier, lead to fewer incidents, give you more bandwidth.
And then the last thing that we like to do is cross-incident analysis, where you've made those easy changes, you've addressed the low-hanging fruit, and then you're taking a holistic look at your incidents. Maybe you're taking a look at all the incidents that you had in the last quarter, and you're able to say like, "Okay, you know what? This team has a lot of incidents. Let's try to maybe give them a little bit more headcount." Or, "It seems like these two teams are working on something similar, or it seems like a lot of these incidents are related to this antiquated pipeline. Let's maybe give more resources to do all of this." And so you go from making small changes to the incident process itself to making changes out of those incidents. And then you're making larger transformations.
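Editor's sketch: the cross-incident analysis described above can be illustrated with a few lines of Python. This is only a minimal sketch under stated assumptions, not a tool or method named in the conversation. The `Incident` record, its `team` and `contributing_factor` fields, the `summarize` helper, and the sample data are all hypothetical; a real review would draw on whatever your incident tracker records and on the retrospectives themselves.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; the field names are assumptions."""
    opened: date
    team: str
    contributing_factor: str


def summarize(incidents, start, end):
    """Count incidents per team and per contributing factor within [start, end]."""
    in_window = [i for i in incidents if start <= i.opened <= end]
    by_team = Counter(i.team for i in in_window)
    by_factor = Counter(i.contributing_factor for i in in_window)
    return by_team, by_factor


# Made-up sample data for one quarter, for illustration only.
incidents = [
    Incident(date(2023, 7, 3), "payments", "legacy-pipeline"),
    Incident(date(2023, 7, 19), "payments", "missing-post-release-check"),
    Incident(date(2023, 8, 2), "search", "legacy-pipeline"),
    Incident(date(2023, 8, 30), "payments", "legacy-pipeline"),
    Incident(date(2023, 10, 5), "search", "config-drift"),  # falls outside the quarter
]

by_team, by_factor = summarize(incidents, date(2023, 7, 1), date(2023, 9, 30))

# Teams and factors with the most incidents come first.
print(by_team.most_common())
print(by_factor.most_common())
```

Counts like these are a starting point for the conversation about headcount or an aging pipeline, not a verdict; the context from each retrospective still has to explain why the pattern exists.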
And actually, during my time, we were able to make a case for changing, for going into more of a DevOps organization, moving away from an ops team that everything was funneled through, to having an SRE team that is not a funnel that everything goes through, because we were able to look at the incidents and because we were able to try to find patterns there. And I did the same thing with my children and now I love being a mom.
Shane Hastie: Two-year-old twins, your life is a hurricane.
Vanessa Huerta Granda: It's fun. I do incidents for a living.
Shane Hastie: One of the things that we touched on earlier, but I would like to dig in deeper is stress and burnout. How do we help folks reduce the stress and avoid burnout? Because we certainly know that burnout is a significant issue in our industry at the moment.
Reducing stress and avoiding burnout [12:51]
Vanessa Huerta Granda: Absolutely. We've seen it everywhere, right? Burnout. I don't see people getting burnt out because they're working so much, as much as they get burnt out because they're working hard, but it never stops. Nothing ever changes. And that's why I am so passionate about the learning part of things and not just the resolving problems, but if you're in the hamster wheel, I want to hear what you are seeing and I want to see where it is that I can help. And that's why I like to make those... You have those short-term action items that can maybe help things a little bit, but then when we're making those higher level recommendations, we include things like, "Let's add more headcount because things are hitting the fan." Or you always hear engineers saying, "The problem is that we have this outdated architecture that made sense a while ago, but no one's going to put the resources."
Well, if I'm the person that has all of the data around incidents and I'm able to go up to your leadership team and tell them, "All of these incidents that you're having, maybe you should put in some resources into doing something different." That's going to allow people to see that their hard work isn't just for nothing. I'm also a manager of my specific team, and I take that very seriously and I make sure to have personal connections with them and make sure to give them the time that they need to rest, to listen to them, and to be proactive about that, right? Like, "If you were up all night working on an incident, please don't come in today. Please take some time to sleep." And the same goes with your personal life, right? Like, "If you were handling your three-month-old baby overnight, there are more important things out there."
Shane Hastie: You are a manager, you're a leader of a team. A lot of our audience are stepping into that role, often for the first time. What advice would you have for them?
Advice for new leaders [14:40]
Vanessa Huerta Granda: When you're an individual contributor, it's sometimes hard to understand the constraints that management is working with. I think becoming a first-time manager, I wish I had given myself a little bit more grace and realized that I can't change everything. One of my mentors actually, her mantra was, "Grit and grace." Yes, try to work through things with grit, but also give yourself grace. Give other people grace. No one's out to get you. And I feel like it's taken me a little bit to realize that, especially when you're working with incidents, when you're trying to work with people from different functions, they're all working with their own constraints. And so remembering that you're on the same team, I think, makes a lot of difference.
And then when you're managing your own team, I mentioned that I take that very seriously. These are people's livelihoods that I have on my hands, right? I'm their manager, and so they spend a lot of time working, and I just want to make sure that I'm listening to them, that I'm understanding where they're coming from, not making assumptions, giving them grace as well.
Shane Hastie: Grit and grace. I like the combination, grit and grace. Thank you very much. Vanessa, if people want to continue the conversation, where will they find you?
Vanessa Huerta Granda: I guess on X is now what it's called, I am the v_hue_g. You can also find me on LinkedIn. My name, Vanessa Huerta Granda. And yeah, I talk about incidents all the time and a little bit about reality TV, mostly about incidents.
Shane Hastie: Thank you so much for taking the time to talk to us today.
Vanessa Huerta Granda: Thank you, Shane.
Mentioned:
- QCon San Francisco
- Etsy Debriefing Guide
- Howie Guide from jeli.io
- Vanessa Huerta Granda on X and LinkedIn