This is a re-post from August 2020.
In this podcast, Ana Medina, senior chaos engineer at Gremlin, sat down with InfoQ podcast co-host Daniel Bryant. Topics discussed included: how enterprise organisations are adopting chaos engineering with the requirements for guardrails and the need for “status checks” to ensure pre-experiment system health; how to run game days or IT fire drills when everyone is working remotely; and why teams should continually invest in learning from past incidents and preparing for inevitable failures within systems.
Key Takeaways
- Enterprise organisations want to implement “guardrails” before embracing chaos engineering. Critical capabilities include being able to rapidly terminate a chaos experiment if a production system is being unexpectedly impacted, and also running “pre-flight” status checks to verify that the system (and surrounding ecosystem) is healthy.
- The global pandemic has undeniably impacted disaster recovery and business continuity plans and training. However, it is still possible to run game days or IT fire drills in a distributed working environment.
- All software delivery personas will benefit from understanding more about disaster recovery and how to design resilient systems. As more teams are building complex distributed systems it is vitally important to encourage software architects and developers to learn more about this topic.
- Much can be learned from analysing past incidents and near misses in production systems. There is a rich community forming around these ideas in software development, inspired by the learning from other disciplines.
- To minimise chances of user-facing failure during important operational events or business dates, such as sales or holiday events, organisations should generally start planning 3-6 months out. This time allows an organisation to update service level objectives (SLOs), update runbooks, conduct fire drills, add external capacity, and modify on-call rotations.
Transcript
00:27 Introductions
Daniel Bryant: Hello, and welcome to the InfoQ Podcast. I'm Daniel Bryant, news manager here at InfoQ and product architect at Datawire. And I recently had the pleasure of sitting down with Ana Medina, senior chaos engineer at Gremlin. Ana has been running a series of online workshops focused on chaos engineering for quite some time now, and she is a frequent contributor to online discussions about how to run game days and resilience fire drills within organizations. I was keen to hear about her recent learnings in this space, and also understand how she's adapted the training in relation to the global pandemic and everyone being distributed and often working at home or not in their office. I've also seen that Ana and the Gremlin team have recently been discussing the need to understand and verify the current health of a system before running chaos experiments, and so I was keen to dive into this topic in more depth and understand both the technical and the social aspects to this.
Hello, Ana, and welcome to the InfoQ Podcast. Thanks for joining me today.
Ana Medina: Thank you for having me, Daniel. Very excited to be here today.
01:16 What has been happening in the chaos engineering space for you in the past year?
Bryant: Awesome. I think the last time you and I caught up was in New York last year, when Casey Rosenthal put together Chaos Community Days. I think a lot has happened in the domain of chaos engineering since then. I know you've been doing lots of exciting stuff at Gremlin. So, what's been happening in the last year or so? What exciting stuff have you been working on?
Medina: Yeah, I think it's definitely been a really long time since we last spoke, so very excited to be catching up today. We've been working on a lot. Me, myself, I've been diving down into doing more cloud native chaos engineering, focusing on Kubernetes. I gave a keynote at KubeCon talking about some of the most common failures around Kubernetes, which was something that I didn't really have in my goals, so doing it felt like a huge accomplishment.
And apart from that, I've been working on trying to advance training for chaos engineering, so I've been putting together a lot of hands-on practices for folks who are learning the fundamentals of chaos engineering, for them to actually start implementing best practices in their organizations and for the learning to be interactive. And of course, as the pandemic took over, that became something that I focused on very much. And at Gremlin, we've been focusing on closing the feedback loop of chaos engineering, for folks to have a little bit more understanding of what their chaos engineering experiments mean within the platform.
And one of the other main things that we also just recently launched is something known as status checks, or health checks: a way to actually take the temperature of your system prior to injecting that failure. So, it's very much that safe guardrail that a lot of our enterprise customers, specifically the Fortune 500 finance companies, really needed for their teams to start automation in production and such.
03:09 You mentioned executing “status checks” to ensure system health when running chaos experiments -- should these run continually, or are they only run before each experiment?
Bryant: Very interesting, Ana. Something I want to pick up on that you said, you mentioned “taking the temperature”. That's super interesting. So, is that like a dynamic thing, or is it like a one-off, my system is ready for chaos engineering? Or is it a continual thing you'd be running?
Medina: So, right now, the way that we have status checks laid out is that you create a scenario, and a scenario is a series of different steps within a chaos engineering experiment. So, we're seeing that folks are starting out by taking that temperature check of their system. Are there any incidents running? How is the service level metric doing? Another one could very much be, is my website up? Or even checking a calendar. We launched status checks with partners like Datadog, New Relic, and PagerDuty, which are the perfect components that a lot of companies are using, but we also allow for any API endpoint to actually be added so you can create your own.
So, it could actually be like you check your own calendar. Is there a game day going on that we shouldn't be running more experiments on top of? But I think for the purpose of production, where you're really not trying to cause more harm to your system, it goes into that big picture of chaos engineering: it's always thoughtful, it's always planned. You want to make sure that you're not going to cause a larger outage, you're not going to page someone when they're already dealing with another incident. So, having those extra guardrails is really important.
So, we have it laid out that you add a status check, and then you can continue adding attacks. They can be different types, a resource attack, a network attack, and such, and then you can do another status check. So, you can literally be launching status checks and attacks right next to each other. And maybe down the line, there'll be some continuous status check that would allow you to know how your system is actually doing throughout the entire chaos engineering experiment.
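To make the idea of a pre-flight status check concrete, here is a minimal sketch of what a custom check might look like if you wired one up against your own API endpoint, as described above. This is not Gremlin's actual API; the health URL, incident URL, and timeout are illustrative assumptions. The check only gives the green light if the service responds and nobody is already fighting an incident.

```python
import sys
import requests

# Illustrative endpoints: substitute your own health and incident APIs.
HEALTH_URL = "https://shop.example.com/healthz"
INCIDENTS_URL = "https://status.example.com/api/open-incidents"

def preflight_ok(timeout_s: float = 3.0) -> bool:
    """Return True only if the system looks healthy enough to inject failure."""
    try:
        # 1. Is the website up and responding quickly?
        health = requests.get(HEALTH_URL, timeout=timeout_s)
        if health.status_code != 200:
            return False

        # 2. Is anyone already fighting a fire? Don't pile on.
        open_incidents = requests.get(INCIDENTS_URL, timeout=timeout_s).json()
        if len(open_incidents) > 0:
            return False
    except requests.RequestException:
        # If we can't even take the temperature, assume it's not safe.
        return False
    return True

if __name__ == "__main__":
    if not preflight_ok():
        print("Status check failed: skipping chaos experiment.")
        sys.exit(1)
    print("Status check passed: safe to run the experiment.")
```

In a scenario, the same kind of check would also run between attacks, so a failing check halts the experiment rather than letting it compound an existing problem.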
05:03 Who should be involved in writing system status checks? Ops, SREs, developers?
Bryant: So, is it more of an ops thing, more of an SRE thing, more of a developer thing? Who, typically, would be involved, I guess, in defining the health checks, the status checks?
Medina: I think we have that conversation of shifting left, where we want those devs to be aware of what is really important so they don't cause more harm to their systems. So, if devs have access to this tool, we hope that they're also setting up status checks, but we've also seen a lot where the SRE teams are actually the ones that are owning chaos engineering, implementing those practices, and building up the dev teams. So, as they're already putting the experiments themselves into the platform, they're teaching the rest of their team that this is actually something that they can use. So, I think a lot of it is how it's adopted at that organization.
05:50 Does defining status checks get folks to think about both the technical and social aspects of a system?
Bryant: Yeah. Very nice. Very nice. And I'm guessing this goes hand-in-hand with some of the more organizational factors, as well. The status checks almost get you to think about, "What should I be watching and what should I be alert for?" Not only at a technical level, but also, something you said which is quite nice there, is to maybe look at a calendar. Maybe there's a really important day today, or maybe there's a game day running already. So, are they quite a nice tool, these status checks, for getting folks to think about what's good and bad?
Medina: Yeah. We know that these are complex systems. There are so many different things that are changing, not only the code we're deploying to them, but our users and current events. There's a pandemic going on, and if you have a retail business and all the stores are shutting down, you won't necessarily have it top of mind in that moment, unless you're Ops or SRE, that there's going to be a heavy load on your site, more than usual. So, for those contexts that are missing from a lot of those mental models, having the tools help you figure them out is just that extra guardrail that you need, that crutch that you didn't know you needed so that you don't end up stumbling down the line.
07:00 How do you think the pandemic has affected an organisation’s ability to deal with the inevitable failures that happen in IT?
Bryant: Yeah. Very nice, Ana. Very nice. You mentioned the pandemic a few times. Unfortunately, we've all got to deal with the strange times we're living in now. I was keen to get your insight into how this has impacted folks dealing with failures, because at many companies I've worked at, the disaster recovery plan all assumed we were going to be onsite. Stuff happened, and we all rushed toward it. Obviously, we'd maybe get on a conference call first off, but then there'd physically be folks in the data center, for example. How do you think the pandemic has affected people's ability to deal with the inevitable failures that happen in IT?
Medina: I think it drew a clear line between those folks that were ready to handle high traffic and had actually been preparing for it, had an experienced team, had a resilient infrastructure, and those that didn't really prioritize reliability and resilience within their organizations, who ended up stumbling a bit. And with that, for some organizations, they were just starting to prepare for the peak traffic that was about to come. So, the pandemic started verifying, at a nice, slow rate, that things were working really well, and then, of course, as things continued being shut down, they started seeing a higher load of traffic and were able to handle it well. And now, when their high-traffic periods are about to be here, they feel really comfortable.
But I think you mentioned that really strong point where all of our teams are distributed, and maybe if we were working on site... Like, we were already at the data center, so we were easily able to go there. If the data center caught on fire, it wasn't that hard to be like, "Okay, we're going to drive to the data center as a team. We'll figure it out." Now, you have other things to think about if you actually need to troubleshoot something like that. But with that, we have everyone completely distributed, and Gremlin has had a remote-first culture, so we didn't really have an issue transitioning when the pandemic hit. Which was actually really interesting, because I'd see friends freaking out, and I'm like, "This is just a regular day for me, except I actually don't have an office to go to." I'm for sure staying home, I'm not going to the airport, I'm not in a different city. I'm just home for a really long time.
But for those folks whose teams have never been distributed, you have to start building that trust, you have to start building those relationships, whether it's between the devs and the ops folks, or the SRE team actually engaging more with the dev teams. We see that those things were still silos, to an extent, so having that extra communication has been helpful. But one of the things that I've touched upon with a lot of our prospects and customers is very much that chaos engineering actually helps you build those relationships. It allows you to do fire drills, to prepare for incidents that could happen, and for you to actually go through your runbooks, make sure that everyone has access to the right tooling, and maybe even understand that everyone's going to be working at different times because of parenting duties or doctor's appointments and things like that.
So, you start realizing that things are happening more ad hoc, and you need to enable your individual contributors on every single team to be successful on their own, and at the same time, have the right path of escalation or the right path to keep searching for more tooling, more observability, or anything that's needed.
10:19 How can engineers run game days or “IT fire drills” when they are out of the office or working at home?
Bryant: How can folks run game days from home now, I guess? Is it viable to run game days when you're fully distributed? And if so, have you got any tips for folks that might have always done their testing in the office and now suddenly find themselves pushed out all remote?
Medina: Yeah. Gremlin has always been putting out some resources on game days. We actually haven't fully fleshed out that remote game day runbook, you could say. I think I'm going to go pitch to my team that that actually might be a fun one. Because it's true, you always talk about game days, chaos days, or anything like this as an onsite experience where you would have at least four or five folks in the room, but you also want to make sure you have your managers, an intern, possibly a VP, the right folks to be there for all those conversations, to have the architecture diagrams and understanding of your systems, what tools there are, past incidents. All this knowledge that we know is never really documented, that people are holding in their heads, you want those conversations in the room.
So, as we only have remote, we only have virtual tools, I think it's possible to run a really successful game day virtually. I mean, I know a little bit about how to do it because, as I mentioned, I've been working on the workshops and the training. So, there are a lot of collaboration tools, and one of the things that can be done is setting up a Zoom call and having a Google doc, or some collaboration document, that actually lays out what is going to happen that game day. And maybe you break it up into different meetings, where you have your pre-planning process and then you have that call right before, just to make sure everyone's on track.
And then, when the day of the game day actually comes, everyone's already on the same page. Whether you've dedicated only one hour or four hours to run some chaos engineering experiments, you've already laid out all the foundational work that's sometimes easier to do in person, because you're not talking over one another. So, you've already broken that off from the main event, and you've really prepared well for it.
And then, when the game day comes, one of the ways that it just works best is if you have assigned roles to folks. So, it's going to be like, "Sally, you're going to be the commander. You're in charge of owning this exercise." Someone is going to be taking notes. Someone is going to be observing through the observability tools. Someone is actually going to test being that user in whatever system you're testing. So, as folks have all these roles assigned, you get to focus on just that one piece of the game day puzzle, and that allows you to be a lot more successful.
And we've been trying to incorporate that culture within the workshops that Gremlin runs. These bootcamps are a place for folks to come together for two, three, four hours to learn the fundamentals of chaos engineering, and then we transition to an hour of hands-on experiments. In the hands-on component, I bring up cloud infrastructure, which is usually Kubernetes, we put in monitoring and observability tools, and then we have a microservice demo environment. And we tell folks, "You're going to be put together in a group of four. There are four roles. Decide amongst yourselves what you want to do." And as we go through the exercise, we tell every single role what they need to be doing. Like, "You need to log into this tool. You need to look here." And by guiding them, that has actually helped them just focus on one thing.
Of course, we still have to figure out a lot of the tooling component, but one of the nice things that we were able to leverage is that platforms like Zoom allow you to have breakout rooms, so you can actually put folks into groups of four, assign them with your own intent, or let Zoom do this randomly. So, it would be an interesting way to see small game days get launched within an organization. And at Gremlin, we actually transitioned a bit of the way that we were doing some of our game days. Jason Yee, the Director of Advocacy, came on board; he had also been doing chaos engineering at Datadog.
And as we've been working together on things to make Gremlin more reliable, we have a few projects in mind, but instead of looking at doing big game days within the organization, we've actually been doing small game days within every engineering team, with other folks in the organization also coming on board to understand more about how it's going. It's like, we have to practice what we preach, so we're also trying to figure out new ways that we can push that needle forward.
14:52 How do you encourage folks that identify more as traditional software architects into participating in game days?
Bryant: Awesome. I want to pull out a couple of things you said there, Ana. I think the training for these things is really, really important. In my experience with some of the ops stuff, I've had a hard time getting buy-in from, say, an architect persona. Everyone has a clear value in the organization, but in some of the interactions I've had with people that identify as an architect, they're like, "I don't need to worry about the ops stuff. I don't need to worry about the failures." Is that something you've seen? And how do you encourage folks that are perhaps more traditional architects into this learning experience?
Medina: It's interesting, because I've had a lot of architects come through my workshops, and you feel confident in your architecture diagrams. You don't want anyone to tell you that they're wrong. And one of the first exercises that we've put together is like, "Let's actually test those critical dependencies. What is the critical path of your application? Can you create a hypothesis that your application is going to stay usable, that the product will degrade gracefully, and that the user is still able to go through the steady-state flow for the critical path without any issues?"
And when we see that, we provide them an architecture diagram that they didn't even create, and they feel super comfortable, because it's like, "That's an architecture diagram. I trust it. I believe it." And then we're like, "Let's go actually test what those critical dependencies are." You look at it. You think that if the ad service goes down or your caching layer goes down, your application is going to be okay. And when you inject that failure, in order to block the traffic to that service, to that container, and you see your entire application completely break, it's like, "Oh, that architecture diagram may not have been accurate in this moment." So, that is one way to put it in front of their face, like, "Let's run this in a controlled way and actually show you that your mental model of your system might really not be what's actually happening in real life."
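A steady-state hypothesis like the one described here can be codified in a few lines. The sketch below is an illustration rather than any particular tool's API: the critical-path URLs are hypothetical, and the idea is to run the same probe before injecting the failure (as a baseline), while traffic to the ad service or caching layer is blocked, and again after recovery, to see whether the architecture diagram's assumptions hold.

```python
import requests

# Hypothetical critical-path steps for a retail frontend.
CRITICAL_PATH = [
    "https://shop.example.com/",
    "https://shop.example.com/product/123",
    "https://shop.example.com/cart",
    "https://shop.example.com/checkout",
]

def steady_state_holds(min_success_rate: float = 1.0) -> bool:
    """Probe each step of the critical path and check the success rate."""
    successes = 0
    for url in CRITICAL_PATH:
        try:
            resp = requests.get(url, timeout=2.0)
            if resp.status_code == 200:
                successes += 1
        except requests.RequestException:
            pass  # treat connection errors as failures
    rate = successes / len(CRITICAL_PATH)
    print(f"critical-path success rate: {rate:.0%}")
    return rate >= min_success_rate

# Run once as a baseline, again while the dependency (e.g. the ad service
# or caching layer) is blocked, and again after recovery.
if __name__ == "__main__":
    assert steady_state_holds(), "Hypothesis violated: critical path degraded"
```

If the probe fails only while the dependency is blocked, the diagram's claim that the dependency is non-critical was wrong, which is exactly the kind of gap between mental model and reality the exercise is meant to surface.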
And with that, for ops and SRE, there are various layers to it, where you automatically think that after you set up your dashboards or your monitoring and observability, everything is going to work. But by buying this tooling without doing any training or setup, you're still not ensuring that things have been set up properly, so you want to go and verify that monitoring. The same thing with runbooks and any incident support that your team has: you create them, you hope that they're going to be the right tools for when your team really needs them, but when that moment strikes, and your team has to go through this flow and they realize that it's not working, that's really expensive.
So, when we talk about bringing in chaos engineering really early, it's very much about having those fire drills, having those teams practice the disaster recovery process, practice what it's like to be someone that manages an incident and actually has to escalate, and things like that. You really don't get that practice otherwise. In my talks, I always joke around: when you got put on call, raise your hand if you just got thrown a pager and got told, "You're going to be successful." That's really not training whatsoever.
17:57 Can you offer any guidance on how to design chaos “attacks” or experiments against a system?
Bryant: So, if we were to take a step back and try to understand where to perhaps attack our system, or experiment with our system, have you any guidance for folks? Because I've been doing some work recently around threat modeling. So, there are nice books and plenty of literature around how to build a mental model of systems and how to look for potential security vulnerabilities. Is there anything analogous, or something that's similar, for how to approach chaos experiments?
Medina: There's a little bit of that. I think one of the two starting points that I always recommend the most is testing those tier zero, tier one services. You really don't want to go for what's known as the low-hanging fruit, like working on systems that don't really have a business impact or don't really have users. Whether it's internal or external, that's not really going to give you that return on investment. You're putting a lot of time into the planning and execution of chaos engineering, so you want to make sure that the time you're putting in is going to be valuable. So, by focusing on those tier zero, tier one services, you're going to focus on something that can actually be really valuable for your team and organization.
And then, the other way that we tell folks to think about where to focus their chaos engineering is, "Let's actually look at your past incidents. You know that you suffered a lot of these failures. You know that you wrote them up, hopefully in a really blameless way." And after you've written them up, you put together some action items. You create your tickets, you put them somewhere in the backlog of things that you want to get done. Maybe your engineering team was super proactive and actually got all those tickets closed out.
Hey, that doesn't usually happen, but let's say your team actually did work through all those tickets, got them closed, put in those patches via code and processes. How do you actually go verify that, if the conditions of your incident were to happen again, your system is actually resilient? You don't, and that's a really, really good place for folks to start practicing chaos engineering: "We suffered this incident in the last few months, let's actually make sure that we have learned from it and that we are now going to be in a better place because of it. Let's actually get a return on the downtime that our company suffered."
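One lightweight way to operationalize this, sketched below with entirely hypothetical structure and names, is to keep a small catalogue that pairs each past incident with an experiment recreating its conditions and a steady-state check, and then replay that catalogue on a schedule:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class IncidentReplay:
    """A past incident paired with an experiment that recreates its conditions."""
    incident_id: str
    summary: str
    inject_failure: Callable[[], None]   # e.g. add latency, block a dependency
    verify_recovery: Callable[[], bool]  # steady-state check while failure is active
    cleanup: Callable[[], None]          # always undo the injected failure

def run_replays(replays: list[IncidentReplay]) -> None:
    """Replay each incident's conditions and report whether the fixes held."""
    for replay in replays:
        print(f"Replaying {replay.incident_id}: {replay.summary}")
        replay.inject_failure()
        try:
            ok = replay.verify_recovery()
            print("  resilient" if ok else "  regression: fixes did not hold")
        finally:
            replay.cleanup()

# Hypothetical usage:
# run_replays([IncidentReplay("INC-142", "cache outage took down checkout",
#                             inject_failure=block_cache_traffic,
#                             verify_recovery=checkout_still_works,
#                             cleanup=restore_cache_traffic)])
```

Each entry turns a postmortem action item into a repeatable verification, rather than a ticket that quietly goes stale.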
The other note on learning from incidents is that I think it goes a step beyond just learning from your own organization. There are huge communities out there that are talking about incidents happening in the platforms that we all touch. So, it can be anything from listening to podcasts that talk about our systems' failures, or, if we look at the learning-from-incidents work from Nora Jones and some other folks in resilience engineering, we have amazing write-ups of what happens when complex systems actually show that they're really complex and your team doesn't know that. And there's an entire GitHub repo that collects other postmortems, with less in-depth analysis of those incidents.
20:55 What are differences and similarities between chaos engineering and resilience engineering?
Bryant: Yes. I actually spoke with Nora a few weeks ago on the podcast, and she was very much emphasizing that we're dealing with socio-technical systems. And she was also discussing the differences and similarities between chaos engineering and resilience engineering.
Medina: And on that note, it's very much that chaos engineering is just a subset of resilience engineering, in a way, where you're still focusing on making these complex systems really, really resilient, but chaos engineering currently just focuses on the software space. Within the software space, you focus on the people, the processes, the code that's there, the infrastructure. So, we only focus on technology, in a way, as of right now, whereas with resilience engineering, we have this entire field that's not just software. We talk about the medical field, we talk about aviation, so it goes a little bit beyond that.
Bryant: Yeah, that's nice. I think it's something that I've definitely seen in my career, and guilty as charged, for sure: we don't always learn from other fields. And there are actually many other interesting sources of information out there, I guess. Yeah?
Medina: Yeah. Definitely. I think it's like in our world, where we talk about silos within our teams and organizations; when we look across industries, there are similar silos of knowledge about how to build reliability, how to actually build resilient folks, how to actually make sure that you're training folks properly. I know the conversations that always come up are about aviation and pilots and the amount of training that they actually have to go through, from testing to staying up to date.
But when we talk about software engineering, we rarely do even a fraction of the training that these folks do, and to an extent, our technology can hurt as many people as a crash of one of these flying machines can. So, we haven't built this conscious effort in software engineering for every single engineer, whether they want to become an engineer or already are one, whether it's ops or SRE as well, to be conscious that the technology is going to fail, and that this can actually harm people, whether it's devs or the really critical systems that people depend on. We still need to make a little bit more progress in the industry on that.
23:04 Does the software industry pay enough attention to ethics?
Bryant: I think that's very well said, Ana. I remember when I was at college, or university, I did have an ethics course, but it was very much about not copying other people's work. Right? Not actually about the impact my system can have on other people. So, I think perhaps that's something I should feed back and say, "Hey, we need to change this ethics course." Yeah?
Medina: Well, I mean, I think even on ethics, the entire technology industry just needs to sit down and learn ethics for a year. And we have various complexities there, from the folks that have actually built those systems and maybe the injustices that have gotten built in, whether it's racism or just class separations and things like that. So, I think ethics alone is just an entire other conversation, and I really hope that, as we're seeing things go, the software industry is going to wake up and realize that we need to have more diversity in the folks that are building those systems, whether it's writing code, putting processes in place, or bringing innovation in, but also in the policies that are tied to that, to make sure that human rights are protected and every single individual is not getting unfair treatment due to the technology that they're using.
Bryant: Yeah. The whole world is a complex system. I think as engineers, we like to make it simple sometimes, but you've always got to look at the externalities. Right? The impact. Because we can all do incredible things as engineers, but I'm starting to realize my responsibility now of thinking, exactly what you said, thinking some of these things through. It's really important to do that. Yeah.
Medina: I think that big picture model of like how our software actually affects folks, a lot of folks don't have that.
24:35 How far out should organizations and teams start planning when they want to ensure or verify their systems are resilient for an important date, something like a Cyber Monday or a summer sale?
Bryant: So, changing gears a little bit, I was keen to get your insight into how far out organizations and teams should start planning when we want to ensure or verify, I guess, that our systems are resilient for an important date, something like a Cyber Monday or a summer sale?
Medina: I think that you should be planning three to six months before. I think one month is cutting it way too short. Considering a lot of dev cycles take two to three weeks to actually get through to production, we're talking about barely having any time to get that code out there, let alone make sure that your systems are resilient once this code has actually been implemented into your systems. So, I think that three to six months is usually the sweet spot, and that also depends on what type of industry you're looking at and how much practice you have had. Like, last year, how did it go?
But also, since then, have you actually been running a lot of these practices, whether it's running load tests and capacity testing, whether it's doing more chaos engineering, whether it's implementing more site reliability engineering practices like SLOs, or updating runbooks, making sure that you're doing fire drills of your runbooks, adding external capacity, modifying on-call rotations, all those things. If you actually want to prepare, you're going to need those three to six months. There are industries where folks actually prepare a year out, and we also touch upon, in the report, how in older enterprise companies we see that you end up doing a code freeze in order to actually prepare for that, and that sucks for devs. Your job kind of stops.
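As a rough illustration of the SLO arithmetic that feeds into this kind of planning, here is a small sketch with made-up numbers showing how a team might check how much error budget remains in the run-up to a peak event:

```python
# Illustrative error-budget arithmetic for a 99.9% availability SLO
# over a 90-day window leading up to a peak event. All numbers are made up.

SLO_TARGET = 0.999                     # 99.9% of the window should be healthy
WINDOW_MINUTES = 90 * 24 * 60          # 90-day window in minutes

# Total allowed "badness" in the window (~129.6 minutes for 99.9%).
error_budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES

# Downtime already spent on incidents this quarter (hypothetical).
downtime_spent_minutes = 85.0

remaining = error_budget_minutes - downtime_spent_minutes
print(f"Error budget: {error_budget_minutes:.1f} min, "
      f"spent: {downtime_spent_minutes:.1f} min, "
      f"remaining: {remaining:.1f} min")

# If most of the budget is already gone months before Cyber Monday, that is a
# signal to spend the remaining time on reliability work (fire drills, runbook
# updates, capacity) rather than shipping new features right up to the event.
```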
26:12 Are code freezes a viable strategy for ensuring resilience?
Bryant: Yes. Yeah. Not good. What do you think of the code freeze practice in general? Is it still viable these days, do you think? Or maybe it's a warning sign that things need to change?
Medina: I think it's a warning sign that things need to change. I think there come moments in organizations where doing a code freeze is the only thing you can do, where your team might just not be ready, and you have to say, "Hey, we can't be creating changes because we're dealing with too many incidents. We need to figure out what's going on in our systems first, prior to adding more stress or more complexity to them." So, I think, in those moments, some organizations should be doing that, depending on their infrastructure. But for the most part, we see that a lot of the more modern organizations are using DevOps technologies and site reliability engineering practices. Those folks are able to move at a faster rate in shipping code and iterating, and of course, that also means that things might actually be breaking more, but maybe, because our teams are trained up, they know how to handle those incidents a little bit better.
27:13 Are you running any online training courses soon?
Bryant: So, you mentioned your workshops and training courses. Are you planning on running them at any online community events coming up soon?
Medina: Yes. So, I run training courses every single month. If you go over to gremlin.com and look for bootcamps, you'll find an entire landing page about the workshop offerings that Gremlin has. These are free courses where you can come and learn about chaos engineering. So, we rolled out the 101, which is the fundamentals of chaos engineering. You still get hands-on learning. And then, we're rolling out right now the 201 of chaos engineering, which is about the automation of chaos engineering and doing continuous chaos engineering in order to make sure that you're not drifting into failure, so that's something that's going to be coming up in the next few months.
And as far as conferences go, we're putting together Chaos Conf again, so this is going to be the third year that Chaos Conf comes about. And it's going to be a virtual conference; there's no choice on that front. So, on October sixth, seventh, and eighth, we're going to be having this large chaos engineering community event for folks to actually come and talk about different things that pertain to chaos engineering. We've broken it out so that the three days have different tracks. On day one, we're going to talk about how reliability comes with practice. Day two, it's all about completing that DevOps loop. And then, on day three, we're talking about the data-driven culture of reliability.
So, the CFP is actually open, and if any of these themes catch your attention, we have the CFP open until August 14, so please go ahead and apply, submit anything that you might be mulling over. And if you have any questions, you can always reach out to the Gremlin team to get feedback on some ideas that you might be having. And we already have registration open. So, if you head on over to chaosconf.io, you'll not only get the CFP link, you can also register, and you get to learn that we have two amazing keynote speakers coming in.
29:12 How can folks get in contact with you?
Bryant: If folks want to get in contact with you, where's the best place? Are you on Twitter or LinkedIn? Or where's the best place to follow you?
Medina: Yeah, you can actually find me on all social media. I have that pretty open to the industry. So, I would prefer you contact me via Twitter. My handle is Ana_M_Medina. That's probably the best place. And then, if you also want to reach out, LinkedIn works really well.
Bryant: Nice. Nice. Well, thanks for your time, Ana. I really appreciate it.
Medina: No, thank you very much for having me in, as well. Thanks for this amazing conversation.