
Learning from Chaos: Architecting for Resilience


Summary

Russ Miles, CEO of ChaosIQ.io, shares how leading organizations are successfully adopting chaos engineering to encourage a mindset of "architecting for resilience". Drawing from a collection of real-world examples and experience reports, he shows how to set up systems to learn from controlled failure and make resilience an important competitive edge for an organization.

Bio

Russ Miles is CEO of ChaosIQ.io where he and his team build commercial and open source products and provide services to companies applying Chaos Engineering to build confidence in the resilience of their production systems. He is an international speaker, trainer and author.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

This is the first talk ever where I have a living legend in the room, someone I admire hugely. Here's a little tip for those thinking about doing talks: if you're ever speaking in front of someone you admire, quote them heavily. It works twofold: it's flattering, I suppose, I hope, but you can also sidestep any criticism of your talk if you just quote them. That's my aim today.

My game is very simply this: I try my best to ride motorcycles as much as my investors will let me. Last year, I rode across the United States; I did 6,500 miles on the way to a conference. Vegas was interesting. I had five people in the room for a chaos talk, of which three had seen the pizza. They pretty much had no idea what I was talking about, so it's quite astounding for me to have this many people actually interested in chaos, although when you go to the Valley, you get this many too.

That's what I did last year, and I learned a lot, because after doing 6,500 miles across the U.S., I went to Tibet, as one does, and rode a motorcycle through Tibet to Everest base camp, and I learned what software looks like. It starts with a roadmap. It looks like a plan, but believe me, that's down a serious mountain, basically a just-make-it-up-as-you-go-along sort of road, followed by the beauty that is production, followed by the evening that I spent at base camp at the wrong monastery, which is the reality of production.

The fact I'd like you to embrace, at least, is that production doesn't like you very much. Production is the place where your software is least likely to work, and it runs under conditions it has never experienced anywhere else. When you actually take your software to production, it's very unlikely that it's going to work at all. I remember I was over in Florida, and there was a discussion of pipelines: we're going to do this, then we're going to do this, and it will just work. I watched the senior architect in the room; I don't care what you think of architects, they know stuff. I could see him in the corner of the room, and when they said, "We do this, then we do this, and it will always work," he went, "No, it won't. It never has before, so why would that change?" "Oh, yes, we're doing microservices. Oh, yes, we're going to the cloud. Oh, yes, we've got these other newfangled things." "It's still going to fail. I'm sorry." It's a horrible realization: everything you're trying to build will fail in production. But that's okay, as long as you know it, as long as you embrace that potential, at least.

What Chaos Is Not

Starting off: who here does chaos engineering? There are a few hands now; last year, nothing. I actually did a talk last year where I said, "Who here does chaos engineering?" I had to broaden that: "Who here does disaster recovery? How do you know it works? At what point do you sit there and go, 'I feel good about our disaster recovery strategy because it's comfortable'?" Never. The tragedy of disaster recovery is that we only ever exercise it at the worst possible moment, when it's actually necessary. There's nothing like tweaking a system in the middle of a crisis.

Just a quick thing for you; when you're learning in a system, you've probably realized this. When you're in the middle of a crisis, a personal crisis, that moment when everything hurts, what do you think of the friend that turns around and goes, "I could have saved you this pain. You could have done something different"? What's your reaction? I won't put it in crude terms, but I know what mine is: "Not now." No one learns in the middle of a crisis. I had this recently; I was in the middle of a crisis and someone said, "You know how you could avoid this?" I was like, "No, I don't want to know how I could have avoided this, not now." It's harsh.

The thing is that production is full of crises, and what we've been offered by chaos engineering is this, which, by the way, is actually a photo of an army of monkeys from China. If anyone knows me well enough, they know I had the world's worst book cover: my first book was for O'Reilly, and when they sent the front cover to me, they said one word, "Sorry." It's a monkey that looks like it hates the reader, a monkey that looks like you've interrupted it doing something else; it's a terrible front cover. This isn't what chaos engineering is about. Don't be the person that phones me up and says, "We've done Chaos Monkey in production. It's destroyed everything."

The next sentence is beautiful, by the way: "We knew it would." You are not a chaos engineer at that point; you're a sadomasochist at best. That's okay if that's how you get your kicks, but that's not what we're doing here. I'd like you to let go of the monkeys just for a moment, and I'm going to try to talk about what chaos engineering is supposed to help you with. It's not the answer to everything, but it'll help.

Being Wrong

Another confession: I am wrong frequently. There aren't many people on stage that will admit that. I am wrong on a regular basis; you've only got to talk to my ex-wives to know how wrong I can be. Thank you, by the way, to them. The thing is, I talk about this mantra in software development, and I hope we all adopt it: we don't know what we're doing, they don't know what they want; that's normal, that's when it's working. You find me another discipline, other than science or research, that works like that. You're wrong on a regular basis, and we still pretend we're right. We still pretend that we're building the right thing. And I do this: I train people on how to architect and design software. Most of it can be distilled down to one simple message; I apologize to those who have paid for it. It's very simple: we don't know what we're doing. How about we realize and embrace that, and build software in such a way that we can do the best we can with the collection of options that we select?

When it comes to being wrong, why are we upset by that? I guarantee you everyone here is wrong on a regular basis. If you don't think you're wrong, you need to examine yourself. We are always wrong, but we don't like it, we don't like admitting it. It's vulnerable; it feels like we're admitting things, it feels like a confession. Why are we scared of being wrong? You only have to look at definitions of being wrong. This is one: not correct or true, therefore incorrect. That's not bad, but it gets worse. An injurious, unfair, or unjust act: now it's bad, now you could be sued, and it gets worse. An action or conduct inflicting harm without due provocation: oh my goodness, and we do this for a living. A violation or invasion of the legal rights of another: if you're not really scared of doing software now, you should be. You have so much power to violate and invade others' legal rights.

So how do we navigate this morass? How do we embrace it? How do we embrace being wrong? I get this response: "We're not wrong. We're right. We know what we're building." I went to a client recently and they said, "We know what we're doing. We know exactly what our business is, we know what our product is, we know what we're building, but we're going to do it in a different language, with a different framework, and on the cloud. No risk involved at all."

I would like to distill it down to knowing that we're going to be mistaken or incorrect frequently. It's not actually something that people like to say on stage, and it's not something that anyone likes to state at the coffee bar: "I'm wrong today." I'm going to ask you this, and I'm going to ask you to raise your hands: how many people here have admitted they're wrong in the last week? Well done, yes. And those who didn't raise their hands: you have been wrong, work on it. It's a superpower; admitting you're wrong is where learning starts. I was a consultant, by the way. We are supposed to be right all the time; we are paid to be right. It's very hard to go into a situation and go, "I'm wrong. Yes, I told you to do that. Bad idea."

The beauty of being a speaker on stage year after year is that you get to come back and go, "Oh, yes, they've heard me before. Yes, that's where my mind was then, and now I'm here. Ouch, I've taken this journey." We are all wrong frequently.

The guy [on the slide] is a doctor that's got bad news; that's the "I need to talk to you privately. I've got bad news" face. We are wrong all the time. They don't know what they want, we don't know what we are doing, we don't even know the architectural decisions that will matter, we don't know the design that will matter. We're making it up as we go along; we are ad-libbing. We could learn a lot from comedians who make it up as they go along on stage, like me. I'm not a comedian, I'm not funny, but essentially, we are making it up as we go along. Embrace that.

Wrong is scary, though, I'll give you that. Risk: business risk, financial risk, people might die. I used to work on fast jets. People think fast jets are pretty easy: someone says, "I want a jet, and it's got to be fast," and requirements capture is done. I worked on 3D audio in jets, which is pretty awesome when you think about it. You're flying along, and if a missile is coming at you, you want to know where it is in a 3D representation. That was interesting work, and none of it was in the spec; no one had sat down and said, "That's what we want from a fast jet." There's always risk, there's always jamming.

It's okay that there is risk. Who here works in a bank or a financial institution, a fintech? Do they know what they're doing? Do they know exactly what that system is supposed to do? Even there, where you think, "Well, handle my money. Simple." There's plenty of room for maneuver; the 2008 crash would tell us that.

We are worried about risk mainly because of consequences. It's the "Oh my goodness, someone's called me out" moment, the "why us?" There are two factors that contribute to this, in software anyway. We've got feature velocity: go faster, quickly. I'm not going to ask how many people here have feature velocity points they compare with other teams, but if you are doing that, stop it; your features are not their features. We all want to go quicker, and that's fair enough, but at the same time, if we don't go faster with a bit of reliability, we haven't gotten faster.

Feature Velocity Vs Reliability

I'm a big fan of Formula One, and I apologize for this. Most people don't like Formula One because it's basically watching cars in a car park at 200 miles an hour. I love it because the teams have to do two things: they have to ship features to the car at a huge rate, but the car must also run, and if it doesn't, they have failed. I've worked with a few of these teams, and I love this because it's a huge balancing act. But it's actually a balancing act built on a misinterpretation. We strive for reliability, and we think it's feature velocity versus reliability: we can ship more and more, but then we have to pay down the reliability tax, otherwise it won't be reliable.

That's not the case, it turns out. As our friends who wrote the book "Accelerate" would tell us, the faster you go, the faster you release, the more reliable you might actually be, and that's a weird thing. I'm going to tell you a quick story, and it's a slightly old story, but I worked at a company where they did continuous delivery. They continuously delivered once a year; it's continuous, every year. What I loved about it, apart from the fact they had Signing Day, which is where they all got in a room and signed it off and said, "Yes, my stuff won't kill it at all" (I always imagined there were proper fountain pens with blood in them), was the complete cognitive dissonance of it: "We do release once a year; it doesn't work. Every year, we take two months fixing the stuff we just released, and you can tell, because we do projects that are nine months long. Why? Because we know we've got to fix the crap from the last time."

The truth is, the faster you release, the more reliable you might be; the more opportunity you have for injecting reliability. The good news is there's absolutely no conflict between these things. We take that in hand, and then we throw microservices in.

I'm not going to embarrass us all by asking what a microservice is because I've been teaching microservices for about four or five years now, and I start the course by going, "I do not care what your microservices are, and I do not care how big they are. What I care about is what you can do with the system that you couldn't do before. Could you deliver more frequently? Could you guarantee better availability?"

There are a number of factors that might encourage you to split your system up into something we could notionally agree is microservices. We might decide that it's better to partition up our people so they can work on things they can own. All good. The challenge is, when you do that, there's a temptation to do what my friends in Florida did, which is build more tests, more gates, because we really want to be sure it works in production. The business is going, "Will it work?" We're going, "We've done microservices." "Yes, but will it work?" "That's not the question. We're faster now." "Yes, that's good. That's better than we were, but does it work?" "Yes, it could do." So the temptation is to go, "We're good, right? We're covered. We've done all these things. We've got big gates. We've got long pipelines. It's all good."

I'm going to give you a little story now, and given the QCon audience, you probably know it; I'm pretty sure some of you may be able to name this individual. Does anyone want to shout out? Margaret Hamilton. Margaret Hamilton was, essentially, the lead technologist on the Apollo program. I'm sure some of you will be having light bulbs go off now, but this story is not about her. This story is about her daughter, the world's first chaos engineer, or at least I like to think so. If you had to take your kids to work, what would you do? To get them out of your hair, you get them to do something: "Go, go. Just leave me to do something important." Lauren, the most spoiled kid in the world, got the Apollo mission simulator. Can you imagine going to school saying you played with that over the weekend?

She got to play with it, but she did what any kid does: she broke it, and she broke it spectacularly. She broke it to the point where you had to learn from it. What happened is she managed to flush the mission records out of the system with a key combination that could genuinely happen, and it basically broke the whole mission. You can imagine: you've just taken off, but we haven't got any records anymore. A difficult moment. NASA's reaction was beautiful, and I bet you've heard this in your company: "That will never happen. All our astronauts are made of the right stuff. We've invested a lot of money in what they can do and the way they think, and they would never do that."

That picture [on the slide] would never have happened without learning from that simulation. It turned out that Margaret didn't listen to them at NASA saying, "It can't happen." She took Lauren's lesson, the first chaos engineer's first lesson, and went, "OK, that can happen," and it did, before that picture was taken, and I love that. The fact that it can happen means it might happen.

Think about this for a moment: how many people here know their system is working right now? I don't mean it must be working because customers aren't calling us. That happens; I've been to a lot of customers where I say, "How do you know it's working?" "No one has called yet." How do you know it's working? How do you know it's normal? How do you define normal? Just knowing what normal is might be a good starting point for your system.
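To make "define normal" concrete, here's a minimal sketch of a steady-state probe in Python. Everything in it, the health endpoint, the latency budget, the module name, is an illustrative assumption rather than something from the talk; the point is that "normal" becomes a measurable claim you can check at any time, not a feeling.

    # steady_state.py - a sketch of "defining normal" as a measurable probe.
    # The endpoint and latency budget below are hypothetical; substitute
    # whatever "normal" means for your own system.
    import time
    import urllib.request

    SERVICE_URL = "http://localhost:8080/health"   # assumed health endpoint
    LATENCY_BUDGET_SECONDS = 0.5                   # assumed "normal" response time

    def steady_state_ok(url: str = SERVICE_URL) -> bool:
        """Return True if the system looks 'normal': HTTP 200 within the budget."""
        started = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=LATENCY_BUDGET_SECONDS) as response:
                elapsed = time.monotonic() - started
                return response.status == 200 and elapsed <= LATENCY_BUDGET_SECONDS
        except Exception:
            # Timeouts, connection errors, non-2xx responses: anything outside
            # "normal" counts as a deviation from the steady state.
            return False

    if __name__ == "__main__":
        print("steady state holds" if steady_state_ok() else "steady state deviated")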

When it comes to the business, it's really easy to justify all of this with Gartner statistics; pinch of salt involved, possibly a mountain. The weird thing about the $100K-per-hour downtime cost is that I actually did an exercise with a customer recently. They phoned up and said, "Yes, we're thinking of doing chaos engineering. Why would we do it?" They worked it out for themselves, on the call, that they were losing about $100K an hour. I don't care what you're losing, I don't care about the number; your system is relied upon by people, and that's worth more than the money. But if the business needs a reason, give it money. It was amazing to see that number justified completely independently, for a totally unrelated system, just for fun.
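As a purely illustrative bit of arithmetic, taking that $100K-per-hour figure at face value, you can put a yearly number on availability targets; the targets below are assumptions, not from the talk:

    # Illustrative arithmetic only: the $100K/hour figure comes from the talk;
    # the availability targets are assumed for the sake of the example.
    HOURLY_COST = 100_000          # dollars lost per hour of downtime
    HOURS_PER_YEAR = 24 * 365

    for availability in (0.999, 0.9999):
        downtime_hours = HOURS_PER_YEAR * (1 - availability)
        yearly_cost = downtime_hours * HOURLY_COST
        print(f"{availability:.2%} available -> {downtime_hours:.2f} h down/year "
              f"-> ${yearly_cost:,.0f}/year")
    # 99.90% available -> 8.76 h down/year -> $876,000/year
    # 99.99% available -> 0.88 h down/year -> $87,600/year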

That's not even the big stuff, not the bad stuff. I have a confession: I don't work on the cool systems. I don't work on Netflix, I don't work on Etsy, I don't work on the awesome systems that you all love. I work on the stuff that you ignore, and it has to be more reliable for it. When it comes to lost revenue, or just pain to individuals, to people, it's massive when it goes down.

There was a recent BBC article about banking outages this year, which was amazing; I encourage you to go and read it. I work on those systems. They don't make for fun talks, but they are really crucial. The whole point is that it's not really about the cool stuff. It's not even about the stuff that you hear about in the press. This is the stuff that happens on a regular basis: systems are sufficiently complex that we can't anticipate where they're going to fail, and they fail frequently.

Feature Velocity + Reliability

This is where I drop into John Osborn mode. I should point out, I am the warm-up artist for John Osborn later. If you aren't planning to be here for John later, where the hell are you going to be? Because the guy's a legend, and I hope he's turning red now.

Can we design this out? Well, this is my story for you. Unless you're creating Hello World on a regular basis, you're not in the simple domain; your systems are not simple. Sorry, they're just not. It's really attractive to create simple systems, because then we have best practices. As a consultant, I love simple systems; I can say, "Do that. You'll be better," and I look like a legend. But that's not how it works. You all here probably think you're creating complicated systems. The downside is we have to let go of best practices, but we do have good practices. We can do this; a consultant can turn up and go, "Do this. You might be better."

That's cool, too, but if you're building anything modern that involves any amount of distribution, and has external dependencies whose reliability you can't know, then you're at least in the complex domain, and it gets worse than that. It's not just the complexity of your system; it's the exponential potential complexity of everything it takes to manage it and look after it.

The thing about complex systems is that I don't know what's better for you. That makes me somewhat redundant at a conference, by the way. People on stage can only tell you what we think might be better for you. We can suggest; everything you've seen on stage is a hypothesis. Apply it with care, because you have complex systems, where whether it's better can only emerge.

I have a worse situation for you, because most people here have systems that evolve quickly. If you don't, well done. Honestly, I've been to one customer in my entire career that said, "We don't change the system very often." I was brought in as an Agile consultant, I apologize; I was brought in to help them, and I said, "Why am I here?" They said, "Because we want to feel better about it." I went through it: what changes? What's the risk? None, not really; they just wanted to feel better.

Most systems aren't like that. Most systems change relatively frequently, because that's how we get a competitive advantage, that's how we do better. If you have that, then the system has the potential to deviate into the chaotic, and chaotic is not nice. The phrase I use for chaotic is: we're surprised when it's better. We do something and we go, "Wow, that worked. That's awkward," because more frequently we do stuff and it doesn't work. Wouldn't it be great if we had a technique that helped us engineer ourselves out of the chaotic and into the merely complex? Chaos engineering, which, by the way, is a terrible term; we are not engineering chaos at all, we're engineering ourselves out of it.

Can’t This Be Prevented?

This is where I go proper Osborn on you. Can it be prevented? Can we design it all out? No, it's essentially that bad. There's accidental complexity and essential complexity. Accidental complexity is the stuff we introduce ourselves. There was a lovely study done about why we introduce complexity into systems, and a lot of answers came out of it, but one of the answers I love is that we introduce complexity because we're bored by the business problem. Chances are you did not get into software to build the system you're building; you didn't dream as a child, "I will build the perfect HR system."

We make it interesting for ourselves, which is a dangerous cycle, but there is also an essential level of complexity in there. It's not like the accidental stuff that we introduce; there's an essential level in there, and then there's this wonderful phrase called dark debt. I like dark debt, because most people only think of technical debt. If you've ever sat there in a stand-up, or whatever situation you're in, and turned around and gone, "Yes, the thing you've asked for: difficult," then the business didn't know it, but you've accrued technical debt. That stuff they're asking for: ouch, two months of effort. You didn't know that? I know, it's only moving the combo box from this page to another, but it's two months. The business goes, "Why?" We go, "Tough."

Technical debt is something you know you're accruing, and it's not just about you knowing; everyone else should know, too. Dark debt is a great name, by the way, a great name to use at Halloween: a name for something we don't know we're accruing. Every system of sufficient complexity has this, and you aren't working on simple systems.

Nothing beats trying to quote someone who is going to speak later, unless he wants to change his mind. This is a quote that I firmly believe in: "Do you want to double down on dark debt?" You can't design it out, but you can get better at it. You can surface some of it, you can embrace some of it, and you can improve your system's robustness for what you actually identify. If you're not doing that, you'll be bitten by it.

You're not covered right now; I imagine most people here, if you're honest with yourselves, are not covered. Dropping a few more quotes: to actually describe a microservice-based system takes a lot of detail. I'm literally paraphrasing the speaker later; I'm putting him on the spot, so later on, he's got to come up with new stuff. And the rate of change is high; of course it is, we're trying to embrace evolving systems.

We don't know what these parts are doing anymore; we didn't know before, by the way. For those that think that with monolithic systems we knew what the system was doing: no, we didn't. I don't remember knowing what those monolithic systems did before, let alone now. We've just thrown complexity into the mix to say, "It could all be different."

Blame

One reaction is this: ugly risk avoidance. This is where the management turns around and goes, "Stop now. We don't need microservices. Now that you've explained it to me, we're not going to do it anymore." You also have the ability to turn around and go, "If it goes wrong," back to Signing Day, "you're wrong." People think of blame as, "I blame someone else." I did a talk before I went on tour around the U.S., and I was impressed, because I asked a room, "Can you share with me your production incidents?" One person treated it as the best day of their life. They leapt up, they threw themselves at the stage; it was quite a high stage, actually higher than this one. I come from the Keith Richards school of stagecraft, which is: if you come on my stage, I'll kill you, but he was a bit quiet, so I let him. He leapt up, and he turned around to the whole room and said, "It was me." He blamed himself before anyone could.

I have a suspicion that blame starts with you; you're the one taking responsibility. The problem is that learning and improvement stop at that moment. The moment you say, "It was me. I wrote the wrong command. I did the wrong thing," the system doesn't get analyzed, and we need to look at the system as well. It was interesting: this particular individual, the "it was me" one, it turned out he had killed the master node of a grid-computing cluster. He destroyed three weeks' worth of work, and I was hoping that they knew how. Apparently, they did. He went through the story of what he'd done; he wrote the command out for us. The difference between the right command and the wrong command was one character. Who's at fault here? That's still the wrong question. "The system has failed" might be a better statement. Why? "What's happening?" is much more interesting. As soon as he said, "It was me," the answer was, "Stop him touching the keyboard." That's what they'd done.

A better reaction to these things is getting better at being wrong. You're going to be wrong, so why not get better at it? I've trademarked it. No, I haven't, but if you want a slogan, it's a good one: get better at being wrong. We know we're wrong frequently, the whole cycle, so why don't we get really good at it? Make it safer to be wrong. There are actually a few people out there who say they do chaos engineering experiments where they destroy things in production, and that it's safe. That's wrong. There are two types of things you do with chaos engineering, and experiments are the ones where you don't know what you're going to find, but you think the system will survive. By the way, that's a given: if you're doing chaos engineering, don't do anything unless you think the system will survive; otherwise, you are just a sadist. If you're inclined to run an experiment at 5 p.m. on a Friday and you know it will fail, step away from the keyboard, please, because that's how chaos engineering fails.

It's never completely safe; science isn't safe when we're discovering new knowledge, it can't be. We're surprised by the answer. But we can be pretty intelligent about how much we put at risk in order to learn at any given moment, and in chaos engineering, we call that the blast radius.
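To ground that, here's a minimal sketch of what a small, blast-radius-limited experiment could look like in Python. Everything here is a hypothetical illustration: the instance names, the Docker-based failure injection, and the steady_state_ok probe from the earlier sketch; real tooling (such as the open source Chaos Toolkit that Russ's team works on) structures experiments along the same lines: a steady-state hypothesis, a method, and rollbacks.

    # chaos_experiment.py - a hedged sketch, not real tooling.
    # Hypothesis: we believe the system survives losing ONE instance.
    import random
    import subprocess

    from steady_state import steady_state_ok   # the probe sketched earlier

    CANDIDATE_INSTANCES = ["app-1", "app-2", "app-3"]   # hypothetical names

    def kill_one_instance(name: str) -> None:
        # Inject the failure: stop a single instance (assumed Docker-managed).
        subprocess.run(["docker", "stop", name], check=True)

    def restart_instance(name: str) -> None:
        # Rollback: bring the instance back whatever the outcome.
        subprocess.run(["docker", "start", name], check=True)

    def run_experiment() -> None:
        # 1. Never start unless the steady state already holds; if it doesn't,
        #    you have a known problem to fix, not an experiment to run.
        if not steady_state_ok():
            print("Steady state doesn't hold; aborting the experiment.")
            return
        victim = random.choice(CANDIDATE_INSTANCES)   # blast radius: ONE instance
        try:
            kill_one_instance(victim)
            # 2. The hypothesis: the system still looks normal with one gone.
            held = steady_state_ok()
            print(f"Lost {victim}; steady state {'held' if held else 'deviated'}.")
        finally:
            # 3. Always roll back; we're engineering ourselves OUT of chaos.
            restart_instance(victim)

    if __name__ == "__main__":
        run_experiment()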

You're increasing technical robustness, but that isn't the whole game. You will increase the technical robustness of your systems, but, unless John takes me outside and shoots me later, my view on resilience is that we're going to learn about the stuff we didn't know, and we get better at learning as we go. That, to me, is resilience engineering: we get better at learning about our systems. I'm going to close out with this: that is the superpower. The one thing you could take away from this talk is: learn about your systems, get better at learning, make being wrong a superpower. That's what chaos engineering helps you to do.

 


 

Recorded at:

Jun 14, 2019
