Russ Olsen did the opening keynote titled "To the Moon" at the GOTO Berlin 2015 conference in which he explored parallels between space missions and developing software.
InfoQ interviewed Russ Olsen about the drawbacks of doing everything at the same time to meet a deadline, learning from things that went wrong and from things that went right, how little things can kill you in software development, and how to focus and deal with details when doing complex work.
InfoQ: Most of the talk is in the form of the story of the first Moon landing. Why did you choose to structure the talk that way?
Olsen: You know, people use the phrase "This changes everything" quite a bit. Rarely does the thing they are talking about change anything, let alone everything. That Sunday afternoon when we landed on the Moon really did change everything, at least for me. I suddenly had a direction in life, I knew what I wanted to do when I grew up. And I’m not alone: I’ve met any number of people who walked away from that day transformed. I tried for a long time to get that experience across to folks who weren’t fortunate enough to be there, but the only thing that ever worked was to have them relive it with me, moment by intense moment. Hence the talk.
InfoQ: In the space missions, people did everything at the same time to meet the deadline. Is this something that you also see in software development? Do you think that it’s effective?
Olsen: The situation that NASA had with the Apollo project was that they needed to finish a nearly impossible project before an insane deadline. Since the timeline didn’t allow for doing things in a logical "Build step two on top of step one" process, what they ended up doing was simply making a list of things that needed to happen and doing them all at once and then integrating them at the end. This is a horribly inefficient way of doing things: You inevitably have people building components that don’t work together properly. You have people solving the same problem over and over because they are unaware someone else is also doing it. And you have huge amounts of rework as the overall design changes.
The most obvious symbol of the crazy way that Apollo was done is the huge rocket engine on the back of the Apollo mother ship. That rocket engine was originally designed to lift the mother ship off of the surface of the Moon. But as the overall design of the project solidified, they included a separate, specialized landing spaceship. So the mother ship didn’t have to land on -- or lift off from -- the Moon after all. And that meant that the mother ship didn’t need a monster rocket engine. But they had already designed the huge engine in and it was easier to just leave it.
Sadly, I’ve seen the "build it all now and integrate it later" technique used all too often in software development. Apparently I’m not the only one: In the talk I joke that this way of going about building systems "Is always a great idea". People -- especially developers -- always laugh. Why do developers sometimes go down that road? For the same reason the NASA people did -- crazy goals and insane deadlines. Let me put it this way: If you are managing a software project and you are setting impossible goals and crazy deadlines, well, depending on your team they might just pull it off. Probably not, but possibly. But succeed or fail, the impossible goals and crazy deadlines are going to cost you in wasted effort, in team burnout, in broken designs, in a hundred different ways.
InfoQ: Can you give some examples how people learned from the things that went wrong in space missions? Do you see similar learning patterns in software development?
Olsen: One thing I think you see in aerospace in general is a relentless and methodical study of failures, both actual and potential. If a part fails in an airliner or a space vehicle, there is a systematic set of questions asked. Not just "Why did this part fail?" but also, "Why didn’t we catch this sooner?" and "How often is this happening in other vehicles?" Not to mention "Is this failure an example of a more general class of failures?" And not just out and out failures: Aerospace engineers are always looking for anomalies, things that should not be happening even if they aren’t doing any immediate damage. Critically, aerospace engineers tend to take a data driven, statistical approach to failures.
It’s getting better now, but traditionally software developers have treated failures as random bolts from heaven: Perhaps the logging service went down, so we fix the problem and forget it. Or worse, we treat software defects as evidence of the moral failings of the developers involved: "How could you let this happen?!!" is an understandable reaction to software failures, but not a productive one. We have also tended to ignore anomalous behavior: If it ain’t broke, don’t fix it. Except that sometimes your system isn’t broke but it is desperately trying to get your attention, trying to tell you that it is about to fall over.
You can look at software systems as complex machines composed of many parts. And it turns out that gathering actual data about your system -- including the failures and the "funny" events -- and trying to build an understanding of your system based on that data is something we should all be thinking about.
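Olsen's "data driven, statistical approach to failures" can be sketched in a few lines of code. The sketch below is a minimal, hypothetical illustration (the `EventLog` class and its event names are invented for this example, not from any real tool): it records both outright failures and "funny" anomalies, then answers the aerospace-style question "how often is this happening elsewhere?" with counts rather than anecdotes.

```python
from collections import Counter
from datetime import datetime, timezone

class EventLog:
    """Minimal in-memory record of failures and anomalies (illustrative only)."""

    def __init__(self):
        self.events = []

    def record(self, component, kind, message):
        # kind is "failure" for outright errors, "anomaly" for
        # suspicious-but-still-working behavior we might otherwise ignore.
        self.events.append({
            "time": datetime.now(timezone.utc),
            "component": component,
            "kind": kind,
            "message": message,
        })

    def counts_by_component(self, kind):
        # The aerospace question -- "is this happening in other vehicles?" --
        # answered with data: tally events of one kind per component.
        return Counter(e["component"] for e in self.events
                       if e["kind"] == kind)

log = EventLog()
log.record("logging-service", "failure", "connection refused")
log.record("logging-service", "anomaly", "reconnect took 30s")
log.record("billing", "failure", "timeout")
print(log.counts_by_component("failure"))
```

Even something this crude changes the conversation: instead of fixing the logging service and forgetting it, you can see whether one component accounts for most of your failures, and whether anomalies cluster there before anything actually breaks.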
InfoQ: How about learning from successes, from things that went well? What’s your view on that? Is it done sufficiently in the software industry?
Olsen: One theme I’ve seen time and time again in the software industry is that we overgeneralize the lessons from our successful projects. We do a project with a clear, well understood goal, an engaged customer, solid technology, flexible thinking and good design documents and we say, "Hey, that project worked because we have good design documents!" Or perhaps, "We met with the customer for an hour every day, that’s what made it all happen." Or "We understood the requirements, that’s the magic bullet!" The trouble is, there are a million things that need to go right to build a complex software system. If you have a success then great, do try to identify the root causes of that success. But note the plural: causes.
InfoQ: You mentioned that little things can kill you when you are in space. Is that also the case in software development?
Olsen: Software development is about getting things right at every scale. Clearly getting the overall design of a software system right is vital. But a great design poorly implemented is not going to do you much good. Nor is an otherwise good implementation that is full of bugs likely to make anyone happy. Certainly any number of software systems have been brought down by "minor" bugs.
InfoQ: Do you have suggestions how you can focus when doing complex work? How can you deal with details that matter?
Olsen: The idea I tried to get across in the talk is that when you are building something as complex as a Moon rocket, or perhaps a medium sized enterprise information system, there will be failures. Let me say that again: There will be failures. Software developers spend their days trying to get the details right and their nights and weekends looking for new ways to ensure that the details are in fact right. But if we have learned anything in the past half century of programming, it is that you can drive the defect rate down, but you can’t make it zero. So we need to build our systems, build our organizations, build ourselves an outlook that accepts that failures will occur and be ready to deal with those failures. The right attitude is to work as hard as you can so that this function, this service, this system is as reliable as you can practically make it. And then work on what you are going to do when it fails. The famous quote from the US space program, that "Failure is not an option" is, in fact, exactly wrong when it comes to large scale software systems where failure is a certainty. The only question is, will you and your system be prepared when something does fail?
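Olsen's "work on what you are going to do when it fails" is, in code, the familiar retry-then-degrade pattern. Here is a minimal sketch (the function and its parameters are hypothetical, invented for illustration): try the primary operation a few times, and when it still fails, fall back to a planned, degraded answer instead of crashing.

```python
import time

def call_with_fallback(primary, fallback, retries=2, delay=0.0):
    """Try primary() up to retries+1 times; on persistent failure,
    return the planned fallback instead of propagating the error."""
    for attempt in range(retries + 1):
        try:
            return primary()
        except Exception:
            if attempt < retries:
                time.sleep(delay)  # back off briefly before retrying
    # Primary has failed every time; failure was a certainty we planned for.
    return fallback()

# Usage: a service call that always fails, with a cached default ready.
def flaky():
    raise ConnectionError("service unavailable")

result = call_with_fallback(flaky, lambda: "cached default")
print(result)  # cached default
```

The design point is not the retry loop itself but the second argument: deciding, before deployment, what the system's answer will be when the primary path is down.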
InfoQ is covering the GOTO Berlin conference which takes place on December 3 and 4. If you’ve not seen it you can watch a previous version of the talk that Olsen gave at QCon London in March.