Dead code needs to be found and removed. Leaving it in place is an obstacle to programmer understanding and action, and it carries the risk that the code is awakened, which can cause significant problems. Deleting dead code is not a technical problem; it is a problem of mindset and culture, argued Kevlin Henney.
Kevlin Henney, an independent consultant and trainer, gave the opening keynote "The error of our ways" at the European Testing Conference 2017, in which he described how a company lost hundreds of millions of dollars when dead code was awakened. InfoQ is covering the conference with Q&As, summaries and articles.
Software failures can be personally inconvenient or annoying, but they can also have a significant economic or social impact. Henney showed several examples where millions were lost due to small bugs.
Failures can come from dead code: code that is present in the system but is not supposed or expected to be used anymore. When this code is accidentally executed, it can make systems fail dramatically, argued Henney. His advice, therefore, is to remove dead code to prevent this.
InfoQ interviewed Kevlin Henney about dealing with problems that can occur from dead code and asked him for advice for dealing with error conditions and exceptions.
InfoQ: You talked about how dead code that became activated led to a big loss for one company. Can you explain what happened?
Kevlin Henney: When the New York Stock Exchange opened on the morning of 1st August 2012, Knight Capital Group’s newly updated high-speed algorithmic router incorrectly generated orders that flooded the market with trades. About 45 minutes and 400 million shares later, they succeeded in taking the system offline. When the dust settled, they had effectively lost over $10 million per minute.
This bankruptcy-defining event arose from a perfect storm. In anticipation of a new NYSE system, to be launched on the 1st of August, they had deployed updates to their servers. They updated their servers manually and, unbeknown to them, one of the deployments failed, leaving the old version running. To take advantage of the new NYSE system, they recycled an old flag, a flag that was no longer used but had now been repurposed to mean something different. Although it hadn’t been used in eight years, the old version of the code still had a dependency on the old flag.
The code had been dead for years, but was awakened by a change to the flag’s value. The zombie apocalypse arrived and the rest is bankruptcy.
InfoQ: What could they have done to prevent this problem from happening?
Henney: In a perfect storm, any one of the contributors can be considered bad, but it is their combination that proves disproportionately bad. Changing any one of the contributing factors could have prevented or at least reduced the damage:
- The servers were updated manually: this is a job that cries out for automation.
- The failed update was not noticed: no one reviewed the updates. Aircraft doors are locked manually, but safety is promoted by having the crew crosscheck.
- Dead code is not truly dead until it’s buried. They should have removed the superfluous code years before.
- There are sometimes very practical reasons why flags, records, etc. are repurposed rather than new ones added, but if it is at all possible to add without recycling, addition is preferred.
- There was no escalation process in the company for dealing with a rogue trading system. It is often difficult to anticipate the exact nature of failures, but if the reason you have a high-speed trading system is to do what humans do but faster, it should be obvious that humans are also able to lose money. A high-speed system multiplies this ability. They should have had a clear escalation process and, even better, a stop-the-line culture.
- During those 45 minutes, not understanding the cause of the problem, they actually rolled back the updates that were successful. This compounded the problem rather than solving it. Panic-response messing with a live system that is going wrong for unknown reasons? Don’t do that.
- Mirroring their lack of automated installation, they also lacked a simple way to take the system offline. Automate both "update" and "emergency stop".
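The crosscheck idea from the list above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any real deployment tooling: the host names, version strings and the functions `find_stale_hosts` and `safe_to_enable` are all hypothetical, and the sketch assumes each server can report which version it is running.

```python
from typing import Dict, List


def find_stale_hosts(reported: Dict[str, str], expected: str) -> List[str]:
    """Return the hosts whose reported version differs from the expected one."""
    return sorted(host for host, version in reported.items() if version != expected)


def safe_to_enable(reported: Dict[str, str], expected: str) -> bool:
    """Allow the new behaviour only when every host runs the expected version."""
    return not find_stale_hosts(reported, expected)


# Hypothetical fleet in which one deployment silently failed.
fleet = {"srv1": "2.0", "srv2": "2.0", "srv3": "1.4"}

print(find_stale_hosts(fleet, "2.0"))  # ['srv3']
print(safe_to_enable(fleet, "2.0"))    # False
```

The point of the design is that the check is automated and runs before anything depends on the update, rather than relying on someone noticing a failed installation.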
InfoQ: What suggestions do you have for dealing with dead code?
Henney: Find it. Delete it.
Finding it is the hard part. Sometimes dead code is genuinely unreachable, and static analysis can tell you this. The effectiveness of static analysis depends on tools, language and architecture, but it’s a good start.
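As a rough illustration of what a static check can look like, the following Python sketch uses the standard `ast` module to list functions that are defined in a piece of source code but never referenced by name within it. It is deliberately crude and is an assumption of this example rather than a tool mentioned in the interview: the function `unreferenced_functions` and the sample source are hypothetical, and the check cannot see dynamic calls or uses from other modules, so its output is a list of candidates to investigate, not a verdict.

```python
import ast
from typing import List


def unreferenced_functions(source: str) -> List[str]:
    """List functions defined in `source` whose names never appear elsewhere in it."""
    tree = ast.parse(source)
    defined = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            used.add(node.id)      # plain references such as foo()
        elif isinstance(node, ast.Attribute):
            used.add(node.attr)    # attribute references such as obj.foo()
    return sorted(defined - used)


# Hypothetical module source used only for demonstration.
SAMPLE = '''
def route_order(order):
    return validate(order)

def validate(order):
    return order is not None

def legacy_router(order):
    return order

route_order("buy 100")
'''

print(unreferenced_functions(SAMPLE))  # ['legacy_router']
```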
Insider dealing is considered an illegal practice on the stock exchange, but it’s perfectly acceptable to take advantage of insider knowledge in the codebase. Developers may already have a good idea of what code is surplus to requirements. Likewise, take advantage of product feature knowledge: when features are withdrawn or superseded at the requirements level, code associated with those features may also be due for retirement.
Another clue that can be used is code stability. Your version control system is a knowledge base of change. Which parts of the codebase never change? There are many reasons code may be stable (it’s just right, it’s just dead, it’s just too scary), but unless you investigate you’ll never know. Of course, dead code may still end up being changed as a consequence of an automated refactoring, but such changes also have a signature: they correlate with other refactoring changes, but they are not made to fix bugs or add features.
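One way to turn the stability clue into a starting list is sketched below in Python. The sketch assumes that the date of the last change to each file has already been collected from the version control system, for example from its log; the file paths, dates and the function `stable_files` are invented for illustration. It ranks files by how long they have been untouched and says nothing about why they are stable.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def stable_files(
    last_changed: Dict[str, datetime], now: datetime, min_age_days: int
) -> List[Tuple[str, int]]:
    """Return (path, age in days) for files untouched for at least min_age_days, oldest first."""
    cutoff = timedelta(days=min_age_days)
    aged = [
        (path, (now - when).days)
        for path, when in last_changed.items()
        if now - when >= cutoff
    ]
    return sorted(aged, key=lambda item: item[1], reverse=True)


# Hypothetical last-change dates, as might be extracted from a repository log.
history = {
    "router/orders.py": datetime(2021, 7, 20),
    "router/legacy_flag.py": datetime(2013, 3, 1),
    "router/pricing.py": datetime(2020, 1, 15),
}

for path, age in stable_files(history, now=datetime(2021, 8, 1), min_age_days=365):
    print(path, age)
```

Each file in the output is a hypothesis to investigate: it may be just right, just dead, or just too scary.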
Runtime monitoring of the system cannot definitively tell you which parts are dead, but it can tell you which parts are definitely alive. This helps you narrow the search.
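A minimal Python sketch of this kind of monitoring follows. It assumes functions opt in through a decorator; the names `track`, `never_called`, `current_path` and `old_path` are hypothetical. Anything that has been called is definitely alive, while anything tracked but not yet observed is only a candidate, since it may simply not have been exercised during the observation period.

```python
import functools
from typing import Callable, Set

registered: Set[str] = set()  # every function that opted in to tracking
called: Set[str] = set()      # functions observed running at least once


def track(func: Callable) -> Callable:
    """Record that func exists, and note each time it is actually invoked."""
    registered.add(func.__name__)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        called.add(func.__name__)
        return func(*args, **kwargs)

    return wrapper


def never_called() -> Set[str]:
    """Candidates for investigation: tracked but not observed so far."""
    return registered - called


@track
def current_path(x):
    return x + 1


@track
def old_path(x):
    return x - 1


current_path(1)
print(sorted(never_called()))  # ['old_path']
```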
In short, develop hypotheses and investigate.
Deleting dead code is not a technical problem; it is a problem of mindset and culture. There is often the sense that if code is not doing anything, it has no effect, so it’s OK to leave it. It is worth keeping in mind that exactly the same reasoning also allows us to remove it: if it’s not doing anything, remove it. Let the version control system remember it for you.
There are many reasons to remove dead code, not just the possibility of a zombie apocalypse. Dead code makes the runtime footprint larger than it needs to be. If you care about performance, then care about performance. Dead code is an obstacle to programmer understanding and action; bulking out a system with dead code wastes developer time and discourages a culture of treating the software as soft, and therefore always open to revision and improvement.
If you’re unsure whether or not code is dead, and that’s why you don’t want to remove it, then that uncertainty is already telling you a great deal about your architecture and the developer relationship with it.
InfoQ: You also talk about problems with code that is not dead, but is often neglected and almost left for dead: error-handling code. How can incorrectly handling non-fatal errors lead to catastrophic failures?
Henney: The more a piece of code is used, the more likely it is that its bugs will have been revealed and fixed. Error-handling code is often some of the least well-explored code in a system. Many error cases are rare edge cases, so although code is present to handle them, the correctness of the code is unproven. And then the rare condition arises, taking control flow down a broken path — code that’s supposed to be handling an error condition, not creating a new one — and boom. Sometimes quite literally, as was the case with the loss of the first Ariane 5 rocket in 1996.
InfoQ: What advice can you offer for dealing with error conditions and exceptions?
Henney: Reviews, static analysis and automated tests. Walking and talking through the code can help reveal oversights and generate new questions and awareness. Depending on the language, you can get some helpful feedback from tools about what might happen at runtime.
Testing is one way to be sure that a piece of code is executed in a safe environment, although keep in mind that people often have a blind spot and optimism bias where error-handling code is concerned, testing only the happy-day scenarios. To address this, always ask: If an error condition is signalled, where is the corresponding handling code? Where is the test for the error condition? And where is the test for the handling code?
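The three questions above can be made concrete with a small Python sketch using the standard `unittest` module. Everything in it is hypothetical and chosen for illustration: the `FeedUnavailable` exception, the `latest_price` function and its fallback behaviour are assumptions of this example, not something described in the keynote. The tests cover the happy-day path, the signalled error condition, and the behaviour of the handling code itself.

```python
import unittest


class FeedUnavailable(Exception):
    """Hypothetical error signalled when a price feed cannot be reached."""


def latest_price(fetch, fallback):
    """Return the fetched price, or the fallback if the feed is unavailable."""
    try:
        return fetch()
    except FeedUnavailable:
        return fallback


class LatestPriceTest(unittest.TestCase):
    def test_happy_day(self):
        self.assertEqual(latest_price(lambda: 101.5, fallback=100.0), 101.5)

    def test_feed_unavailable_uses_fallback(self):
        # Exercise the error path deliberately instead of waiting for it in production.
        def failing_fetch():
            raise FeedUnavailable("feed down")

        self.assertEqual(latest_price(failing_fetch, fallback=100.0), 100.0)

    def test_unexpected_errors_are_not_swallowed(self):
        # The handler should deal only with the error it claims to handle.
        def broken_fetch():
            raise ValueError("bug")

        with self.assertRaises(ValueError):
            latest_price(broken_fetch, fallback=100.0)
```

Saved in a file, these tests can be run with `python -m unittest`. Passing the fetch operation in as a parameter is a design choice made here so that the rare failure can be triggered on demand in a test.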