In my career I have heard this critical question from only one of my employers and consulting clients: is this software safe?
In our method wars we expend massive energy on features, epics, stories, requirements techniques, testing approaches, and more. But is what we produce "safe"?
A safe product:
1. Does not cause property, financial, or data damage.
2. Does not physically, emotionally, or reputationally harm human beings or other sentient creatures.
3. Resists security threats from malicious sources that could otherwise lead to the damage or harm in #1 or #2 above.
The Current State
In July 2015, white-hat hackers Charlie Miller and Chris Valasek took over a 2014 Jeep Cherokee being driven by journalist Andy Greenberg. As Greenberg drove at 70 miles per hour, Miller and Valasek, sitting 10 miles from the vehicle, manipulated the car's air conditioning, radio, and dash display, activated and deactivated the brakes, and eventually brought the car to a complete stop. After Greenberg's story made national news, Fiat Chrysler recalled 1.4 million vehicles, only to be embarrassed again two months later when it issued a recall for 8,000 SUVs with the same insecure Uconnect component.
An archive of car hacking stories is maintained on the Security Affairs website. As a software engineer, I am deeply disturbed that industry experts view automobile manufacturers as far behind the rest of the software industry. According to Ed Adams, a researcher for Security Innovation, which tests automobile safety, “Auto manufacturers are just behind the times. Car software is not built to the same standards as, say, a bank application. Or software coming out of Microsoft.”
As we careen toward a ubiquitous Internet of Things, we are confronted with new and disturbing vulnerabilities that, in hindsight, make us ask, "Why didn't someone anticipate that?" Sergey Lozhkin of Kaspersky Lab reported stunning exposure in medical devices, focusing his attention on medical offices to determine how vulnerable their devices were. Once he picked his target hospital, he said, "I tested the WiFi network and finally I was able to connect to an MRI device and find personal information and [flaws] in the architecture. It was scary because it was really easy. The initial vector was the WiFi network; the network was not really as secure as it should be in such a place where you keep medical data.”
His reports end with an ominous vision. Lozhkin said, "The white-hat security researchers and the bad guys are deeply researching this area – car hacking, connected cars, medical devices, everything. For cyber criminals it could be a big market.”
Non-Safety Is Subtle
Physicians hold as a first principle, "Above all, do no harm." But safety failures in software can be subtle. A message delivered twice because of a network delay, or a one-phase commit into a database: we see these as abstract technical problems until the message turns out to be a dosage order for a dose-critical drug, or the database is left in an unrecognized, corrupt state yet remains fully operational. A pervasive situation is storing objects with state variables in a database: if those variables are changed in the database, the restored object may well be in an undefined state.
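That last failure mode can be sketched in a few lines. The order states and the tampered row below are hypothetical, invented only to illustrate the hazard: if a stored state value no longer maps to any state the code understands, the safest response is to refuse to restore the object rather than continue with one in an undefined state.

```python
from enum import Enum

class OrderState(Enum):
    # Hypothetical states for a dosage order.
    DRAFT = "DRAFT"
    VERIFIED = "VERIFIED"
    ADMINISTERED = "ADMINISTERED"

class DosageOrder:
    def __init__(self, drug: str, dose_mg: float, state: OrderState):
        self.drug = drug
        self.dose_mg = dose_mg
        self.state = state

def restore_order(row: dict) -> DosageOrder:
    """Rebuild an order from a stored row, refusing unrecognized states."""
    try:
        state = OrderState(row["state"])
    except ValueError:
        # The stored value matches no state the code knows about; the restored
        # object would be undefined, so fail loudly instead of guessing.
        raise RuntimeError(f"unsafe to restore order: unknown state {row['state']!r}")
    return DosageOrder(row["drug"], float(row["dose_mg"]), state)

# The application wrote "VERIFIED"; the column was later edited outside the application.
tampered_row = {"drug": "heparin", "dose_mg": "5000", "state": "VERIFED"}
try:
    restore_order(tampered_row)
except RuntimeError as err:
    print(err)
```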
Better testing is only part of the solution for safe software. A mindset of aggressive verification of safety is essential, but how do we determine whether our software has vulnerabilities?
It is not always self-evident, and often our view is clouded by a failure of imagination. For example, can a word processor cause harm?
Consider that you prepare a word-processor document containing rows of values separated by TAB characters. You save the document, and sometime later read it back into your word processor. The document is different now: a TAB character has been silently lost. Well, how dangerous can that be? The answer is in what I neglected to put in the original snippet. I left off the headers corresponding to the columns identified by the tabs. With the headers restored, every value after the missing TAB now sits under the wrong column.
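Here is a rough sketch of that failure; the headers, patient name, and dosage are hypothetical stand-ins, but the column shift is the point.

```python
# Hypothetical tab-separated content; names and values are invented for illustration.
HEADERS = ["Patient", "Drug", "Dose (mg)", "Route"]

original = "Chester\tWarfarin\t5\tOral"
reloaded = "Chester\tWarfarin 5\tOral"   # one TAB lost during the save/reload round trip

def show(row: str) -> None:
    cells = row.split("\t")
    cells += [""] * (len(HEADERS) - len(cells))   # pad short rows
    for header, value in zip(HEADERS, cells):
        print(f"{header:>10}: {value}")
    print()

show(original)   # every value sits under its intended header
show(reloaded)   # "Oral" now appears under "Dose (mg)", and "Route" is empty
```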
Sorry, Chester! Fixed in next release! (This example came from a U.S. Navy Test Plan for document fidelity. I have modified the scenario a bit for simplicity.)
Real-World Solutions
To determine safety we should not rely only on tests of our product code. Defect count and safety are independent: zero defects can still mean unsafe software, because a defect is defined relative to prescribed behavior, and our prescription may be incorrect. This is why so many defects are surprises: "How did that happen?", "That should never have happened," or "I never thought about that."
A valuable analysis technique we can use in our product development is Fault Tree Analysis (FTA). FTA produces visual models of the events or paths within a system that can cause a foreseeable, undesirable loss or the failure of a component. Conceptually, Fault Tree Analysis lets you identify end states that are not safe or intended and then work backward to specify the conditions that could produce each one.
FTA incorporates standard Boolean logic concepts (AND, OR, XOR) and others as 'gates.' Paths and events feed into these gates, which combine to produce the undesirable outcome. The example below takes severe headache as the target state and models the potential causes of that condition.
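Sketched in code, such a tree might look like the toy model below; the causes and gate structure are invented purely to show how AND and OR gates compose into a top event.

```python
# Toy fault tree: boolean gates combine basic events into the undesired top event.
# The events and their grouping are hypothetical, not a real medical analysis.

def OR(*branches: bool) -> bool:   # the output event occurs if any input occurs
    return any(branches)

def AND(*branches: bool) -> bool:  # the output event occurs only if all inputs occur
    return all(branches)

def severe_headache(dehydrated: bool, skipped_meals: bool,
                    high_stress: bool, poor_sleep: bool) -> bool:
    return OR(
        AND(dehydrated, skipped_meals),   # one causal path
        AND(high_stress, poor_sleep),     # another causal path
    )

# The FTA exercise is working backward from the top event to these basic events.
print(severe_headache(dehydrated=True, skipped_meals=True, high_stress=False, poor_sleep=False))  # True
print(severe_headache(dehydrated=True, skipped_meals=False, high_stress=False, poor_sleep=True))  # False
```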
FTA is a powerful technique best suited to completed designs with known end states, and it is founded on the philosophy that failure stems from hardware or software component failure. Fault tree diagrams are relatively easy for any technical person to create, and Microsoft Visio has a built-in FTA template. I have found Fault Tree diagrams plus UML State Machine diagrams invaluable for understanding what could go wrong in the products I have developed with my clients.
A complementary alternative to FTA was created by software safety expert Nancy Leveson. STAMP (System-Theoretic Accident Model and Processes) is a technique that focuses on systems and the interactions among components rather than on individual components themselves. Leveson's theme is that failures occur when the system gets into a hazardous state, which in turn occurs because of inadequate enforcement of the safety constraints on the system behavior.
Because it takes a system perspective, STAMP is not as simple as drawing Fault Tree diagrams. STAMP has rich semantics and supports multiple perspectives on levels of causation, including event chains, causal conditions, and the systemic contributors that create fault conditions. Because STAMP also incorporates constraints, hierarchical levels of control, and process models, it can be applied not only to software development but also to the process of product development and delivery.
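STAMP is a modeling approach rather than a coding technique, but its central idea of enforcing safety constraints on control actions can be hinted at in code. The infusion-pump controller and its limits below are hypothetical, invented only to show a command being checked against explicit constraints before it is issued.

```python
# Hypothetical sketch of enforcing a safety constraint on a control action.
MAX_RATE_ML_PER_HR = 500   # invented limit for an imaginary infusion pump

def issue_rate_command(requested_rate: float, pump_is_primed: bool):
    """Issue the control action only if the safety constraints hold."""
    if not pump_is_primed:
        return ("rejected", "constraint: pump must be primed before infusion starts")
    if requested_rate > MAX_RATE_ML_PER_HR:
        return ("rejected", f"constraint: rate must not exceed {MAX_RATE_ML_PER_HR} ml/hr")
    return ("issued", requested_rate)

print(issue_rate_command(1200, True))   # hazardous command blocked by the constraint
print(issue_rate_command(250, True))    # command within the constraints is issued
```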
Why Does This Apply to My Web Application?
Software product development has irreversibly evolved from single applications into delivering what could be called "softwares"—multiple, interacting units of functionality, in multiple languages, running on multiple platforms concurrently, each produced by different groups or companies. We live in a world of systems of systems. Interactions at interfaces are ubiquitous points of potential failure. We should embrace every tool that will give us additional clarity into the safety of what we release.
Safety concerns plague us from requirements specification through release. To confront these concerns we should not wait for a perfect answer: even a piecemeal approach can be invaluable if we focus on achieving additional safety.
UML state machine diagrams help us identify states or state changes that are dangerous or might introduce corruption. A common example is a "catch-all" state that is entered when programmers do not know what specific logic to execute. While a component is in this contrived state, an event or message could arrive that results in incorrect processing.
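A minimal sketch of that hazard, with hypothetical states and events: when an unexpected event drops the machine into the catch-all state, the safest behavior is to stop processing further events rather than handle them as if nothing were wrong.

```python
# Hypothetical state machine; unexpected events fall into a catch-all UNKNOWN state.
TRANSITIONS = {
    ("IDLE", "start"): "RUNNING",
    ("RUNNING", "pause"): "PAUSED",
    ("PAUSED", "resume"): "RUNNING",
    ("RUNNING", "stop"): "IDLE",
}

def step(state: str, event: str) -> str:
    if state == "UNKNOWN":
        # Processing events in a state nobody designed invites incorrect results.
        raise RuntimeError(f"event {event!r} received in UNKNOWN state; halting")
    return TRANSITIONS.get((state, event), "UNKNOWN")

state = "IDLE"
try:
    for event in ["start", "resume", "stop"]:   # "resume" is not valid while RUNNING
        state = step(state, event)
        print(f"{event} -> {state}")
except RuntimeError as err:
    print(err)
```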
Enumerating multithreading scenarios is always problematic. We should attempt a comprehensive analysis to determine critical regions, the possibility of race conditions, and the potential for deadlock and other multithreading anomalies.
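One classic illustration (not drawn from any particular product): two threads doing an unguarded read-modify-write on a shared counter form a critical region with a race, and a lock removes it. On some runs the unlocked version loses updates; the locked version never does.

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe_increment(n: int) -> None:
    global counter
    for _ in range(n):
        value = counter        # read
        counter = value + 1    # write; another thread may have updated in between

def safe_increment(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:             # the critical region is now exclusive
            counter += 1

def run(worker, n: int = 100_000) -> int:
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(n,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

print("unlocked:", run(unsafe_increment))  # may be less than 400000: lost updates
print("locked:  ", run(safe_increment))    # always 400000
```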
When our code detects an error, should we shut down the thread, the process, or the application? Or should we continue running in a degraded mode? We should list the relative benefit and risk of each possible decision, with our focus on safe operation.
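One way to make those decisions explicit is to record them as a policy. The error kinds and responses below are hypothetical; the point is that the benefit/risk judgment is written down and reviewable rather than buried in scattered exception handlers.

```python
# Hypothetical error-response policy; the classifications are illustrative only.
RESPONSES = {
    "sensor_noise": "degrade",            # keep running on the last good reading
    "config_mismatch": "stop_process",    # results can no longer be trusted
    "dose_table_corrupt": "shutdown",     # continuing risks direct harm
}

def respond(error_kind: str) -> str:
    # Unknown errors default to the most conservative action.
    action = RESPONSES.get(error_kind, "shutdown")
    print(f"{error_kind}: {action}")
    return action

for kind in ["sensor_noise", "config_mismatch", "dose_table_corrupt", "never_seen_before"]:
    respond(kind)
```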
The complexity of modern applications can be staggering, and connected-vehicle and medical-device hijacking are only the latest negative headlines our industry has faced. Some software products have already killed.
In the mid-1980s our industry was hit with its first infamous example of lethal software failure: the Therac-25. This radiation therapy machine massively overdosed six people with radiation.
The Therac-25's tragedy had many causes which included:
- Reuse of software from the predecessor model, the Therac-20.
- User-interface input field changes that were not registered to the control subsystem.
- Race conditions.
- Software-only control functions without hardware interlocks.
Three of the six overdosed patients died as a result of their exposure to the radiation from the machine.
Was this a consequence of a lack of testing? According to Leveson, the manufacturer performed extensive testing, but perhaps not the right testing: hardware simulators were the main test bed for the software. It is tempting to assume that if software has no bugs it must be safe, but that is patently untrue. In the middle of her report, Leveson makes a statement that applies as readily today as it did to the PDP-11 assembly code of the Therac-25:
The basic mistakes here involved poor software engineering practices and building a machine that relies on the software for safe operation. Furthermore, the particular coding error is not as important as the general unsafe design of the software overall.
The troubling part of this statement is that the overall design of a typical software product today is generally beyond our ability to understand with precision. We are confronted by a paradox: in Agile development we deliver small chunks of functionality so that we can more fully understand the requirements, design, and proper execution of the code. But as each chunk is added to its predecessors, the interactions in our system of systems grow almost exponentially, diminishing our understanding of, and trust in, what we have delivered.
Maturity Takes Time
FTA, STAMP, and UML models are tools to help us produce more reliable systems. Failure reports in our industry press suggest we should aggressively increase our due diligence on the safety of our products in addition to correct functionality. Vulnerable wireless vehicle connectivity, exposed medical devices on open hospital networks, baby monitors broadcasting adult voices, internet-enabled smart TVs that allow hackers to listen to your conversations… we have our work cut out for us to make our digital world even a little safer.
About the Author
Gary K Evans is an Agile transformation consultant and for 18 years has helped teams and organizations adopt Agile methods. He is a Certified Scrum Master, a SAFe 4 Program Consultant, and a Design Patterns expert, and he has coached dozens of projects using Scrum and Extreme Programming. He has spoken at over a dozen software conferences and for 6 years was a Contributing Editor for Software Development Magazine. He was a Judge for the prestigious Jolt Awards for Software Productivity for 14 years. He is an acknowledged expert in UML, object-oriented analysis and design, and Use Cases, and is an associate of Alistair Cockburn. He has authored over a dozen training courses including Pragmatic Business Analysis, Agile Project Management, Agile Development with Scrum, Pragmatic Use Case Writing, Service-Oriented Analysis, and Object-Oriented Analysis & Design.