Investigating near misses by gathering data from the field and exploring anything that looks wrong or is a bit odd can help to prevent disasters, said Ed Holland, software development manager at Metaswitch Networks. At QCon London 2019 he gave a talk about avoiding being in the news by investigating near misses.
"We are all over crashes and making sure they never happen again the same way", said Holland, "but we also need to investigate near misses. In order to investigate those we need to instrument our product, monitor those stats, and dig into everything that doesn’t look right, he said."
Holland mentioned that anomalies were often simply behaviours they were not expecting, but which were completely legitimate and added to their understanding of how the product was being used. A decent number of cases, though, were the result of misconfigurations or product bugs.
Holland gave some examples of problems they investigated which had little or no immediate customer visibility or impact, but which, once fixed, prevented more severe problems later:
Bad load balancing: On some types of compute nodes, load varied significantly. Tracking this down was straightforward using the statistics graphs, which showed suboptimal behaviour in third-party equipment, including a DNS cache resetting every eight hours. Fixing this prevented premature overload behaviour.
Spikes in quiescing: Our stats showed odd spikes in errors shortly after we handled an exception and entered quiescing behaviour. Digging through call flow diagnostics showed some calls (legitimately) being retried to the quiesced node. This resulted in a code fix, and although the loss of service in this case was small, the fix had wider impact, for example in upgrade processing.
Performance modelling: By using linear regression on field data we created a model of the load from various operations. Applying that model to the different sites, we were able to show that one OpenStack rig was poorly configured compared to the others (a sketch of this kind of model follows below).
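To make the performance-modelling example concrete, here is a minimal sketch of fitting a linear load model to field data and flagging a site whose observed load deviates from the prediction. The data values, operation types, and variable names are purely illustrative assumptions, not Metaswitch's actual model.

```python
import numpy as np

# Hypothetical field data: per-interval counts of each operation type
# (columns: registrations, calls, messages) and measured CPU load.
op_counts = np.array([
    [1200, 300, 5000],
    [ 900, 450, 4200],
    [1500, 380, 6100],
    [1100, 290, 4800],
    [1350, 410, 5600],
    [ 800, 260, 3900],
])
cpu_load = np.array([0.62, 0.55, 0.74, 0.59, 0.70, 0.48])

# Fit per-operation cost coefficients plus a constant baseline by least
# squares: load ~= baseline + sum(cost_i * count_i).
X = np.column_stack([np.ones(len(op_counts)), op_counts])
coeffs, *_ = np.linalg.lstsq(X, cpu_load, rcond=None)

# Apply the model to another site; a large residual between predicted and
# observed load is the kind of anomaly that points at a misconfigured rig.
site_counts = np.array([1300, 320, 5400])
predicted = coeffs[0] + site_counts @ coeffs[1:]
observed = 0.85  # measured load at that site
print(f"predicted {predicted:.2f}, observed {observed:.2f}, "
      f"residual {observed - predicted:+.2f}")
```

The design choice here is deliberately simple: a linear model is easy to fit from routine field statistics and easy to explain, and any site whose residual stands out from the rest becomes a candidate for investigation.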
Every anomaly is interesting and tells a story, said Holland. Often the story is simple: the customer did something that was completely sensible, but that we were not expecting, he said. But enough of the time the story is that there is a bug in the way the system is set up, or with the software itself.
The key is to be proactive in gathering and looking, in detail, at the anomalous data coming back from the field. This used to be an onerous task, but relatively recently has become much easier to implement using modern instrumentation techniques, said Holland. Investigating everything, which has been referred to as investigating "dynamic non-events", gives you a deep understanding of the way your product behaves in the field. That understanding means that instead of waiting for a disaster and then investigating/fixing, you have a much better chance of predicting and fixing problems before they occur. Moreover, as this approach is based strongly in reality, it is a more cost-effective way to achieve a more stable product than, for example, additional lab testing, argued Holland.
InfoQ interviewed Ed Holland about what made them decide to investigate near misses, collecting reliable data, and how organisations can prepare themselves to investigate near misses.
InfoQ: What made you decide to start investigating near misses?
Ed Holland: Our customer was planning to migrate 10M mobile subscribers in the space of a month or so.
Having done risk analysis and pre-mortems, additional testing, and fixing, it still didn’t feel like we’d done enough.
That scale and pace was far beyond what we’d dealt with before, and so we needed information we could use to avert problems before they occurred. That meant investigating anything and everything, problem or not.
InfoQ: What can be done to collect sufficient and reliable data?
Holland: Collecting, storing and visualising data reliably, and at scale, is now fairly straightforward using open source technologies like InfluxDB, Prometheus and Grafana. The problem then reduces to the extent of the changes in your product code. In a perfect world I’d have stats for literally everything, counting in and out every request and response (by error code, for example), and then also detailing the internal workings of the algorithms (like the overload algorithm).
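As an illustration of the kind of instrumentation Holland describes — counting every request in and every response out, labelled by error code — here is a minimal sketch using the Python prometheus_client library. The metric names, labels, and the INVITE example are assumptions made for illustration, not Metaswitch's actual code.

```python
import random
import time

from prometheus_client import Counter, start_http_server

# Count requests in and responses out, labelled so that dashboards (e.g. in
# Grafana) can surface anomalies such as unexpected spikes in error codes.
REQUESTS_IN = Counter("requests_in_total", "Requests received", ["method"])
RESPONSES_OUT = Counter("responses_out_total", "Responses sent", ["method", "code"])

def handle_request(method: str) -> str:
    REQUESTS_IN.labels(method=method).inc()
    # Stand-in for real request handling; occasionally return an error code.
    code = "200" if random.random() > 0.05 else "503"
    RESPONSES_OUT.labels(method=method, code=code).inc()
    return code

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for Prometheus to scrape
    while True:
        handle_request("INVITE")
        time.sleep(0.1)
```

With counters like these scraped into Prometheus and graphed, the "dig into everything that doesn't look right" step becomes a matter of spotting a label combination whose rate does not match expectations.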
InfoQ: How can organisations prepare themselves to be able to investigate near misses?
Holland: There are two things organisations need in order to investigate near misses. One is a willingness to invest heavily in the instrumentation of product code, and in the peripheral components that allow the storage and visualisation of the data. The other is a mindset shift from reactive investigation of tickets raised by the customer to proactive investigation by someone knowledgeable who understands the product deeply and can spot things that aren’t right.
These changes require something of a leap of faith from budget owners. There is no obvious immediate gain like there is for resolving tickets, implementing features, or investing in development infrastructure. Instead, the ask is for funding to gain a better understanding of the product and its field usage, in order to reduce future costs.