
SeaMonkeys - Chaos in the War Room

Key Takeaways

  • Naval combat systems not only have to be robust to the rigours of the sea and the violent nature of war, but also at times the somewhat unpredictable nature of the system users. 
  • Chaos testing, that is deliberately inducing a failure into a system, is nothing new in engineering, but before automated randomised failure injection we needed other means.
  • One such method, used by the Royal Australian Navy Combat Data Systems Center to test combat systems, was to take advantage of regular training courses for junior sailors to allow randomised testing under close to live conditions.
  • This approach demonstrated the reliability of the hardware, improved the understanding of the system, and drove continuous improvement in critical areas such as documentation.
     

Naval warfare is no game. The stakes are high and things have to work under the most adverse conditions imaginable. You are battling not only adversaries, but the elements, constant motion, and the worst that nature can throw at you. The Combat System (sometimes called the Naval Tactical Data System) is a key element of the vessel's operational effectiveness and absolutely vital for self-defence.

A combat system, however, is not just a bunch of computers; it is a complex harmony of man and machine. These are life-critical systems: the behaviour must be defined in every situation, the system must be safe by default, and it must remain in that state for as long as possible when under stress.

Chaos Engineering is a means to achieve resilience under stress. This involves not just deliberately introducing failures into a system, but doing so in an unpredictable way, ideally in a live or live-like environment. Deliberately inducing a failure is nothing new in engineering, but before automated randomised failure injection we needed other means. Just like the systems themselves, those means were a combination of man and machine.
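To make the idea of automated randomised failure injection concrete, here is a minimal Python sketch. It is purely illustrative and not anything we ran at CDSC; the component names and the kill_component stub are invented for the example.

    import random
    import time

    # Hypothetical components of a system under test; the names are invented.
    COMPONENTS = ["memory_bank_b", "io_controller_2", "console_link", "radar_feed_sim"]

    def kill_component(name: str) -> None:
        # Stub for the injected fault, e.g. dropping a link or killing a process.
        print(f"Injecting failure into {name}")

    def chaos_run(duration_s: float, mean_interval_s: float) -> None:
        # Inject failures into randomly chosen components at random intervals.
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            # Exponentially distributed gaps make the timing hard to anticipate.
            time.sleep(random.expovariate(1.0 / mean_interval_s))
            kill_component(random.choice(COMPONENTS))

    chaos_run(duration_s=3600, mean_interval_s=600)

The essential property is the same one we relied on manually: neither the target nor the timing is known in advance to the people watching the system.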

From 1974 to 2006 the Royal Australian Navy (RAN) Combat Data Systems Center (CDSC) maintained the systems responsible for the combat capabilities of the RAN frigates and destroyers. Its remit included testing and development of these systems, as well as all aspects of training, from maintenance through to end-user operation. Testing such critical systems, however, was a key element of the work.

In order to test and train, the CDSC maintained full operational suites of what could be found on board ship, including simulators for external components that it didn’t make sense to maintain locally. (A 3-inch gun isn’t something you typically keep in your server room, although I know a few who would like to think they had a use for one.)

The systems I am referring to are retired from service, long since gone from the RAN; however, they were impressive feats of engineering. They were, to put it mildly, robust. As well as the rigours of being at sea, they were designed to survive combat, or at least to survive long enough to do everything possible to ensure the survival of the ship's crew. They were literally built to take a bullet and keep running.

Whilst there were many computers and connected systems, the heart of the Combat System was the AN/UYK-43, lovingly referred to as the ‘yuk-43’. It was a 32-bit computer featuring active redundancy, including multiple redundant memory banks, CPUs, I/O controllers, I/O busses, power and cooling. On top of that, the software was designed to cope with failures and switch modes when needed.
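The actual failover logic isn't documented here, and the original was written in CMS-2 and Assembly rather than anything modern, but conceptually this kind of active redundancy comes down to detecting a failed unit and switching to its standby. A highly simplified sketch, with invented names, purely as an analogy:

    class RedundantUnit:
        # A primary/standby pair, e.g. a CPU or memory bank; purely illustrative.
        def __init__(self, name: str):
            self.name = name
            self.active = "A"                      # bank currently in use
            self.healthy = {"A": True, "B": True}  # set False on a detected fault

        def heartbeat(self) -> bool:
            # Stand-in for a hardware status poll.
            return self.healthy[self.active]

        def failover(self) -> None:
            standby = "B" if self.active == "A" else "A"
            if not self.healthy[standby]:
                raise RuntimeError(f"{self.name}: no healthy bank remaining")
            print(f"{self.name}: switching from bank {self.active} to {standby}")
            self.active = standby

    def supervise(units):
        # Degrade gracefully: any unit whose active bank fails moves to standby.
        for unit in units:
            if not unit.heartbeat():
                unit.failover()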

The ‘yuk-43’ was the size of a refrigerator, weighed about 1,500 lbs and needed about 4.5 kW to run (HP-9020C/AN/UYK-43 Study). Only around 1,200 were made, and they are now probably worth their weight in gold for spare parts. Whilst the RAN no longer uses these machines, they are probably still in active service with the US Navy and others.

But all of this, as a system, needed testing. A million test scripts would never be enough to catch every possibility, every input, every variable that could arise. Testing a new patch or release could take weeks or months of effort involving half a dozen or more people. Full-cycle testing included not only harbour tests on board at least one vessel, but also at-sea testing, potentially with multiple vessels and aircraft. Needless to say, this was difficult to coordinate and a very serious investment.

The software we were working with at the time would today be considered laughably small. In most situations there were only 64K 32-bit words of memory available. However, you can do a lot with that memory when you are not dealing with a general-purpose operating system and you don’t have any bit-mapped screens to drive. There were no command-line terminals, no graphic displays, typically no keyboards. Consoles were dedicated, built to meet the specific needs of the system; they didn’t have to move megabytes of video around. They had simplified keypads and trackballs, and imagery was produced in the console itself from data points, typically overlaid on the output of radar or other systems.

Our browseable code repository was shelves and shelves of three-ring binders of dot-matrix printouts. The listings were primarily comments, because the code itself was typically written in a language called CMS-2, a procedural language sitting somewhere between Assembly and BASIC; it was not easy to follow, and there was none of the modern tooling for code navigation or analysis. There was also considerable Assembly code in critical areas such as the schedulers.

Another means of testing ‘in production’ was developed to drive improvements and capture issues that might be difficult to ‘script’.

Training courses cycled through the organisation regularly; sailors would be trained on the various aspects of the systems, including their operation under simulated conditions. This provided an opportunity not only to observe the system in use, but also to ‘mess with the program’. We had the perfect situation in which to inject some chaos and observe the consequences.

By using these training courses we could test under close-to-‘live’ conditions. Whilst we had a set of scripts we would run in a standardised fashion, we also experimented with some more unpredictable methods.

Before a training session started we would take a junior sailor who was yet to be trained on the systems under test and do the following:

  • Show them into the computer room and explain the safety protocols.
  • Show them the computer systems: how the latches worked, how to unplug modules safely (they were heavy).
  • Show them the I/O panels. These were a wall of rotary switches that connected the computer to various systems, in our case a mix of simulators, other computers and the consoles the trainees would be using.
  • Take care to show them the identifiers on all of these things, how everything was labelled, and the importance of those labels.

We were careful not to explain what any of this really did. The sailors had a real mix of education; some of them had a basic notion of what was going on, some did not have a clue, and we had to be careful to explain the mechanics of the situation. “No, don’t throw the modules on the ground, or the Chief will not be happy….” Often they asked questions; we’d say we would talk to them and explain when it was done.

The process would be laid out like this (a small sketch of the dice timing in code follows the list):

  • After the introduction we would give them a notepad, a watch, two dice, a start time and an end time.
  • They would undertake a set of steps, isolated from the training room and from the engineers monitoring the systems.
  • At the start time they would roll one die; the number that came up gave the 10-minute slot in which we wanted them to do something.
  • Then they would roll the two dice; the sum gave the minute within the slot (if the sum was more than 10, they rolled again).
  • Once the allotted time was reached, they were to write down the time.
  • They then had to perform an action: pull a module, flip a switch, hit a button - whatever they wanted to do. We were not to know, and we would not be present.
  • Finally, they had to write down the time again, followed by a list of the actions they took between those two times.
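The dice scheme is easy to simulate. Here is a small Python sketch of the timing logic as described above, assuming the start and end times were an hour apart (six 10-minute slots); the real exercise, of course, used nothing more than dice and a notepad.

    import random

    def roll_die() -> int:
        return random.randint(1, 6)

    def action_offset_minutes() -> int:
        # One die picks which 10-minute slot (1-6) the action falls in.
        slot = roll_die()
        # Two dice pick the minute within the slot; sums over 10 are re-rolled.
        # Note that two dice can never sum to less than 2, so the first minute
        # or so of each slot can never be chosen.
        minute = roll_die() + roll_die()
        while minute > 10:
            minute = roll_die() + roll_die()
        return (slot - 1) * 10 + minute

    print(f"Act {action_offset_minutes()} minutes after the start time")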

In theory we were blind to what action was going to be taken, or when it would be taken; in reality, also to whether any action would be taken at all. On one occasion the sailor fell asleep, on another he wandered off: the actions were not reliable. Surprisingly, only once in my memory did any of them forget to write down the steps they had taken.

During the testing period we would be in the console room with the trainers; we’d observe the teams running through their simulations and be on the lookout for anomalies. Our ability to monitor the internals in real time was pretty limited. These systems had limited diagnostics, and what you could get took months of use to really understand (how good are your octal-reading skills?), so we relied on familiarity with the systems, and with what was normal, to catch anything in real time. For the most part we relied on trawling through what we were able to record from specialised gear that monitored the various interfaces and consoles.

During the course of my time, I saw at least a dozen runs of this style of testing. To a new engineer its effectiveness seemed minimal and hardly worth the effort, but being wet behind the ears I missed many of the benefits:

  1. We proved the robustness and resilience of the system hardware. During that period I remember only one trouble report related to hardware behaviour - an aging switch panel element that needed replacing.
  2. We identified trouble reports that were never likely to have been captured through normal testing scenarios. Whilst none were deemed critical, some caused confusion for the operator - any confusion in the heat of the moment could have dire consequences.
  3. We improved our understanding of the system. In life-critical systems understanding is vital, having documentation even more so.
  4. We introduced an element of continuous improvement that otherwise would have been lost. Because we were exercising the system more, observing more and testing more, we improved documentation and processes, and increased our confidence in both as we went into full testing cycles.

I failed to fully appreciate what I was involved with at the time, both in terms of the technology and the efforts the organisation went through to improve things in such a rigid environment. This has a distinct link to the SRE movement, particularly to those bringing learning from other domains, often safety-critical domains, into modern software development and operations. I’m thinking in particular of the work of David Woods in characterising the true nature of a system and acknowledging the human element, as well as that of Nora Jones and John Allspaw, among many others.

One element that has only in the last few years come to the forefront of my mind is human-machine interaction safety. In knowledge work most of our users are educated people: they are typically at least high-school educated, and more often than not have a degree or some level of higher education. But these systems had to be operated by young sailors, many of whom had left school at age 16, often had limited technical literacy, and quite often struggled with complex tasks. The systems not only had to be robust to the rigours of the sea and the violent nature of war, but also, at times, to the somewhat unpredictable nature of the system users.

Our testing during training was in reality twofold: our efforts at generating chaos at the system level, and the uncertain, unpredictable responses of untrained system users.

It has made me ask myself repeatedly since: what can I learn from this situation, and if it were repeated, how could I make adjustments in order to learn more, or better refine what I learnt?

About the Author

Glen Ford is ever learning. Currently building e-commerce capabilities in the beauty industry at COTY, he was formerly CTO of Beamly, an architect at Unibet, and lead engineer in the BBC R&D Prototyping Group. Glen has experienced the highs and lows of startups, as well as surviving the corporate worlds of the telecoms and defence industries.
