Key Takeaways
- NotPetya caused substantial disruption, and the recovery effort highlighted numerous measures that could have made it quicker and cheaper
- We should expect future attacks to be more destructive, so recovery plans need to be more robust
- Moving from a trusted perimeter model to a zero-trust model provides a stronger defensive posture and sidesteps many of the issues with quarantine
Background
In June 2017 a cyberattack was launched against Ukraine, using a derivative of the Petya malware. The initial source is believed to have been a poisoned update to MeDoc, a Ukrainian accounting package. Within hours, the malware had spread from the Ukrainian offices of global companies across their worldwide networks, typically annihilating the entire Windows desktop and server environment in its wake. NotPetya appeared to be cryptolocker ransomware, like the WannaCry outbreak just weeks before, but it wasn’t: there was no ransom, and no amount of Bitcoin was getting back affected disks or the data on them. Documenting the cleanup effort and its impact on the operations of a number of large multinationals, Wired described what happened as ‘The Most Devastating Cyberattack in History.’
As we hit the second anniversary of NotPetya, this retrospective is based on the author’s personal involvement in the post-incident activities. In the immediate aftermath, it seemed like NotPetya could be the incident that would change the whole IT industry, but it wasn’t—pretty much all the lessons learned have been ignored.
It could have been much worse; next time it will be
NotPetya encrypted the C: drives of the Windows machines it infected. It didn’t touch D:, E:, or F: drives, and it didn’t affect Linux, Unix, mainframe, or midrange machines. The malware was apparently designed for desktops (running the MeDoc accounting package), so its appetite for destruction was limited. A simple iterator working through any additional drives and partitions would also have destroyed Exchange servers, SQL databases, and file servers; as it was, data living off the system disks was left untouched.
Ruining the system disks of entire fleets of Windows desktops and servers had a huge impact, but largely didn’t affect the ‘books and records’ of the companies concerned. That data was unscathed because it resided on other disks or non-Windows systems.
As we plan for the next attack and its collateral damage, these plans should consider not just restoring access to systems of record (which was the main activity post NotPetya), but also preserving those systems of record. We should expect malware that comprehensively destroys Windows systems and exploits vulnerabilities in other operating systems exposed to compromised Windows machines. Planning shouldn’t stop at PCs and servers: NAS filers, SANs, and networking equipment, unscathed by NotPetya, are also potential targets for destruction.
What can be trusted?
One thing that hampered initial recovery was the question of how the malware had propagated, and how long it had been latent. If we’re going to wind back the clock to a known point of safety before the attack, then how far back is that? Everything previously considered ‘known good’ is at best ‘known vulnerable,’ and at worst ‘infectious.’ Establishing how and when the compromise took place becomes a priority.
Getting the forensics to understand where the safe point lies can be a time-consuming process, and in the meantime any recovery is undermined by the potential for the malware to be re-released. There’s also a really ugly tradeoff between quarantine, needed to rebuild core systems and services, and access to systems of record, needed to keep the company running.
The root problem lies with the circles of trust established prior to the incident. Trusted networks and identity-management systems (like Active Directory) can no longer be trusted once compromised, but the qualities of that trust are normally a fundamental design choice impacting a huge array of operational considerations. Google (following its own compromise by Operation Aurora) switched to ‘BeyondCorp,’ a zero-trust model, which shares characteristics with the Jericho Forum approach of de-perimeterisation.
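To make the contrast with a trusted perimeter concrete, the sketch below shows the shape of a zero-trust authorisation check: every request is judged on a verifiable identity assertion and device posture, and the network it arrived from carries no weight. This is only an illustration; the token format and field names are assumptions, not anything prescribed by BeyondCorp.

```python
import hmac
import hashlib

# Shared secret used to sign access tokens. In practice this would be a proper
# PKI / identity provider rather than a single symmetric key (assumption for brevity).
SIGNING_KEY = b"example-signing-key"

def sign(payload: str) -> str:
    """Return an HMAC-SHA256 signature for a token payload."""
    return hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()

def authorize(request: dict) -> bool:
    """Zero-trust style check: trust is derived from the request itself,
    never from the network segment it originated on."""
    payload = request.get("token_payload", "")      # e.g. "user=alice;device=laptop-42"
    signature = request.get("token_signature", "")

    # 1. The identity assertion must verify cryptographically.
    if not hmac.compare_digest(sign(payload), signature):
        return False

    # 2. The device must attest to a healthy, patched posture.
    if not request.get("device_healthy", False):
        return False

    # 3. Source IP / VLAN is deliberately ignored; being 'inside' the
    #    corporate network confers no additional trust.
    return True

# A request from the 'trusted' LAN with no valid token is refused,
# while a well-attested request from anywhere is allowed.
payload = "user=alice;device=laptop-42"
print(authorize({"source_ip": "10.0.0.5", "device_healthy": True}))       # False
print(authorize({"token_payload": payload,
                 "token_signature": sign(payload),
                 "device_healthy": True,
                 "source_ip": "203.0.113.7"}))                            # True
```

Because no network location is privileged, a compromised machine on the ‘inside’ gains nothing from its position, which removes much of the need for quarantine during recovery.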
Recovery at scale
Backup and recovery arrangements are usually conceived around single-system failures in benign conditions. If it takes an hour or so to restore a disk, that’s probably not a problem. But, if you need to restore 5,000 servers, and they sequentially take an hour each, that’s 208 days—almost 30 weeks. How many backup servers do you have, and how much restoration can be parallelised?
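As a rough way to size the problem, the following sketch (the fleet size, per-restore time, and stream counts are illustrative assumptions) shows how the wall-clock restore time falls as restores are parallelised:

```python
import math

def total_restore_days(servers: int, hours_per_restore: float, parallel_streams: int) -> float:
    """Wall-clock days to restore a fleet when restores run in parallel batches."""
    batches = math.ceil(servers / parallel_streams)
    return batches * hours_per_restore / 24

# 5,000 servers at an hour each, for a range of parallel restore streams.
for streams in (1, 10, 50, 200):
    days = total_restore_days(servers=5_000, hours_per_restore=1.0, parallel_streams=streams)
    print(f"{streams:>4} parallel streams -> {days:6.1f} days")
```

Even with generous parallelism the restore window is measured in days rather than hours, which is why the approaches below matter.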
Traditional backup and restore isn’t designed for recovery at scale, even less so when the backup servers themselves have been impacted by the incident. However, there are two approaches that can work well:
- Rollback to snapshots: as systems have been increasingly virtualised, it’s become easy to take snapshots of known good configurations, and copy-on-write (COW) filesystems have made this reasonably efficient in terms of storage. Of course, snapshots can only be used for rollback if they’re taken in the first place. One of the tragedies of NotPetya was seeing freshly installed infrastructure that could have been restored in seconds trashed due to the lack of snapshots, because ‘that’s done by another team’ or ‘there wasn’t enough storage capacity.’ A minimal snapshot-and-rollback sketch follows this list.
- Repave via a continuous delivery pipeline: do this with a newer, more resilient version that’s invulnerable to the attack. This is a reminder that traditional ‘patch management’ is a parallel release-to-production pipeline, used because the mainstream mechanism is too slow.
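To make the snapshot approach concrete, here is a minimal sketch of automated snapshot and rollback. It assumes a ZFS dataset holding the machine images; the dataset name and snapshot naming scheme are illustrative, and the same pattern applies to hypervisor or SAN snapshots.

```python
import subprocess
from datetime import datetime, timezone

DATASET = "tank/vmstore"  # illustrative ZFS dataset holding VM images

def take_snapshot(dataset: str) -> str:
    """Create a timestamped ZFS snapshot and return its name."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    snapshot = f"{dataset}@known-good-{stamp}"
    subprocess.run(["zfs", "snapshot", snapshot], check=True)
    return snapshot

def rollback(snapshot: str) -> None:
    """Revert the dataset to a previously taken snapshot
    (-r also discards any snapshots taken after it)."""
    subprocess.run(["zfs", "rollback", "-r", snapshot], check=True)

if __name__ == "__main__":
    snap = take_snapshot(DATASET)
    print(f"Snapshot taken: {snap}")
    # After an incident, rollback(snap) restores the dataset in seconds.
```

Scheduled from a pipeline or cron job, this closes the ‘that’s done by another team’ gap, and on a copy-on-write filesystem the storage cost is only the blocks that change between snapshots.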
Sweating assets is a false economy
One hindrance to the NotPetya recovery for many companies was the use of old (often end-of-life) equipment. Hundreds of ancient servers were being used, when a handful of new ones would have done the job. Storage space for snapshots and virtual tape libraries (VTL) wasn’t available because SANs hadn’t been upgraded for years.
The driver for using old equipment was capital expenditure (CAPEX) and accountants thinking they were cleverly sweating more value from an often fully depreciated asset. The problem is that old equipment consumes much more power and hence runs up substantial operational expenditure (OPEX). Koomey’s law (a close cousin of Moore’s law) lets us figure out a break-even point between CAPEX and OPEX that typically drives a 3-year equipment refresh cycle. If servers and storage are older than 3 years, it’s likely that the electricity cost of your private computer-history museum is dwarfing the cost of the shiny new kit, which should be easier to maintain and simpler to recover.
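As a back-of-the-envelope illustration of that break-even point, the sketch below compares the annual power bill of an ageing fleet with the cost of consolidating onto newer, more efficient servers. Every figure is an assumption chosen for the arithmetic, not a measurement.

```python
# Illustrative CAPEX vs OPEX break-even; every figure here is an assumption.
OLD_SERVERS = 100            # ancient, fully depreciated machines
OLD_WATTS_EACH = 500         # assumed draw of an old server
NEW_SERVERS = 20             # newer machines doing the same work
NEW_WATTS_EACH = 350
NEW_SERVER_PRICE = 8_000     # purchase cost per new server (CAPEX)
POWER_PRICE_KWH = 0.15       # electricity price in $/kWh
HOURS_PER_YEAR = 24 * 365

def annual_power_cost(servers: int, watts: float) -> float:
    """Yearly electricity cost for a fleet, ignoring cooling overhead."""
    return servers * watts / 1000 * HOURS_PER_YEAR * POWER_PRICE_KWH

old_opex = annual_power_cost(OLD_SERVERS, OLD_WATTS_EACH)
new_opex = annual_power_cost(NEW_SERVERS, NEW_WATTS_EACH)
capex = NEW_SERVERS * NEW_SERVER_PRICE
payback_years = capex / (old_opex - new_opex)

print(f"Old fleet power cost: ${old_opex:,.0f}/year")
print(f"New fleet power cost: ${new_opex:,.0f}/year")
print(f"Refresh CAPEX:        ${capex:,.0f}")
print(f"Payback period:       {payback_years:.1f} years")
```

With these assumed figures the refresh pays for itself in under three years on electricity alone, before counting cooling, maintenance, and the recoverability benefits discussed above.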
Cyber insurance doesn’t cover ‘acts of war’
The post-NotPetya recovery actions were time-consuming and costly. Companies with cyber insurance policies made claims to cover those costs. In at least one major case (a $100m claim) the insurer invoked an exclusion for "hostile or warlike action in time of peace or war" by a "government or sovereign power." This resulted in a (still in progress) court case, which may require the insurer to prove attribution of the attack.
Ironically, cyber insurance may ultimately prove to be pivotal in ensuring that companies are better prepared for future attacks. It has long been postulated by the economics of information security community that insurance may achieve improvements in safety not achievable through direct legislation (as has happened, for example, with young riders being priced out of high-performance motorcycles). For that to happen, a stronger mandate for cyber policies is needed (which may have to be legislated), and insurers will have to become more active in defining the minimum standards they expect from policyholders (just as they do with window locks for home contents).
The future is about recoverability, not just defence
Each of the companies impacted by NotPetya (and WannaCry before it) had some degree of security protection in place: the usual stuff like firewalls, antivirus, and patch management. That defence obviously wasn’t perfect or the attack would have been thwarted, but a perfect defence costs $∞ and is therefore impractical. As we deal with the realities of an imperfect defence, it becomes necessary to choose between preventative and reactive measures. Security expert Bruce Schneier makes the point in posts under his ‘resilience’ tag: ‘Sometimes it makes more sense to spend money on mitigation than it does to spend it on prevention.’ An investment in mitigation can also pay off in all kinds of ways that have nothing to do with attacks: the change accidentally made to production when it should have gone to test can be fixed in seconds by reverting to the last snapshot.
Moving toward a zero-trust model
NotPetya is unlikely to keep its ‘most devastating cyberattack’ title for long. There will be another attack, and we should expect it to be worse. Moving away from a trusted network model to a zero-trust model is the most effective way to defend against such attacks, but effort should also go into measures that allow speedy recovery. Recoverability can be aided by moving to modern equipment and software, and generally there will be a sound business case to support that move.
About the Author
Chris Swan is Fellow, VP, CTO for Global Delivery at DXC.technology, where he leads the shift towards design for operations across the offering families, and the use of data to drive optimisation of customer transformation and service fulfilment. He was previously CTO for Global Infrastructure Services and General Manager for x86 and Distributed Compute at CSC. Before that he held CTO and Director of R&D roles at Cohesive Networks, UBS, Capital SCF and Credit Suisse, where he worked on app servers, compute grids, security, mobile, cloud, networking and containers.