This is a follow-up to 'Meltdown and Spectre: What They Are and How to Deal with Them', taking a deeper look at: the characteristics of the vulnerability and potential attacks, why it's necessary to patch cloud virtual machines even though the cloud service providers have already applied patches, the nature of the performance impact and how it's affecting real-world applications, the need for threat modelling, the role of antivirus, how hardware is affected, and what's likely to change in the long term.
Spectre and Meltdown are examples of side-channel attacks, and the Project Zero blog post provides a thorough but complex explanation of the specifics. Raspberry Pi founder Eben Upton simplifies things in his blog post explaining why Pis aren't affected, "Meltdown and Spectre are examples of what happens when we reason about security in the context of that abstraction, and then encounter minor discrepancies between the abstraction and reality". Graham Sutherland gets to the heart of things in his Twitter thread explainer:
There are two important exclusions to the rollback of side-effects: cache and branch prediction history. These generally aren't rolled back because speculative execution is a performance feature, and rolling back cache and BHB contents would generally hurt performance.
This also points to why there's a performance hit introduced by the various patches to deal with these issues, as they're essentially making sure that cache and branch prediction history are rolled back at times when a side-channel attack might be staged. For an entirely non-technical explanation, Joe Fitz has come up with a books-in-a-library analogy.
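The exclusion Sutherland describes can be sketched as a toy model. This is not exploit code and involves no real CPU behaviour: the `ToyCPU` class, its `cached_lines` set and the `leak` helper are all hypothetical names invented for illustration. The model only shows the shape of the problem: a forbidden read is rolled back architecturally, but the record of which cache line it touched survives and can be observed afterwards.

```python
# Toy model only: no real CPU, cache, or speculation is involved.
# All names here are hypothetical and exist purely for illustration.


class ToyCPU:
    def __init__(self, secret: bytes):
        self._secret = secret       # memory the "attacker" may not read directly
        self.register = 0           # architectural state (rolled back)
        self.cached_lines = set()   # microarchitectural state (not rolled back)

    def speculative_read(self, index: int) -> None:
        """Perform a forbidden read speculatively, then roll it back."""
        saved_register = self.register
        value = self._secret[index]     # transient value, never committed
        self.register = value
        self.cached_lines.add(value)    # dependent load touches probe line `value`
        # The permission check fails, so architectural state is restored...
        self.register = saved_register
        # ...but cached_lines is deliberately left untouched.

    def probe(self) -> list:
        """Report which probe lines would be 'fast', then flush the cache."""
        hits = sorted(self.cached_lines)
        self.cached_lines.clear()
        return hits


def leak(cpu: ToyCPU, length: int) -> bytes:
    """Recover `length` secret bytes using only speculative_read and probe."""
    recovered = bytearray()
    for i in range(length):
        cpu.speculative_read(i)
        recovered.append(cpu.probe()[0])
    return bytes(recovered)


if __name__ == "__main__":
    cpu = ToyCPU(b"hunter2")
    print(leak(cpu, 7))     # the secret, recovered via cache state alone
    print(cpu.register)     # still 0: the architectural rollback worked
```

In this model, clearing `cached_lines` inside `speculative_read` would close the channel, at the cost of extra work on every speculative read. That is a loose analogue of the trade-off between mitigation and performance described above.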
Daniel Miessler provides a useful chart as part of his 'A Simple Explanation of the Differences Between Meltdown and Spectre':
The point about patching being more nuanced for Spectre hints at the name choice. Meltdown is relatively straightforward to deal with (even if there is a performance hit), whilst Spectre is a category of issues (with more perhaps to be discovered) that are likely to haunt us for some time.
Google, AWS and Azure have all made statements that amount to "we've (mostly) finished with patching our cloud, now it's your turn to patch your guest OS". Unfortunately those explanations haven't provided much detail on why it's necessary to patch both the hypervisor and the guest OS. Security researcher Katie Moussouris highlighted this extract from Robert O'Callahan's blog post: "It's important for the CPU vendors and the cloud vendors to say exactly what mitigations they have deployed, what attacks they are not mitigating, and what parts of the problem they expect their downstream customers to take responsibility for". The AWS announcement does explicitly state, "Customers' instances are protected against these threats from other instances". The implication is that VM operating systems still need to be patched to prevent attacks between apps or against the kernel within a given VM, and this was confirmed by Amazon's Richard Harvey.
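For Linux guests, one way to see what the guest kernel itself reports is sketched below. It assumes a kernel recent enough to expose the sysfs directory `/sys/devices/system/cpu/vulnerabilities`, an interface that none of the statements quoted above mention. Older or unpatched kernels may not have it, in which case the function returns an empty result. The `mitigation_status` helper name is hypothetical, and the status strings are whatever the kernel chooses to report.

```python
from pathlib import Path

# Assumption: the guest runs a Linux kernel that exposes this sysfs directory.
SYSFS_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def mitigation_status(base: Path = SYSFS_DIR) -> dict:
    """Map each vulnerability file name to the status line the kernel reports.

    Returns an empty dict if the directory does not exist.
    """
    if not base.is_dir():
        return {}
    status = {}
    for entry in sorted(base.iterdir()):
        if entry.is_file():
            try:
                status[entry.name] = entry.read_text().strip()
            except OSError:
                status[entry.name] = "unreadable"
    return status


if __name__ == "__main__":
    report = mitigation_status()
    if not report:
        print("No vulnerability report exposed by this kernel")
    for name, state in report.items():
        print(f"{name}: {state}")
```

An empty result does not prove the guest is unpatched, and a reported mitigation says nothing about the hypervisor underneath; it only reflects what the guest kernel knows about itself.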
The performance impact of the patches for these issues comes when applications are making syscalls to get the OS kernel to do something on their behalf. Loopback tests therefore present the worst case, as they spend all of their time doing syscalls and none of their time doing useful work. Intel provided a rollup of statements from Apple, Microsoft, Amazon and Google that generally downplay the impact of the patches. Cloudflare's CTO John Graham-Cumming observed, "We continue to test various patches for #meltdown and #spectre but impact on @Cloudflare infrastructure appears to be negligible.", and Aeron creator Martin Thompson noted, "On Windows Aeron is doing surprisingly well with Intel bug patch. Seems the increase in syscall cost is causing Aeron to batch more and thus increase throughput.", though in the latter case there was an impact on latency. There are, however, reports emerging of more severe performance impact. syslog-ng evangelist Peter Czanik found that compile times on Fedora went up from four minutes to 21 (though it was less bad on OpenSUSE). Branch director of engineering Ian Chan observed "The #Meltdown patch (presumably) being applied to the underlying AWS EC2 hypervisor on some of our production Kafka brokers [d2.xlarge]. Ranges from 5-20% relative CPU increase. Ooof."
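A rough way to see how syscall-bound a workload is, and so how exposed it might be to the patches, is to time a loop that enters the kernel on every iteration against one that stays in user space. The sketch below is an assumption-laden micro-benchmark rather than a rigorous test. `os.stat(".")` stands in for "any syscall", the function names are hypothetical, and Python interpreter overhead is large enough that the kernel's share of each call is understated compared with a C benchmark.

```python
import os
import time


def time_loop(fn, iterations: int) -> float:
    """Return the wall-clock seconds taken to call fn() `iterations` times."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return time.perf_counter() - start


def userland_noop():
    """Baseline: a call that never leaves user space."""
    return None


def syscall_heavy():
    """Each call asks the kernel for file metadata, crossing into kernel mode."""
    return os.stat(".")


def measure(iterations: int = 200_000) -> dict:
    """Time both loops and report per-call costs in microseconds."""
    user_s = time_loop(userland_noop, iterations)
    syscall_s = time_loop(syscall_heavy, iterations)
    return {
        "user_us_per_call": user_s / iterations * 1e6,
        "syscall_us_per_call": syscall_s / iterations * 1e6,
    }


if __name__ == "__main__":
    for name, value in measure().items():
        print(f"{name}: {value:.3f}")
```

Running this on the same machine before and after patching gives a crude indication of any change in per-syscall cost. An application that makes few syscalls per unit of useful work should see correspondingly less impact, which is consistent with the mixed reports above.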
Microsoft's Jess Frazelle makes a key point about threat modelling, "Think about your threat models before thinking the sky is falling. The bugs require access to the machine. Think about if you are running code you don't control... that's basically your concern". Most server-based systems would have to already be compromised in some other way for Spectre or Meltdown to be exploited, and even then the bugs don't provide a direct means to privilege escalation (though that may be trivial if root passwords can be found). End user devices are more at risk, as the use of web browsers is pretty much ubiquitous, and JavaScript in ads etc. provides a route to compromise, which is why browsers are being patched in addition to hypervisors and operating systems.
Antivirus (AV) also has a part to play. At the moment Microsoft is checking that AV vendors have set a registry key to indicate that it's safe to apply patches, as otherwise it's possible for AV to attempt memory accesses that clash with the patches and crash Windows. In time, host intrusion protection systems (HIPS), a category with substantial overlap with AV, are likely to spot attempts at exploiting Spectre.
Intel has already released firmware updates to OEMs that include CPU microcode fixes, though clearly the rollout of software patches shows that these updates aren't sufficient on their own. It will take time for those updates to propagate through to end users, and Microsoft serves as an early example with updates to their Surface hardware. AMD's statement downplays the vulnerability of their CPUs, suggesting that 'negligible performance impact' is expected from the patches. ARM provides a table noting which of their architectures are affected, and their whitepaper has been broadly complimented for its concise, thorough and mostly vendor-agnostic explanation of the issues and remediation approaches. Curiously the ARM-based Apple A series CPUs as used in iPhones and iPads are also affected by Meltdown, suggesting that they've borrowed from x86 in how they approach speculative execution. ARM have also described a 'variant 3a' Meltdown-type attack that can be used against some of their core designs, and proof-of-concept attack code is available. The CSDB barrier operation described in the whitepaper to mitigate attacks is already present in ARM CPUs, suggesting that the possibility of such attacks was anticipated, but the mitigation overhead was deferred until the attacks materialised.
Taking a long term view, two key points emerge. Firstly, it's likely that Spectre will present risk for some time - at least until present hardware is replaced, which for embedded systems could be decades away. Secondly, it's possibly too late to fix the next generation of CPUs, at least not without major delays to their release cycle. This saga highlights that CPU protection rings are really just another abstraction - painted lines and a referee with a whistle rather than hard walls. Perhaps the generation of CPUs that address these issues properly will have to be designed with hard walls for protection rings. As Joe Fitz notes - "We may need to go back to the drawing board".