CrowdStrike, an American cybersecurity technology company, recently released a product update that bricked an estimated 8.5 million Windows computers globally, affecting businesses, individual users, and software companies. CrowdStrike provides cloud workload protection, endpoint security, threat intelligence, and cyberattack response services that secure critical areas of risk and prevent breaches.
The problematic update affected core components of CrowdStrike's Falcon agent, a critical piece of software designed to detect and prevent security threats. According to initial investigations, the issue stemmed from a conflict between the update and low-level Windows system files. Other operating systems, such as macOS and Linux, were not affected.
Specifically, the update caused an incompatibility with the Windows kernel, the core part of the operating system responsible for managing hardware and system resources. This incompatibility led to a failure in the boot sequence, resulting in what is commonly known as a "bricked" machine — a device that cannot start up or function.
On one of the many Reddit threads, a respondent explained:
Crowdstrike pushed an "unskippable" update to all of their phone-home endpoints. Anyone set with an N-1 or N-2 configuration (where N represents the most recent version of the software, and the -# is how many versions behind someone chooses to be) had that option ignored.
This is logical for this product in some sense. A 0-day fix needs to be propagated immediately. Being N-1 on a 0-day is not wise.
Everyone believed that CrowdStrike was doing its due diligence in staging before pushing it out to the rest of the world. Obviously, someone in CrowdStrike skipped a step. Whatever approval/implementation system they used failed them. Anyone using the CrowdStrike program got the update and died. "Blue Screen of Death (BSOD) as a Service."
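The N-1/N-2 policy the commenter describes, where a tenant deliberately stays one or two releases behind the newest sensor version, can be sketched as simple version-selection logic. The release list and function below are purely illustrative and not CrowdStrike's actual update mechanism; the point is that the faulty content update bypassed this kind of pin entirely.

```python
# Hypothetical sketch of an N-1 / N-2 update policy: a tenant pins itself
# one or two releases behind the newest sensor version (N). The incident
# showed that the pushed content update ignored this pin.

RELEASES = ["7.13", "7.14", "7.15", "7.16"]  # oldest -> newest (illustrative)

def pinned_version(releases: list[str], versions_behind: int) -> str:
    """Return the release a tenant on an N-minus-`versions_behind` policy expects to run."""
    index = len(releases) - 1 - versions_behind
    return releases[max(index, 0)]

print(pinned_version(RELEASES, 0))  # N   -> 7.16
print(pinned_version(RELEASES, 1))  # N-1 -> 7.15
print(pinned_version(RELEASES, 2))  # N-2 -> 7.14
```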
In addition, a respondent on a Hacker News thread wrote:
This is a global multi-layer failure: Microsoft allowing kernel mods by third-party software, CrowdStrike not testing this, DevSecOps not doing a staged/canary deployment, half the world running the same OS, things that should not be connected to the internet but are by default. Microsoft and CrowdStrike drove a horse and a cart through all redundancy and failover designs and showed very clearly where no such designs were in place.
CrowdStrike responded swiftly by halting the update's rollout and working on a patch to resolve the issue. The company provided detailed instructions for affected users to restore functionality, including booting into Safe Mode and removing the problematic update, which meant significant manual effort. In a Reddit thread on the CrowdStrike BSOD issue, a respondent wrote:
I am very interested in the scale of resolving this globally because if it's causing hardware to boot-loop with BSODs, you won't be able to deploy a patch/script to fix it. We're going to have to go to every boot-looping machine and manually fix it!
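CrowdStrike's published workaround for affected hosts was to boot into Safe Mode or the Windows Recovery Environment and delete the faulty channel file, the files matching C-00000291*.sys under the CrowdStrike drivers directory. The Python snippet below is only a minimal sketch of that cleanup step, assuming the machine can already reach a working environment; in practice the fix was applied by hand or with recovery tooling.

```python
# Sketch of the cleanup step from CrowdStrike's published workaround:
# after booting into Safe Mode / WinRE, remove the channel file(s)
# matching C-00000291*.sys. Illustrative only; the real remediation was
# typically done manually or via Microsoft's recovery tool.

from pathlib import Path

DRIVERS_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

def remove_faulty_channel_files(drivers_dir: Path = DRIVERS_DIR) -> int:
    """Delete channel files matching the known-bad pattern and return the count removed."""
    removed = 0
    for channel_file in drivers_dir.glob("C-00000291*.sys"):
        channel_file.unlink()
        removed += 1
    return removed

if __name__ == "__main__":
    print(f"Removed {remove_faulty_channel_files()} faulty channel file(s)")
```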
Furthermore, Microsoft released a recovery tool to help IT admins with the repair process.
Shyam Sundar, Cloud Architect at Novac Technology Solutions, concluded in a Medium blog post detailing the CrowdStrike BSOD disaster:
This has been a disaster of monumental proportions for many businesses worldwide. We are yet to see what measures companies will take to prevent such incidents from happening again. Some A/B testing or staggered rollout would likely have prevented such a massive outage.
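The staggered rollout Sundar mentions, like the staged/canary deployment the Hacker News commenter points to, amounts to pushing an update to progressively larger rings of hosts and halting as soon as a ring looks unhealthy. The ring sizes, health check, and names in the sketch below are assumptions for illustration, not a description of CrowdStrike's actual pipeline.

```python
# Illustrative sketch of a staggered (canary) rollout: a content update goes
# to progressively larger rings of hosts, and the rollout halts as soon as a
# ring looks unhealthy. Ring sizes and the health check are assumptions.

ROLLOUT_RINGS = [0.01, 0.05, 0.25, 1.00]  # cumulative fraction of the fleet

def ring_is_healthy(hosts: list[str]) -> bool:
    """Placeholder health check; a real one would watch crash and boot telemetry."""
    return True

def staged_rollout(update_id: str, fleet: list[str]) -> bool:
    deployed = 0
    for fraction in ROLLOUT_RINGS:
        target = int(len(fleet) * fraction)
        ring = fleet[deployed:target]
        deployed = target
        # push_update(update_id, ring) would happen here
        if not ring_is_healthy(ring):
            print(f"{update_id}: halting rollout at {fraction:.0%} of the fleet")
            return False
        print(f"{update_id}: ring at {fraction:.0%} healthy, expanding")
    return True

staged_rollout("channel-update-291", [f"host-{i}" for i in range(10_000)])
```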
Lastly, CrowdStrike Founder and CEO George Kurtz stated that the company would provide full transparency on how the incident occurred and the steps it is taking to prevent anything like this from happening again.