How One Bad CrowdStrike Update Crashed the World’s Computers

Lily Hay Newman, Matt Burgess, and Andy Greenberg, July 19, 2024

Only a handful of times in history has a single piece of code managed to instantly wreck computer systems worldwide: The Slammer worm of 2003. Russia’s Ukraine-targeted NotPetya cyberattack. North Korea’s self-spreading ransomware WannaCry. But the ongoing digital catastrophe that rocked the internet and IT infrastructure worldwide over the last 12 hours appears to have been triggered not by malicious code released by hackers, but by the software designed to stop them.

A software update from cybersecurity company CrowdStrike appears to have inadvertently disrupted IT systems globally.

Two internet infrastructure disasters collided on Friday to produce disruptions around the world in airports, train systems, banks, healthcare organizations, hotels, television stations, and more. On Thursday night, Microsoft’s cloud platform Azure experienced a widespread outage. By Friday morning, the situation turned into a perfect storm when the security firm CrowdStrike released a flawed software update that sent Windows computers into a catastrophic reboot spiral. It remains unclear if or how the two IT failures are connected.

The cause of one of those two disasters, at least, has become clear: buggy code pushed out as an update to CrowdStrike’s Falcon monitoring product, essentially an antivirus platform that runs with deep system access on “endpoints” like laptops, servers, and routers to detect malware and suspicious activity that could indicate compromise. Falcon requires permission to update itself automatically and regularly, since CrowdStrike is constantly adding detections to the system to defend against new and evolving threats. The downside of this arrangement, though, is the risk that this system, which is meant to enhance security and stability, could end up undermining them instead.

“It’s the biggest case in history—we’ve never had a worldwide workstation outage like this,” says Mikko Hyppönen, the chief research officer at cybersecurity company WithSecure. Around a decade ago, Hyppönen says, widespread outages were more common due to the spread of worms or trojans. More recently, global outages have happened on the “server side” of systems, meaning they often stem from problems at cloud providers such as Amazon Web Services, internet cable cuts, or authentication and DNS issues.

CrowdStrike CEO George Kurtz said on Friday that the issues were caused by a “defect” in code the company released for Windows. Mac and Linux systems are not affected. “The issue has been identified, isolated and a fix has been deployed,” Kurtz said in a statement, adding the problems were not the result of a cyberattack. In an interview with NBC, Kurtz apologized for the disruption and said it may take some time for things to be back to normal.

Security and IT analysts searching for the root cause of the gargantuan outage say that it appears to be related to a “kernel driver” update to CrowdStrike’s Falcon software. Kernel drivers are the software components that allow applications to interact with Windows at its deepest level, the core of the operating system known as its kernel. That highly sensitive level of access is necessary for security software, so that it can run prior to any malicious software installed on the system and access any part of the system where hackers might seek to plant their code. As malware has improved and evolved, it has pushed defense software to require constant connection and more extensive control.

That deeper access also introduces a far higher possibility that security software—and updates to that software—will crash the whole system, says Matthieu Suiche, head of detection engineering at the security firm Magnet Forensics. He compares running malicious code detection software at the kernel level of an operating system to “open-heart surgery.”

Yet it’s nonetheless surprising that a kernel driver update would be able to cause such a massive global computer crash, says Costin Raiu, who worked at Russian security software firm Kaspersky for 23 years and led its threat intelligence team before leaving the company last year. During his years at Kaspersky, he says, driver updates for Windows software were closely scrutinized and tested for weeks before they were pushed out.

More importantly, such driver updates require that Microsoft also vet the code and cryptographically sign it, suggesting that Microsoft, too, may well have missed whatever bug in CrowdStrike’s Falcon driver triggered this outage. “It’s surprising that with the extreme attention paid to driver updates, this still happened,” says Raiu. “One simple driver can bring down everything. Which is what we saw here.”

Microsoft did not return requests for comment about update oversight and whether the Azure outage and CrowdStrike situation have any connection. However, a Microsoft spokesperson says the “CrowdStrike update was responsible for bringing down a number of IT systems globally.”

Raiu adds that even so, CrowdStrike is far from the only security firm to trigger Windows crashes with a driver update. Updates to Kaspersky and even Windows’ own built-in antivirus software Windows Defender have caused similar “Blue Screen of Death” crashes in years past, he notes. “Every security solution on the planet has had their CrowdStrike moments,” Raiu says. “This is nothing new but the scale of the event.”

Cybersecurity authorities around the world have issued alerts about the disruption, but have similarly been quick to rule out any nefarious activity by hackers. “The NCSC assesses that these have not been caused by malicious cyber attacks,” Felicity Oswald, the CEO of the UK’s National Cyber Security Centre, said. Officials in Australia have come to the same conclusion.

Nevertheless, the impact has been sweeping and dramatic. Around the world, the outages have been spiraling as companies, public bodies, and IT teams race to fix bricked machines, which involves manually taking each machine through a series of corrective steps, including rebooting. In the UK, Israel, and Germany, healthcare services and hospitals saw the systems they use to communicate with patients disrupted, and some canceled appointments. Emergency services in the US using 911 have reportedly had problems with their lines too. In the earliest hours of the outages, some TV stations, including Sky News in the UK, stopped live news broadcasts.

Global air travel has been one of the most impacted sectors so far. Huge lines formed at airports around the world, with one airport in India using handwritten boarding passes. In the US, Delta, United, and American Airlines grounded all flights at least temporarily, and a dramatic graphic showed air traffic plummeting above the US.

The catastrophic situation reflects the fragility and deep interconnectedness of the internet. Numerous security practitioners told WIRED that they had anticipated, or even worked with clients to protect against, a scenario in which defense software itself caused cascading failures as a result of malicious exploitation or human error, as appears to be the case with CrowdStrike. “This is an incredibly powerful illustration of our global digital vulnerabilities and the fragility of core internet infrastructure,” says Ciaran Martin, a professor at the University of Oxford and the former head of the UK’s National Cyber Security Centre.

The ability of one update to trigger such massive disruption still puzzles Raiu. According to Gartner, a market research firm, CrowdStrike accounts for 14 percent of the security software market by revenue, meaning its software is on a wide array of systems. Raiu suggests that the Falcon update must have triggered crashes at cloud providers such as Azure and Amazon Web Services, which vastly multiplied the disaster. “CrowdStrike is big, but it can’t be this big,” Raiu says. “Airports, critical infrastructure, hospitals. It cannot be just CrowdStrike everywhere. I suspect we’re seeing a combination of factors, a cascading effect, a chain reaction.”

Hyppönen, from WithSecure, says his “guess” is that the issues may have happened due to “human error” in the update process. “An engineer at CrowdStrike is having a really bad day,” he says. Hyppönen suggests that CrowdStrike could have shipped software different from what it had been testing, mixed up files, or run into a combination of different factors. “Software like this has to go through extensive testing,” Hyppönen says. “That’s what we do. That’s what CrowdStrike, of course, does. You have to be really careful about what you ship, which is tough to do because security software is updated very frequently.”

While many of the impacts of the outage are ongoing and still unraveling, the nature of the problem means that individually impacted machines may need to be rebooted manually, rather than through an automated process. “It could be some time for some systems that just automatically won’t recover,” CrowdStrike CEO Kurtz told NBC.

The company’s initial “workaround” guidance for dealing with the incident says Windows machines should be booted into safe mode, a specific file should be deleted, and the machine should then be rebooted. “The fixes we’ve seen so far mean that you have to physically go to every machine, which will take days, because it’s millions of machines around the world which are having the problem right now,” says Hyppönen from WithSecure.

As system administrators race to contain the fallout, the larger existential question of how to prevent another, similar crisis looms large.

“People may now demand changes in this operating model,” says Jake Williams, vice president of research and development at the cybersecurity consultancy Hunter Strategy. “For better or worse, CrowdStrike has just shown why pushing updates without IT intervention is unsustainable.”