Thoughts on the CrowdStrike Outage
Today, 19th July 2024, will be remembered as the day of the so-called ‘largest global IT outage’ in recent history.
CrowdStrike, a leading cybersecurity company, released an update for its Falcon sensor software, a component of its cloud-based endpoint protection solution designed to assist with real-time threat detection and incident response. This update was at the heart of the incident.
Endpoint security software updating itself is nothing new, but this update contained a problematic channel file, which caused Windows endpoints that installed it to crash with a blue screen of death (BSoD), rendering them inoperable.
Because these kinds of security platforms are designed to distribute the latest security and antivirus updates as soon as possible, allowing endpoints to respond rapidly to newly discovered threats, the faulty update was installed on a huge number of endpoints across the globe in a very short time. This led to major service outages worldwide: card payments could not be processed, flights and trains were cancelled, emergency departments were unable to function, and so on. It even prompted calls for emergency government task force meetings. All because of a faulty file in an update.
A workaround soon made the rounds online: reboot the affected endpoints into Safe Mode (or the Windows Recovery Environment) and delete the faulty file. CrowdStrike pulled the update containing the defective file and ‘published a fix’, which obviously did not fix anything on its own, because the affected machines were still stuck at a BSoD.
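For context, the workaround boiled down to deleting the channel file(s) matching C-00000291*.sys from the CrowdStrike driver directory once the machine was booted into Safe Mode or the recovery environment. The sketch below shows that single step in Python purely for illustration; the path and filename pattern come from the publicly circulated guidance, and in practice most people did this by hand or with a small script.

```python
# Minimal sketch of the manual workaround, assuming the machine has been
# booted into Safe Mode / the Windows Recovery Environment and that a Python
# interpreter is available there (an assumption purely for illustration).
from pathlib import Path

CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
FAULTY_PATTERN = "C-00000291*.sys"  # the defective channel file(s), per public guidance

def remove_faulty_channel_files(directory: Path = CROWDSTRIKE_DIR) -> int:
    """Delete channel files matching the faulty pattern; return how many were removed."""
    removed = 0
    for channel_file in directory.glob(FAULTY_PATTERN):
        print(f"Deleting {channel_file}")
        channel_file.unlink()
        removed += 1
    return removed

if __name__ == "__main__":
    count = remove_faulty_channel_files()
    print(f"Removed {count} file(s); reboot normally once complete.")
```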
It almost goes without saying that CrowdStrike made a mistake. Was this update not tested in a controlled environment before being released to the public? Are these updates not released in a staged manner, first to a small number of endpoints, then to a larger number, then to all endpoints? Were these processes not in place, or did they fail?
In addition to the large number of critical services being down, the volume of inaccurate information being spread on the subject was concerning. Throughout the day, news coverage fixated on the topic. Although some of the experts interviewed did, in fact, understand what they were talking about, many did not. Sky News, for example, posted an article questioning whether the UK’s prime minister was too distracted by other events to react to the crisis. I can only imagine the number of people who read this and thought the prime minister had some control over the situation.
Another example of inaccurate reporting came from one of the BBC’s cyber correspondents, who posted the following to the live news feed on the BBC website:
‘The irony here, of course, is that Crowdstrike is a cyber-security product, so designed to protect computer networks from outages.’
This statement demonstrates the correspondent’s misunderstanding of which cybersecurity product was affected, what it is meant to do, and what the actual issue was.
Do they think networks went down, causing the issue? Do they understand that this security software is designed to protect endpoints from malware and viruses, not to ensure service availability? Do they understand the issue was caused by a faulty update, not a cyber attack?
The same correspondent then followed up with another post, stating:
‘It appears that the so-called “Blue Screen of Death” that computers are suffering means that each one needs to get “hands on keyboards treatment”.
That is, it appears to be not something that can be fixed with a central command from an IT administrator in a firm’s HQ. They will need to go and reboot each and every computer.’
There are several issues with this post.
Firstly, a reboot by itself will not fix this issue. I can only imagine the number of people who read this, sitting at their desks with their computer displaying a BSoD, attempting to reboot their machine, only to realise nothing had changed.
Secondly, most of these inaccessible critical services are hosted on virtualised servers, not installed on bare metal. This part of the issue can therefore be remedied from a desk in the IT department, without physically interacting with each machine.
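As a rough illustration of what that ‘from a desk’ remediation can look like: an administrator can attach an affected VM’s virtual system disk to a healthy helper VM (or use the hypervisor’s recovery tooling), delete the faulty channel files from the offline volume, then detach the disk and boot the VM again. The drive letter in the sketch below is an assumption for the example, not anything prescribed.

```python
# Sketch only: the affected VM's system disk is assumed to be attached to a
# healthy helper VM and visible as drive E:. No physical access to the
# original machine is required.
from pathlib import Path

OFFLINE_VOLUME = Path(r"E:\Windows\System32\drivers\CrowdStrike")

for channel_file in OFFLINE_VOLUME.glob("C-00000291*.sys"):
    print(f"Deleting {channel_file}")
    channel_file.unlink()
# Detach the disk and power the original VM back on once the files are gone.
```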
Thirdly, there is an alternative way to resolve the problem that I did not see mentioned online: restoring from a backup. Veeam, configured correctly, is one example of a fantastic product for backing up and restoring virtual servers that would have been able to help bring critical services back quickly. I assume this is what some engineers chose to do. After years of crypto-locker attacks, organisations should already have a backup and restore process in place for their servers, with backups stored offsite and offline.
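The reasoning behind that option is simply to roll back to the newest restore point taken before the faulty update went out. The sketch below illustrates that selection logic only; the cut-off time and restore-point records are assumed example data, not output from Veeam or any other real backup platform.

```python
# Illustration only: pick the newest restore point taken *before* the faulty
# update was published. All data below is assumed for the example.
from datetime import datetime, timezone

FAULTY_UPDATE_PUBLISHED = datetime(2024, 7, 19, 4, 0, tzinfo=timezone.utc)  # approximate cut-off

restore_points = [
    {"vm": "payments-app-01", "taken_at": datetime(2024, 7, 19, 1, 0, tzinfo=timezone.utc)},
    {"vm": "payments-app-01", "taken_at": datetime(2024, 7, 19, 6, 0, tzinfo=timezone.utc)},
]

def last_known_good(points, cutoff):
    """Return the newest restore point taken strictly before the cutoff, or None."""
    candidates = [p for p in points if p["taken_at"] < cutoff]
    return max(candidates, key=lambda p: p["taken_at"]) if candidates else None

print(last_known_good(restore_points, FAULTY_UPDATE_PUBLISHED))
```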
The widespread reporting of inaccurate information on this incident is concerning. It highlights the need for better education on IT infrastructure and cybersecurity, not just for the general public but also for journalists and news correspondents.
The final point I want to make is that this incident highlights the dangers of relying on a single point of failure. Choosing a product like CrowdStrike Falcon and then configuring it to update instantly on your production servers and user workstations makes it a single point of failure for your whole IT infrastructure, with no easy way to mitigate a faulty update such as the one released today. Deciding to update all of your endpoints as soon as possible is, in effect, deciding that incidents like today’s are an acceptable risk.
Of course, you could run no endpoint security solution at all, but that is an even worse decision than accepting CrowdStrike as a single point of failure.
There is a middle ground, though: stagger the release of updates, pushing them first to test servers and a small batch of user workstations, and only rolling them out to the remaining endpoints after some time. This gives you a window to detect any issues on the test endpoints and halt the rollout to everything else. Your endpoint security software may not be updated instantly, but you can be more confident that it will not cause problems when it reaches the servers hosting your critical services.
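To make the idea concrete, here is a rough sketch of that kind of ringed rollout. The ring names, groups, and soak times are assumptions for illustration, not real CrowdStrike Falcon policy settings; the important property is that production endpoints never receive an update until it has soaked on the earlier rings.

```python
# Illustrative ringed-rollout schedule; all names, groups, and delays are
# assumptions, not real Falcon policy options.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class UpdateRing:
    name: str
    endpoint_group: str    # e.g. a directory group or tag identifying the endpoints
    soak_delay: timedelta  # how long to wait after release before this ring updates

ROLLOUT = [
    UpdateRing("test",       "grp-test-endpoints",  timedelta(hours=0)),
    UpdateRing("pilot",      "grp-it-workstations", timedelta(hours=4)),
    UpdateRing("production", "grp-all-remaining",   timedelta(hours=24)),
]

def schedule(release_time: datetime) -> dict[str, datetime]:
    """Return the time each ring becomes eligible for an update released at release_time."""
    return {ring.name: release_time + ring.soak_delay for ring in ROLLOUT}

if __name__ == "__main__":
    release = datetime(2024, 7, 19, 4, 0, tzinfo=timezone.utc)  # example release time
    for ring_name, due in schedule(release).items():
        print(f"{ring_name}: eligible from {due:%Y-%m-%d %H:%M %Z}")
```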
While it’s tempting to assign all the blame for the outage to CrowdStrike, and they should absolutely take their fair share, an issue still lies in the broader management of cybersecurity. Organisations must prioritise robust security measures over mere compliance, testing updates in controlled environments before broad deployment. Regulators also need to ensure their frameworks do more good than harm. Let’s use this incident as a catalyst for building more resilient and secure IT infrastructures.