What caused the global CrowdStrike system failure?

The incident and its root cause

On July 19, 2024, cybersecurity company CrowdStrike faced a major crisis when an update for its Falcon sensor software crashed Windows systems around the world. The issue, termed the “Channel File 291 Incident,”[1] affected an estimated 8.5 million Windows devices worldwide.[2]

According to CrowdStrike’s detailed root cause analysis,[3] the problem arose from a content validation error during a software update. The update delivered new content for a Template Type designed to detect new attack methods.

However, a mismatch between the expected and actual number of input parameters led to an out-of-bounds memory read, causing widespread system failures. According to CrowdStrike:[4]

On July 19, 2024, a Rapid Response Content update was delivered to certain Windows hosts, evolving the new capability first released in February 2024. The sensor expected 20 input fields, while the update provided 21 input fields. In this instance, the mismatch resulted in an out-of-bounds memory read, causing a system crash. Our analysis, together with a third-party review, confirmed this bug is not exploitable by a threat actor.

Technical details and impact

The crash occurred because the Template Type defined 21 input fields, while the sensor code supplied only 20 values. When the new content tried to match on the 21st field, the sensor read a value past the end of the input array, and the system crashed. The new version of Channel File 291 was the first to use this 21st input, which exposed a gap in the testing process.
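The mismatch can be illustrated with a toy model. This is only a sketch: every name in it is hypothetical, and Python raises an IndexError at the point where native kernel-mode code would instead read memory past the end of the array.

```python
# Toy model of the field-count mismatch; all names are hypothetical.
TEMPLATE_FIELD_COUNT = 21   # fields the content expects to be able to read
SENSOR_INPUT_COUNT = 20     # values actually supplied


def read_field(inputs, index):
    """Return the value at `index` with no bounds check of our own."""
    # In native code this would be a raw read past the end of the array;
    # Python raises IndexError instead.
    return inputs[index]


def evaluate(inputs, field_count):
    """Read every expected field from the supplied inputs."""
    return [read_field(inputs, i) for i in range(field_count)]


inputs = [f"value-{i}" for i in range(SENSOR_INPUT_COUNT)]

try:
    evaluate(inputs, TEMPLATE_FIELD_COUNT)
    outcome = "ok"
except IndexError:
    outcome = "out-of-bounds read on field 21"

print(outcome)
```

Reading only the first 20 fields succeeds; the failure appears only when the 21st is requested.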

The company’s existing quality assurance procedures failed to catch this error because the tests used wildcard matching criteria for the 21st field, so they never exercised a case in which that field was actually read.
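A short sketch shows how wildcard-only test content can hide this kind of defect. The matcher below is an assumption made for illustration, not CrowdStrike's code: it skips any wildcard criterion without reading the corresponding input, so the missing 21st value is never touched.

```python
# Hypothetical matcher: wildcard criteria are skipped without reading the input.
WILDCARD = "*"


def matches(criteria, inputs):
    """Compare each criterion with its input, skipping wildcard criteria."""
    for index, criterion in enumerate(criteria):
        if criterion == WILDCARD:
            continue  # the input at this index is never read
        if inputs[index] != criterion:  # IndexError if the input is missing
            return False
    return True


inputs = ["x"] * 20  # only 20 values are supplied

# Test content with a wildcard in the 21st position passes cleanly.
wildcard_criteria = ["x"] * 20 + [WILDCARD]
wildcard_result = matches(wildcard_criteria, inputs)

# Content with a concrete 21st criterion forces the read and exposes the gap.
concrete_criteria = ["x"] * 20 + ["y"]
try:
    matches(concrete_criteria, inputs)
    concrete_result = "no error"
except IndexError:
    concrete_result = "IndexError"

print(wildcard_result, concrete_result)
```

A test suite built only from content like `wildcard_criteria` would report success even though the 21st input does not exist.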

The fallout was significant: CrowdStrike’s stock price dropped, and several major clients, including Delta Air Lines, considered legal action over the substantial disruptions. The incident led to an estimated $500 million in losses for Delta, alongside losses at many other affected organizations worldwide.[5]

CrowdStrike took steps to prevent future incidents

In response to the incident, CrowdStrike has implemented several measures to keep such an event from recurring. The company has updated the Falcon platform to include more rigorous input validation checks and has enhanced its testing processes.

The company has added runtime input array bounds checks to prevent out-of-bounds memory reads and has improved the deployment process for its updates. These steps are aimed at bolstering the system’s resilience and preventing similar issues in the future.
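A runtime bounds check of the kind described can be sketched as follows. The function names and the choice to treat a missing input as "no match" are assumptions made for illustration; the source does not say how the sensor handles the failed check.

```python
from typing import Optional, Sequence


def read_field_checked(inputs: Sequence[str], index: int) -> Optional[str]:
    """Return inputs[index], or None when index falls outside the array."""
    if 0 <= index < len(inputs):
        return inputs[index]
    return None


def matches_checked(criteria, inputs, wildcard="*"):
    """Evaluate criteria; a missing input yields False rather than a crash."""
    for index, criterion in enumerate(criteria):
        if criterion == wildcard:
            continue
        value = read_field_checked(inputs, index)
        if value is None or value != criterion:
            return False
    return True


# 21 criteria against 20 inputs now fails safely instead of raising.
safe_result = matches_checked(["x"] * 20 + ["y"], ["x"] * 20)
print(safe_result)
```

The explicit `0 <= index` test matters in Python because a negative index would otherwise wrap around to the end of the sequence.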

CrowdStrike has committed to increasing test coverage during Template Type development to include test cases for non-wildcard matching criteria for each field. This will ensure comprehensive validation before deployment.

Additionally, the Content Validator will be updated with new checks to ensure that content in Template Instances does not reference more input fields than are actually provided. CrowdStrike also plans to give customers more control over the delivery of Rapid Response Content, allowing staged deployments to mitigate risks.
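As a rough sketch of such a validator check, the hypothetical function below rejects any instance that has a non-wildcard criterion at a position beyond the number of inputs provided. The data layout and names are assumptions made for illustration; the actual Content Validator is not public.

```python
def validate_instance(criteria, provided_input_count, wildcard="*"):
    """Return a list of error messages; an empty list means the instance passes."""
    errors = []
    for index, criterion in enumerate(criteria):
        # A concrete criterion beyond the provided inputs could never be read safely.
        if criterion != wildcard and index >= provided_input_count:
            errors.append(
                f"field {index + 1} has a non-wildcard criterion, "
                f"but only {provided_input_count} inputs are provided"
            )
    return errors


ok_errors = validate_instance(["x"] * 20 + ["*"], 20)
bad_errors = validate_instance(["x"] * 20 + ["y"], 20)
print(len(ok_errors), len(bad_errors))
```

Running a check like this before release would block the problematic content at validation time, independently of any runtime safeguard.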

To further ensure the robustness of their systems, CrowdStrike has engaged two independent third-party software security vendors to conduct extensive reviews of the Falcon sensor code for both security and quality assurance.

The company is also working with Microsoft to enhance security functions in user space, reducing reliance on kernel drivers. This collaboration aims to improve the overall stability and security of Windows systems integrated with CrowdStrike’s solutions.

Industry and regulatory response

The global outage had far-reaching effects, prompting regulatory and industry scrutiny. The Electronic Frontier Foundation called for tougher antitrust enforcement, emphasizing the dangers of digital monocultures and the need for more stable and secure digital infrastructures. CrowdStrike’s CEO, George Kurtz, has been called to testify before the US Congress to explain the incident and outline the steps taken to prevent future occurrences.

The aftermath of the incident has led to significant financial and reputational damage for CrowdStrike. The company remains committed to learning from this event and strengthening its systems to provide better service and security to its customers.

About the author
Gabriel E. Hall

Gabriel E. Hall is a passionate malware researcher who has been working for 2-spyware for almost a decade.
