Summary:

The CrowdStrike outage on July 19, 2024, triggered a massive global IT disruption, affecting approximately 8.5 million Microsoft Windows systems worldwide. The incident, caused by a faulty configuration update to CrowdStrike’s Falcon Sensor security software, resulted in system crashes and boot failures across numerous industries and government services. The outage’s impact was exacerbated by concurrent issues with Microsoft’s Azure platform, leading to widespread service interruptions in critical infrastructure such as aviation, banking, healthcare, and emergency services. With estimated financial losses of at least $10 billion globally, this event has been dubbed the largest IT outage in history. This blog post provides a detailed overview of the incident, its impact, and the lessons learned.

CrowdStrike outage caused Blue Screen of Death for many users

Who is CrowdStrike and what do they do?

CrowdStrike is an American cybersecurity company based in Austin, Texas. CrowdStrike’s core technology, the Falcon platform, offers advanced endpoint security, leveraging artificial intelligence and behavioral analytics to detect and prevent cyberattacks in real-time.

What caused the CrowdStrike outage?

The global outage stemmed from CrowdStrike’s Falcon security software, which operates at a deep level within the Windows operating system. The Falcon platform integrates tightly with the Windows kernel to provide endpoint security. On July 19, 2024, CrowdStrike released a faulty configuration update for its Falcon sensor software running on Windows PCs and servers.

The faulty update caused an out-of-bounds memory read in the Windows sensor client, which triggered an invalid page fault and a Blue Screen of Death (BSOD). Specifically, the operating system crashed because of an out-of-bounds read, a memory safety error, in CrowdStrike’s CSagent.sys driver, which is registered as a file system filter driver to receive notifications about file operations.
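To make the failure mode concrete, here is a minimal Python sketch of an out-of-bounds read and the bounds check that guards against it. It only models the logic: Python raises an IndexError instead of touching invalid memory, and the function names and field counts are illustrative assumptions, not CrowdStrike’s code.

```python
from typing import Optional, Sequence


def read_field(values: Sequence[str], index: int) -> str:
    """Unchecked read: trusts that `index` is valid for `values`."""
    return values[index]


def read_field_checked(values: Sequence[str], index: int) -> Optional[str]:
    """Checked read: returns None instead of reading past the end."""
    if 0 <= index < len(values):
        return values[index]
    return None


# Hypothetical input: 20 values are supplied, but a rule asks for a 21st.
supplied = [f"value-{i}" for i in range(20)]
requested_index = 20  # zero-based index of a field that was never supplied

try:
    read_field(supplied, requested_index)
    unchecked_result = "read succeeded"
except IndexError:
    # Python stops the read here. An equivalent unchecked read in native
    # kernel-mode code has no such safety net and can crash the system.
    unchecked_result = "out-of-bounds read"

checked_result = read_field_checked(supplied, requested_index)
print(unchecked_result, checked_result)
```

The checked variant turns a fatal condition into a value the caller can handle, which is the general idea behind validating input before kernel-level code acts on it.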

As a result, Windows systems either entered a boot loop or booted into recovery mode. The deep integration of CrowdStrike’s software with the Windows kernel meant that when the CrowdStrike update failed, it caused widespread system crashes, affecting an estimated 8.5 million Windows devices.

These crashes, in turn, led to operational disruptions in Microsoft’s cloud infrastructure. Microsoft was not responsible for the initial problem; it was affected because its operating system and cloud services are so widely used.


Anatomy of the CrowdStrike Sensor Update Failure

On July 19, 2024, at 04:09 UTC, CrowdStrike released a sensor configuration update (Channel File 291) that contained a logic error, triggering system crashes on affected machines. By 05:27 UTC, CrowdStrike had identified the issue and reverted the changes, but the damage was already widespread.

Specifically:

  1. Logic error in Channel File: Channel File 291 contained a logic error that caused system crashes when the Falcon sensor processed it.
  2. Named pipe execution evaluation: Channel File 291 controls how the Falcon sensor evaluates named pipe execution on Windows systems. The update was intended to target newly observed malicious named pipes used in cyberattacks, but the logic error in the file caused unintended behavior.
  3. Kernel-level crash: Although Channel Files are not kernel drivers themselves, they are processed by CrowdStrike’s kernel-level code. When the faulty file was processed, it caused the highly trusted operating-system-level code to malfunction, bringing down the entire Windows system.
  4. Uninitialized data issue: Analysis of crash dumps and disassembly suggested that the crash arose from attempting to use uninitialized data as a pointer – a “wild pointer” – leading to the system crash.
  5. BSOD error messages: Users reported seeing BSOD errors with messages such as “DRIVER_OVERRAN_STACK_BUFFER” and “SYSTEM_THREAD_EXCEPTION_NOT_HANDLED”.
  6. Reboot loop: Upon rebooting an affected system, it would almost immediately start up the Falcon sensor and crash again, creating a loop of BSODs.

This issue affected Windows systems running Falcon sensor version 7.11 and above that downloaded the updated configuration between 04:09 UTC and 05:27 UTC on July 19, 2024.
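The exposure criteria above can be expressed as a small check, which may help when triaging an inventory of hosts. This is a minimal sketch: the function names are hypothetical, it assumes both ends of the time window are inclusive, and it compares only the major and minor version numbers.

```python
from datetime import datetime, timezone
from typing import Tuple

# Exposure window and minimum sensor version, taken from the text above.
WINDOW_START = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
WINDOW_END = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)
MIN_AFFECTED_VERSION = (7, 11)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '7.11' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_potentially_affected(sensor_version: str, config_downloaded_at: datetime) -> bool:
    """Return True if a Windows host matches the stated exposure criteria.

    `config_downloaded_at` must be timezone-aware (UTC); comparing a naive
    datetime against the window raises TypeError.
    """
    version_ok = parse_version(sensor_version)[:2] >= MIN_AFFECTED_VERSION
    in_window = WINDOW_START <= config_downloaded_at <= WINDOW_END
    return version_ok and in_window


if __name__ == "__main__":
    # Hypothetical inventory rows: (host name, sensor version, download time).
    inventory = [
        ("host-a", "7.11", datetime(2024, 7, 19, 4, 30, tzinfo=timezone.utc)),
        ("host-b", "7.15", datetime(2024, 7, 19, 6, 0, tzinfo=timezone.utc)),
        ("host-c", "6.58", datetime(2024, 7, 19, 4, 30, tzinfo=timezone.utc)),
    ]
    for name, version, downloaded_at in inventory:
        print(name, is_potentially_affected(version, downloaded_at))
```

A match only means a host fits the published criteria; whether it actually crashed also depends on whether it was online and processed the file.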

Technical Resolution Steps

Both CrowdStrike and Microsoft took immediate action to address the issue:

  1. CrowdStrike quickly identified and deployed a fix for the faulty update.
  2. Microsoft deployed hundreds of engineers to work directly with customers on restoring services.
  3. The companies collaborated to develop a scalable solution to accelerate the fix for the faulty update.
  4. Microsoft provided manual remediation documentation and scripts for affected systems.

Challenges in System Restoration

The recovery process faced several challenges:

  1. Manual intervention required. The fix often required painstaking manual work. System administrators had to individually access each affected device, initiate safe mode, and manually remove the problematic CrowdStrike file. This made recovery time-consuming for large organizations.
  2. Cloud environments. AWS, Azure, and GCP presented unique challenges compared to on-premises systems, as they don’t support conventional recovery methods like “safe mode.”
  3. BitLocker Encryption. Many organizations had encrypted their computer drives for security reasons, which made it harder to locate and delete the problematic file. On drives protected by BitLocker, Microsoft’s disk encryption technology, administrators needed the BitLocker Recovery Key before they could access the disk at all.
  4. Coordination across multiple parties. Microsoft collaborated closely with CrowdStrike, other cloud providers like Google Cloud Platform (GCP) and Amazon Web Services (AWS), and affected customers to develop and implement solutions.
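The manual step described in item 1 can be sketched in a few lines. The directory and file name pattern below come from CrowdStrike’s publicly documented workaround rather than from this post, so treat them as assumptions and check current vendor guidance before using them. The function names are hypothetical, and the script defaults to a dry run that only lists matches.

```python
from pathlib import Path
from typing import List

# Assumed values from CrowdStrike's published workaround; verify before use.
DEFAULT_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def find_channel_files(directory: Path = DEFAULT_DIR) -> List[Path]:
    """Return files in `directory` that match the Channel File 291 pattern."""
    if not directory.is_dir():
        return []
    return sorted(directory.glob(CHANNEL_FILE_PATTERN))


def remove_channel_files(directory: Path = DEFAULT_DIR, dry_run: bool = True) -> List[Path]:
    """List matching files and delete them only when dry_run is False."""
    matches = find_channel_files(directory)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Dry run by default: report what would be removed without deleting it.
    for path in remove_channel_files(dry_run=True):
        print(f"would delete: {path}")
```

A script like this only helps once the machine is reachable, for example after booting into safe mode and unlocking a BitLocker-protected drive with its recovery key, which is why recovery at scale was slow.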

Global Impact

The outage caused massive disruptions across multiple countries and industries. Approximately 8.5 million Windows devices were directly affected by the CrowdStrike logic error, representing less than 1% of Microsoft’s global Windows install base. However, many of the affected systems supported critical operations, so the disruption spread widely across sectors such as airlines, healthcare, and financial services.

Critical services affected:

Financial Services

Financial institutions were significantly impacted due to their heavy reliance on IT infrastructure:

  • Banks: Major banks reported service disruptions, affecting online banking, ATM services, and internal operations.
  • Stock Exchanges: The London Stock Exchange Group faced an outage on its workspace platform, preventing the publication of statements.
  • Payment Systems: Payment terminals in Australia were affected, disrupting transactions.

Transportation

The transportation sector experienced severe disruptions:

  • Airlines: Major carriers including American Airlines, United Airlines, and Delta Air Lines were forced to ground flights. International airlines like Air India and KLM also reported disruptions.
  • Airports: Hong Kong International Airport, Berlin Brandenburg Airport, and London Stansted faced operational issues.
  • Public Transit: Systems in the Northeast United States, including Washington, DC, and New York City, experienced delays.

Healthcare

The healthcare sector faced critical challenges:

  • Hospitals: Hospitals in many countries reported service impacts, delays, and cancellations of non-urgent medical procedures and appointments.
  • Blood Supply Organizations: Faced challenges in distributing blood to hospitals, causing delays in reporting test results and impacting planned shipments.

Emergency Services

Critical emergency services were affected:

  • 911 Centers: Several states in the US reported outages and technological challenges in their 911 centers.
  • Emergency Dispatch: Computerized dispatch systems used by emergency services were impacted.

Government Services

Various government agencies and services were disrupted:

  • Federal Agencies: The Department of Homeland Security, Department of Justice, and Social Security offices in the US faced service disruptions and longer wait times for assistance.
  • Government Websites: Many government websites and online services were temporarily inaccessible.

Economic Impact

The financial consequences of this outage were substantial. Parametrix, an insurance services company, estimates that Fortune 500 companies may face losses exceeding $5 billion due to this incident.

  • Fortune 500 companies:
    • Total estimated losses of $5.4 billion in revenues and gross profit.
    • The health care sector took the hardest hit, with estimated losses of $1.94 billion.
    • The banking sector faced losses of approximately $1.15 billion.
    • Airlines collectively lost around $860 million.
  • Insurance coverage:
    • Only 10% to 20% of the losses are likely covered by cybersecurity insurance policies.
  • Broader economic impact:
    • The worldwide financial damage is estimated to be at least $10 billion.
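
A quick calculation puts these figures in proportion. This is a minimal sketch that uses only the numbers listed above; applying the 10% to 20% insurance range to the Fortune 500 total is an assumption about how that estimate is meant to be read.

```python
# Figures from the list above, in billions of US dollars.
TOTAL_LOSS_BN = 5.4
SECTOR_LOSS_BN = {"Health care": 1.94, "Banking": 1.15, "Airlines": 0.86}
INSURED_SHARE_LOW, INSURED_SHARE_HIGH = 0.10, 0.20

# Share of the Fortune 500 total attributed to each named sector.
sector_share = {name: loss / TOTAL_LOSS_BN for name, loss in SECTOR_LOSS_BN.items()}

# Losses not attributed to the three named sectors.
other_sectors_bn = TOTAL_LOSS_BN - sum(SECTOR_LOSS_BN.values())

# Assumption: the insured share applies to the Fortune 500 total.
insured_low_bn = TOTAL_LOSS_BN * INSURED_SHARE_LOW
insured_high_bn = TOTAL_LOSS_BN * INSURED_SHARE_HIGH

for name, share in sector_share.items():
    print(f"{name}: {share:.1%} of Fortune 500 losses")
print(f"Other sectors: ${other_sectors_bn:.2f}B")
print(f"Likely insured: ${insured_low_bn:.2f}B to ${insured_high_bn:.2f}B")
```

On these numbers, health care, banking, and airlines account for roughly 36%, 21%, and 16% of the Fortune 500 total, and only about $0.54 billion to $1.08 billion of the $5.4 billion would be covered by insurance.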

Leadership Response

CrowdStrike CEO George Kurtz issued a public apology on LinkedIn, stating, “We’re deeply sorry for the impact that we’ve caused to customers, to travelers, to anyone affected by this, including our company.” He assured that CrowdStrike was working diligently to restore all affected customer systems.

Microsoft’s leadership also responded promptly, with David Weston, Vice President of Enterprise and OS Security, detailing the company’s efforts to support customers through the crisis.

What’s Microsoft’s plan to prevent a similar outage?

Microsoft is taking several steps to prevent future incidents similar to the CrowdStrike outage. The company is actively engaging with third-party software security vendors through the Microsoft Virus Initiative (MVI) to share data and best practices. This collaboration aims to improve the overall security ecosystem and reduce the risk of similar incidents in the future.

Additionally, Microsoft is advising security software vendors to minimize their use of kernel mode sensors for data collection and enforcement. Instead, they recommend isolating the majority of key product functionality in user mode, where additional protections such as Virtualization-based Security (VBS) Enclaves, Protected Processes, and Event Tracing for Windows (ETW) are available.

Microsoft is also planning to work closely with cybersecurity vendors to help them leverage integrated Windows security features more effectively. By encouraging the use of these built-in security capabilities, Microsoft aims to enhance system stability and reduce the likelihood of kernel-level crashes that could lead to widespread outages.

Key Takeaways

  1. The CrowdStrike outage was caused by a memory safety issue in the CSagent.sys driver: an out-of-bounds read that triggered an access violation.
  2. Kernel-level access for security products is necessary for system-wide visibility, early threat detection, better performance, and tamper resistance.
  3. However, kernel drivers running at the most trusted level of Windows increase risks and can lead to widespread system crashes when critical issues occur.
  4. The incident highlighted the need for a balance between security product capabilities and the risks associated with kernel-mode operations.
  5. Collaboration between tech companies (Microsoft and CrowdStrike) was crucial in resolving the widespread issues and restoring affected systems.

Conclusion

The CrowdStrike outage of 2024 serves as a stark reminder of the vulnerabilities in our digital infrastructure and the far-reaching consequences of software failures. It underscores the need for improvement in update deployment practices, disaster recovery planning, and cross-industry collaboration. As the tech industry moves forward, prioritizing these areas will be crucial in preventing and mitigating similar incidents.

Further reading:

CrowdStrike Falcon root cause analysis (RCA) report

Microsoft’s technical overview of the CrowdStrike outage
