Last Friday, the world was in chaos. Millions of Windows computers suddenly blue-screened and stopped working. The outage crippled airports, banks, hospitals, and critical infrastructure around the world. It was complete madness, and many people initially assumed it was some kind of digital terrorist attack. We still don't have all the details, but what people quickly figured out is that the culprit was a corrupt CrowdStrike system file. A malformed file was pushed to millions of Windows computers through a security update; the CrowdStrike driver loaded it, Windows crashed, and the machines kept failing to boot, so everything just stopped working. It's kind of insane that a third-party company (CrowdStrike) can remotely push files onto Windows systems without meticulous testing. CrowdStrike's stock was deservedly in free fall after the outage.

This outage makes me wonder how fragile our global IT infrastructure really is. If a third-party vendor like CrowdStrike can wreak havoc on global IT systems simply by pushing out a bad update, what prevents Russian or North Korean hackers from causing enormous damage by infiltrating these vendors and planting malware in curious places? Here is a super depressing comment from a Hacker News thread:
CrowdStrike in this context is a NT kernel loadable module (a .sys file) which does syscall level interception and logs them to a separate process on the machine. It can also STOP syscalls from working if they are trying to connect out to other nodes and accessing files they shouldn't be (using some drunk ass heuristics).
What happened here was they pushed a new kernel driver out to every client without authorization to fix an issue with slowness and latency that was in the previous Falcon sensor product. They have a staging system which is supposed to give clients control over this but they pissed over everyone's staging and rules and just pushed this to production.
This has taken us out and we have 30 people currently doing recovery and DR. Most of our nodes are boot looping with blue screens which in the cloud is not something you can just hit F8 and remove the driver. We have to literally take each node down, attach the disk to a working node, delete the .sys file and bring it up. Either that or bring up a new node entirely from a snapshot.
This is fine but EC2 is rammed with people doing this now so it's taking forever. Storage latency is through the roof.
I fought for months to keep this shit out of production because of this reason. I am now busy but vindicated.
Edit: to all the people moaning about windows, we've had no problems with Windows. This is not a windows issue. This is a third party security vendor shitting in the kernel.
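To make the recovery loop the commenter describes a bit more concrete, here is a rough sketch of what it might look like on AWS with boto3. Everything specific in it (the region, the instance IDs, the /dev/sdf device slot, the rescue-node workflow) is an assumption for illustration; the actual deletion of the bad .sys file still happens by hand on the rescue node once the volume is mounted there.

```python
# Rough sketch of the per-node recovery loop described in the comment above,
# using boto3. The region, instance IDs, and the /dev/sdf slot on the rescue
# node are placeholders you would replace. The bad .sys file is still deleted
# manually on the rescue node after the volume is attached and mounted there.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

BROKEN_INSTANCE_IDS = ["i-0123456789abcdef0"]   # boot-looping nodes (placeholders)
RESCUE_INSTANCE_ID = "i-0fedcba9876543210"      # healthy node used for the repair


def root_volume(instance_id: str) -> tuple[str, str]:
    """Return (volume_id, device_name) for the instance's root EBS volume."""
    resp = ec2.describe_instances(InstanceIds=[instance_id])
    instance = resp["Reservations"][0]["Instances"][0]
    root_device = instance["RootDeviceName"]
    for mapping in instance["BlockDeviceMappings"]:
        if mapping["DeviceName"] == root_device:
            return mapping["Ebs"]["VolumeId"], root_device
    raise RuntimeError(f"no root volume found for {instance_id}")


for instance_id in BROKEN_INSTANCE_IDS:
    vol_id, root_device = root_volume(instance_id)

    # 1. Stop the boot-looping node so its root volume can be detached.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # 2. Move the root volume over to the rescue node.
    ec2.detach_volume(VolumeId=vol_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol_id])
    ec2.attach_volume(VolumeId=vol_id, InstanceId=RESCUE_INSTANCE_ID, Device="/dev/sdf")
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[vol_id])

    # 3. Manual step: on the rescue node, mount the volume, delete the bad
    #    CrowdStrike .sys file under Windows\System32\drivers\CrowdStrike,
    #    then unmount it.
    input(f"Fix {vol_id} on the rescue node, then press Enter...")

    # 4. Move the volume back and boot the original node again.
    ec2.detach_volume(VolumeId=vol_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol_id])
    ec2.attach_volume(VolumeId=vol_id, InstanceId=instance_id, Device=root_device)
    ec2.start_instances(InstanceIds=[instance_id])
```

Multiply that loop by tens of thousands of nodes, with every other affected customer hammering EC2 at the same time, and you can see why the recovery dragged on.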
Individuals can follow instructions like the ones above to get a personal computer working again, but applying the fix to a huge server farm with tens of thousands of machines is a massive challenge, as the comment describes. This is bonkers, and I hope it doesn't give malicious people ideas about how to take over the world through a third-party security update.
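For a single personal machine, the widely shared workaround was to boot into Safe Mode and delete the bad CrowdStrike channel file. Here is a minimal sketch of that cleanup step in Python; the path and the C-00000291*.sys pattern follow the publicly reported guidance, and in practice most people simply deleted the file by hand from a command prompt, so treat this as an illustration rather than official remediation advice.

```python
# Minimal sketch of the single-machine cleanup: remove the bad CrowdStrike
# channel file(s) so Windows can boot normally again.
# Run from Safe Mode with administrator rights.

from pathlib import Path

# Publicly reported location and pattern of the offending file.
CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
BAD_FILE_PATTERN = "C-00000291*.sys"

for bad_file in CROWDSTRIKE_DIR.glob(BAD_FILE_PATTERN):
    print(f"Deleting {bad_file}")
    bad_file.unlink()

print("Done. Reboot the machine normally.")
```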