CrowdStrike Official RCA is now out [pdf]

  • > In summary, it was the confluence of these issues that resulted in a system crash: [...] the lack of a specific test for non-wildcard matching criteria in the 21st field.

    I feel they focus a lot on their content validator lacking a check to catch this specific error (probably since that sounds like a more understandable oversight) when the more glaring issue is that they didn't try actually running this template instance on even a single machine, which would've instantly revealed the issue.

    Even for amateur software with no unit/integration tests, the developer will typically still have run it on their own machine to see it working. Here CrowdStrike seem to have been flying blind, just praying new template instances work if they pass the validation checks.

    They do at least promise to "ensure that every new Template Instance is tested" further down.

  • It doesn't even cover the barest organisational root cause. How are they planning to do defense in depth and prevent any internal threat actor from wedging every machine in the world?

  • That's a lot of words to say "We did not test a file that gets ingested by a kernel level program, not even once"

    At no point did they deploy this file to a computer they owned and attempt to boot it. They purposely decided to deploy behavior to every computer they could without even once making sure it wouldn't break from something stupid.

    Are these people fucking nuts?

    I do more testing than this and I might be incompetent. Also nothing I touch will kill millions of PCs. I get having pressure put on you from above, I get being encouraged to cut corners so some shithead can check off a box on his yearly review and make more money while stiffing you on your raise, I get making mistakes.

    But like, fuck man, come on.

  • They should've read "Parse, don't validate": https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
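
    A minimal sketch of what that idea could look like here, not CrowdStrike's actual code. The names `ParsedInstance`, `parse_instance`, `matches` and the constant `EXPECTED_FIELDS` are hypothetical, and the field count of 20 is an assumption taken from the "20 rather than 21 inputs" wording quoted elsewhere in this thread. The point is that the untrusted fields are converted once into a type that cannot exist with the wrong shape, so downstream code never handles unchecked data:

```python
from dataclasses import dataclass

EXPECTED_FIELDS = 20  # assumption: how many inputs the consuming code supplies


@dataclass(frozen=True)
class ParsedInstance:
    """Hypothetical type; build it only through parse_instance()."""
    criteria: tuple


def parse_instance(raw_fields):
    """Turn untrusted fields into a ParsedInstance, or raise ValueError."""
    criteria = tuple(raw_fields)
    if len(criteria) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(criteria)}")
    return ParsedInstance(criteria)


def matches(instance, inputs):
    """Compare inputs against a parsed instance; '*' matches anything."""
    if not isinstance(instance, ParsedInstance):
        raise TypeError("matches() only accepts parsed instances")
    if len(inputs) != len(instance.criteria):
        return False
    return all(c == "*" or c == v for c, v in zip(instance.criteria, inputs))


inputs = [f"v{i}" for i in range(20)]
good = parse_instance(["*"] * 19 + ["v19"])
result = matches(good, inputs)

try:
    parse_instance(["*"] * 21)  # one field too many: rejected before use
    rejected = False
except ValueError:
    rejected = True
```

    With this shape, a 21-field instance fails at load time with an ordinary error instead of reaching the matching code at all.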

  • A lot of mitigation actions, but nothing that would really stop it happening again, such as a fail-safe system in their boot-start driver. Bad programming and QA caused the issue, but bad design allowed it to happen.

  • Add a new threat actor to the list, those pesky parameter counts actively trying to evade detection:

    "This parameter count mismatch evaded multiple layers of build validation and testing, as it was not discovered during the sensor release testing process, the Template Type (using a test Template Instance) stress testing or the first several successful deployments of IPC Template Instances in the field."

    Curious that csagent.sys isn't mentioned until the last page, p. 12:

    "csagent.sys is CrowdStrike’s file system filter driver, a type of kernel driver that registers with components of the Windows operating system…"

  • Well, I guess I should post the obligatory:

    > Some people, when confronted with a problem, think

    > “I know, I’ll use regular expressions.”

    > Now they have two problems.

  • Note: this was distributed to their customers today

  • Is it just me or does it seem like this change simply wasn't tested beyond a simple unit test?

  • kinda sounds like this was a regex bug?

    > The selection of data in the channel file was done manually and included a regex wildcard matching criterion in the 21st field for all Template Instances, meaning that execution of these tests during development and release builds did not expose the latent out-of-bounds read in the Content Interpreter when provided with 20 rather than 21 inputs.
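
    To illustrate why a wildcard in the 21st field would hide the problem, here is a toy model, assuming the interpreter skips the comparison entirely for wildcards. Everything in it (`WILDCARD`, `interpret`, the field values) is made up for illustration, and Python's `IndexError` only stands in for the out-of-bounds memory read a kernel driver would perform:

```python
WILDCARD = "*"  # hypothetical marker meaning "match anything, skip the comparison"


def interpret(criteria, inputs):
    """Return True if every non-wildcard criterion equals the input at its index."""
    for i, criterion in enumerate(criteria):
        if criterion == WILDCARD:
            continue  # wildcard: the input at index i is never read
        if inputs[i] != criterion:  # reads inputs[i] with no bounds check
            return False
    return True


inputs = [f"v{i}" for i in range(20)]          # only 20 values are supplied
wildcard_instance = [WILDCARD] * 21            # test data: 21st field is a wildcard
specific_instance = [WILDCARD] * 20 + ["v20"]  # 21st field is a concrete match

# The wildcard-only test data passes because index 20 is never touched.
wildcard_ok = interpret(wildcard_instance, inputs)

try:
    interpret(specific_instance, inputs)
    crashed = False
except IndexError:
    crashed = True  # Python raises; a kernel driver would read invalid memory
```

    So it reads less like a regex bug and more like a missing bounds check that the wildcard-only test data never exercised.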
