Why Hasn't Twitter Crashed Yet?

  • Halfway decent systems "just work".

    Back in the 1990s my wife worked at the ag school and they had a moment of panic when they realized they had no idea where the web server was. Turned out they had a tiny little HP PA-RISC machine in a closet, covered in dust bunnies, that had been running for two years without anybody thinking about it.

    Last night I wanted to create a webhook and decided to use AWS Lambda. I have a few things in AWS including Lambda functions. I figured I'd look at my old ones as a reference for my new one, but I was shocked to realize I had things that had been running for five years without any intervention at all.

    In both of my cases you have middling software and negligent management but the underlying hardware or services are reliable and high quality. It's not like the entrepreneur I knew who was always finding web hosting that was a lot cheaper than anyone else with the downside that every few months we had to move to another data center in a hurry.

  • Failure is proportional to change.

    A growing company is frequently changing. A company that launches new features is changing. A company trying to fix architecture is changing. The large workforces that a lot of Valley companies have are built around and justified by this growth/change.

    The change that Twitter will likely experience now is machine failure (3/1000 a day probably), hard drive expiration, potentially database promotions, and failures of cache machines.

    Automation can drive a lot of these to very small workloads, but capacity management is a potentially existential crisis looming over all tech companies.

    Then you get to the real problems that Twitter faces: political change, security change, and workforce rot.

    Political/regulatory change poses a problem because it often requires changes to infrastructure. This creates the type of change that can result in failure.

    Security change can be supply chain problems or bug reports. Maybe keys need to get rotated, new encryption added, software updated. All of these are change. All can result in failure, and potentially catastrophic failure.

    Lastly, the largest existential problem is that the engineers left at Twitter are likely not its best, and many of them are probably coerced into staying by H-1B regulations. Now you run into a problem of attrition and replacing that attrition. When your good engineers leave (or are overworked), it's harder to hire good engineers. The difference between a good engineer and a bad engineer is their `complexity to result` ratio. Good engineers can create simple solutions, while bad engineers create complex solutions, even though both might produce the same end result.

    Failure is also proportional to complexity and outage duration is most impacted by complexity.
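
    To put the "3/1000 a day" machine failure guess above into numbers, here is a rough sketch. The fleet size is a made-up figure for illustration, not anything known about Twitter, and the model assumes failures are independent and the rate is constant, which real hardware only loosely follows.

```python
# Rough arithmetic for the "3/1000 a day" machine failure guess.
# FLEET_SIZE is hypothetical; it is NOT a known figure for Twitter.

DAILY_FAILURE_RATE = 3 / 1000  # the commenter's guess
FLEET_SIZE = 100_000           # hypothetical, for illustration only


def expected_failures(fleet_size: int, daily_rate: float, days: int = 1) -> float:
    """Expected failures over `days`, assuming independent failures
    and that every failed machine is replaced (fleet size stays constant)."""
    return fleet_size * daily_rate * days


def fraction_never_failed(daily_rate: float, days: int) -> float:
    """Fraction of machines that have not failed after `days`
    if nobody repairs or replaces anything."""
    return (1 - daily_rate) ** days


if __name__ == "__main__":
    per_day = expected_failures(FLEET_SIZE, DAILY_FAILURE_RATE)
    per_year = expected_failures(FLEET_SIZE, DAILY_FAILURE_RATE, days=365)
    untouched = fraction_never_failed(DAILY_FAILURE_RATE, days=90)
    print(f"expected failures per day:  {per_day:.0f}")
    print(f"expected failures per year: {per_year:.0f}")
    print(f"never failed after 90 days, no repairs: {untouched:.1%}")
```

    The exact numbers matter less than the shape: the repair queue scales with fleet size even when nobody ships a single change, so somebody (or some automation) has to keep draining it.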

  • I don't think any sensible people thought Twitter was kept alive by huge amounts of intervention. You don't build systems like that. I had a side project which ran for a decade during which I barely looked at it. Because I built it that way.

    However, people-driven things like moderation, sales, etc. are a different issue. Degradation, if any, is likely to be in quality, not system crashes.

  • Because 80% of the company was cruft. As it is in most large orgs.

  • From what I recall from various blog posts by Twitter staff over the years, Twitter is incredibly lean and terse, and does one thing well: microblogging. It gets into trouble when serving long-form video (which Musk is trying to change with videos of 30 minutes or longer) or otherwise trying to provide rich media at scale, which is a hard problem. They can throw CDNs at the problem, but that gets expensive, fast, not to mention that Twitter video content is compressed almost to the point of being unwatchable. And as for increasing the character count: I welcome that, but rich media is the real problem, not long-form text.

  • Elon M has been removing the barnacles stuck on the Twitter boat. The boat will go fast without them.

    Unlike your Volkswagen CEO with a dim view of software in the past, Elon M understands software and packet switching at millisecond resolution and has demonstrated experience debugging complexity at that scale, equipping him to make informed and impactful decisions.

    Less is more.

  • Same reason there’s a code freeze around Christmas. Fewer changes means fewer chances to mess things up.

    If you fire 80% of the operations staff you better fire 80% of the eng staff too.

    As changes get made and released I expect to see some more outages.

    Also good staff made the plans and laid the groundwork for resiliency. That work continues to provide benefits after they’ve gone.

  • Because you don't need 7,000 people to run the bird site when 1,000 at most, or even fewer than 700, can do it.

    It has run absolutely fine since then despite the psychological panic and unfounded prophecies of immediate total collapse being spread by emotional doomsters still crying on Twitter.

    So it was all supposed to collapse any minute now™ faster than a lettuce [0]. Well I guess it is time for the doomsters to admit that they fell for their own FUD and Twitter essentially just overhired.

    [0] https://twitter.com/Foone/status/1599920814773919745