I'll admit that seeing they are single-homed makes Fastmail's infrastructure look sketchier than I had assumed. I've been using Fastmail for years and I like them, but they are clearly big enough to have a second transit provider, and have been for many years. I'm amazed it took an outage for them to decide to get one. I appreciate the post-mortem, but I felt better before I had read it.
I'm surprised they don't own their own IPs. In the email world I would say that's quite important. Seems a tad casual to say "luckily they are willing to lease them"...
I'm not clear from the post-mortem why the outbound packets were having issues. Was it Cloudflare? Did someone accidentally delete an outbound route? Why couldn't they see the issue themselves? I only have more questions now.
Crazy that they were single-homed.
Edit: After seeing the network diagram I have even more questions. What happens if CF is down? This all seems cobbled together and very prone to failures.
I don't know if IP range reputation is part of SMTP but I wish sending email worked without it.
Redundancy is the name of the game. Glad they realize this.
NYI. Good people.
Am I the only one who thinks that this post-mortem, for an incident affecting a marginal share of email users worldwide, is a pure marketing gimmick to attract more "technical" users?
Netgear switches? In an environment like this? I’ll give them the benefit of the doubt that that is maybe a provider-owned thing, and that they have an 'enterprise' line, but, really... Netgear. The firewall brand isn’t revealed in the network diagram, but what is it, a $100 SonicWall? Since I keep all my email there, business and personal, should I be concerned about what other parts of their infrastructure they are cheaping out on?
When you are running a service like this, redundancy among transit providers is the most basic, table-stakes thing you can do. It's almost negligent to not have that.
Incidents happen, that's life. You can hedge your bets, but some things are out of your control. Communicating with your customers, however, is entirely within your control. Fastmail did a poor job of it. Their status page was useless beyond an initial "we found an issue" and then nothing for almost 11 hours. Their Twitter account was the same story, and they didn't bother with the Mastodon account at all. Unfortunately they don't seem to realise or recognise that they dropped the ball on this, and that goes entirely unaddressed.
I'm also not really charmed by how they try to minimise the importance of the incident by repeating that it only affected 3-5% of the customers. That might very well be. But those are real people and real businesses that rely on your services, which were unavailable for the whole of the EU workday and a significant part of the US workday. Everyone I know who was affected is a paying customer, and none of us has received so much as a communication or apology for it.
For a company that's been on the internet since 1999, the single-homed setup is a little shocking. But fine, it's being addressed. The communication both during and after the incident, though, doesn't inspire a ton of confidence.