Gotta love how painfully vague this is. Sounds like a PR piece for investors, not an engineering blog post.
Even though the angle grinder story wasn’t accurate, it’d still be interesting to know what percentage of the time it took to fix the outage was spent on regaining physical access:
https://mobile.twitter.com/mikeisaac/status/1445196576956162...
I worked with a network engineer who misconfigured a router that was connecting a bank to its DR site. The engineer had to drive across town to manually patch into the router to fix it.
DR downtime was about an hour, but the bank fired him anyway.
Given that Zuck lost a substantial amount of money, I wonder if the engineer faced any ramifications.
Sidenote: I asked the bank infrastructure team why the DR site was in the same earthquake zone, and they thought I was crazy. They said if there's an earthquake we'll have bigger problems to deal with.
This is a funny post to have suggested at the bottom of the article: https://engineering.fb.com/2021/08/09/connectivity/backbone-...
So this is pure conspiracy theory, but to me this could be a security issue. What if something deep in the core of your infrastructure is compromised? Everything at risk? I'd ask my best engineer, he'd suggest shutting it down, and the best way to do that is to literally pull the plug on what makes you public. Tell everyone we accidentally messed up a BGP config and that's it.
But yeah, likely not.
It was interesting to visit the subreddits of random countries (e.g. /r/Mongolia) and see the top posts all asking if fb/Insta/WhatsApp being down was local or global. I got the impression this morning that it was only affecting NA and Europe, but it looks like it was totally global. The number of people trying to log in must be staggering.
This more or less confirms what we’ve heard, and I appreciate the speed, but it’s incredibly lame from a details point of view.
Will a real postmortem follow? Or is this the best we are gonna get?
Sounds like they could do with some updates to their risk-driven backbone management strategy!
https://engineering.fb.com/2021/08/09/connectivity/backbone-...
The badge story only shows how people are looking for "efficiency" where it doesn't matter, with predictable results.
The badge system should be local to the building. There are few actual reasons (sure, besides "efficiency") why badge control should be centralized, and even fewer reasons for it to be a subdomain of fb. Another option would be to keep the system but make it fail-safe (though it seems the newer generation doesn't know what that means): if the network goes down, keep the last config. Badge validation should be offline-first, with added/removed badges broadcast periodically.
This is the same issue with smartlocks times the number of employees. Do you really want to add another point of failure between yourself and your home?
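Not Facebook's actual design, just a minimal Python sketch of the offline-first badge validation described above; the class name, sync interval, and fetch_delta callback are all invented for illustration:

    import time


    class BadgeReader:
        """Toy offline-first badge reader: the local allowlist is authoritative,
        and the central system only pushes periodic adds/removals."""

        def __init__(self, local_allowlist):
            self.allowlist = set(local_allowlist)  # would be persisted to disk in practice
            self.last_sync = 0.0

        def check(self, badge_id):
            # Validation never leaves the building: if the network is down,
            # the reader simply keeps running on the last known-good config.
            return badge_id in self.allowlist

        def try_sync(self, fetch_delta, interval_s=300):
            # Best-effort periodic sync; failures are ignored, never fatal.
            if time.time() - self.last_sync < interval_s:
                return
            try:
                added, removed = fetch_delta()  # e.g. an HTTP call to the central system
            except OSError:
                return  # offline: keep the last config, as suggested above
            self.allowlist |= set(added)
            self.allowlist -= set(removed)
            self.last_sync = time.time()

The key property is that check() never needs the network; a dead link just freezes the allowlist at its last known-good state.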
"To all the people and businesses around the world who depend on us, " ... yesterday was another example of why you shouldn't depend on us to such an extent.
On a side note: when I browse to that page in Firefox (92.0.1) from HN I can't go back to HN - the back arrow is disabled. What gives?
Well that doesn't say a whole lot... I know it is early but they could use a little more detail. Even if it is just a timeline.
It was quite ironic that while every Facebook property was offline, there was an immense amount of misinformation about the incident spreading across the internet (including right here on HN), which everyone just accepted as fact.
>We also have no evidence that user data was compromised as a result of this downtime.
No, that just happens during uptime.
> configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues
This could be anything, potentially.
I'm not very knowledgeable in computer networking, but this could be as trivial as an incorrect update to a DNS record, right?
It would be interesting to estimate what dollar value can be ascribed to a x-hour FB outage, both in terms of lost ad revenue for FB itself as well missed conversions/revenue for businesses running ads on FB/IG.
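A crude back-of-envelope for the FB side alone, with the annual ad revenue figure (~$100B) and the ~6-hour outage length assumed rather than sourced, and missed advertiser conversions ignored entirely:

    # Back-of-envelope only: assumes ~$100B annual ad revenue, a ~6 hour outage,
    # and revenue spread evenly across the year -- all crude simplifications.
    ANNUAL_AD_REVENUE_USD = 100e9
    OUTAGE_HOURS = 6
    HOURS_PER_YEAR = 365 * 24

    lost_ad_revenue = ANNUAL_AD_REVENUE_USD / HOURS_PER_YEAR * OUTAGE_HOURS
    print(f"~${lost_ad_revenue / 1e6:.0f}M in FB ad revenue")  # roughly $68M under these assumptions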
Knowing almost nothing about networking, isn't the way Facebook handles networking somewhat of a monolithic anti-pattern? Why is a single update responsible for taking out multiple services, and why wouldn't each product, or even each region within each product, have its own routes for resiliency, which could then be used to roll out changes more slowly?
By having a large centralized and monolithic system, aren't they guaranteeing that mistakes cause huge splash damage and don't separate concerns?
Just out of curiosity, does Facebook have a status page? Like http://status.twitter.com?
The first thing people here thought of was that it was the government denying access to these websites, as it usually does for a number of reasons.
Around the turn of the century, in a network the size of Europe, we had OOB comms to the core routers via ISDN/POTS. We experimented with mobile phones in the racks as well, much to the chagrin of the old telco guys running the PoPs.
The WhatsApp mobile app should notify you that the WhatsApp servers are down instead of letting you send messages that won't arrive for six hours.
Any FB throwaway know if someone got fired for this?
Why is this non-post on the front page? It's PR only.
Move fast and
NO CARRIER
So their actual deployment process is quite rigorous and should have a tight blast radius. After lots of emulated and canary testing, their deployments are rolled out in phases over weeks. I don't see how a bad push could have done what happened yesterday.
I found a paper that describes the process in detail. See pages 10-11:
https://web.archive.org/web/20211005034928/https://research....
Phase Specification
P1 Small number of RSWs in a random DC
P2 Small number of RSWs (> P1) in another random DC
P3 Small fraction of switches in all tiers in DC serving web traffic
P4 10% of switches across DCs (to account for site differences)
P5 20% of switches across DCs
P6 Global push to all switches
We classify upgrades in two classes: disruptive and non-disruptive, depending on whether the upgrade affects existing forwarding state on the switch. Most upgrades in the data center are non-disruptive (performance optimizations, integration with other systems, etc.). To minimize routing instabilities during non-disruptive upgrades, we use BGP graceful restart (GR) [8]. When a switch is being upgraded, GR ensures that its peers do not delete existing routes for a period of time during which the switch’s BGP agent/config is upgraded. The switch then comes up, re-establishes the sessions with its peers and re-advertises routes. Since the upgrade is non-disruptive, the peers’ forwarding state is unchanged.
Without GR, the peers would think the switch is down, and withdraw routes through that switch, only to re-advertise them when the switch comes back up after the upgrade. Disruptive upgrades (e.g., changes in policy affecting existing switch forwarding state) would trigger new advertisements/withdrawals to switches, and BGP re-convergence would occur subsequently. During this period, production traffic could be dropped or take longer paths causing increased latencies. Thus, if the binary or configuration change is disruptive, we drain (§3) and upgrade the device without impacting production traffic. Draining a device entails moving production traffic away from the device and reducing effective capacity in the network. Thus, we pool disruptive changes and upgrade the drained device at once instead of draining the device for each individual upgrade.
Push Phases. Our push plan comprises six phases P1-P6 performed sequentially to apply the upgrades to agent/config in production gradually.
We describe the specification of the 6 phases in Table 4. In each phase, the push engine randomly selects a certain number of switches based on the phase’s specification. After selection, the push engine upgrades these switches and restarts BGP on these switches. Our 6 push phases are to progressively increase scope of deployment with the last phase being the global push to all switches. P1-P5 can be construed as extensive testing phases: P1 and P2 modify a small number of rack switches to start the push. P3 is our first major deployment phase to all tiers in the topology.
We choose a single data center which serves web traffic because our web applications have provisions such as load balancing to mitigate failures. Thus, failures in P3 have less impact to our services. To assess if our upgrade is safe in more diverse settings, P4 and P5 upgrade a significant fraction of our switches across different data center regions which serve different kinds of traffic workloads. Even if catastrophic outages occur during P4 or P5, we would still be able to achieve high performance connectivity due to the in-built redundancy in the network topology and our backup path policies—switches running the stable BGP agent/config would re-converge quickly to reduce impact of the outage. Finally, in P6, we upgrade the rest of the switches in all data centers.
Figure 7 shows the timeline of push releases over a 12 month period. We achieved 9 successful pushes of our BGP agent to production. On average, each push takes 2-3 weeks.
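For intuition, the push engine's phase logic could be modeled roughly like the Python below; the cumulative fractions, the upgrade callback, and the health check are invented, since the paper only gives the qualitative specifications in Table 4:

    import random

    # Toy model of the P1-P6 phased push described above: each phase upgrades a
    # progressively larger random sample of not-yet-upgraded switches, and the
    # push halts if the fleet looks unhealthy. Fractions and checks are made up.
    PHASES = [("P1", 0.001), ("P2", 0.005), ("P3", 0.02),
              ("P4", 0.10), ("P5", 0.20), ("P6", 1.00)]


    def run_push(switches, upgrade, fleet_healthy):
        upgraded = set()
        for phase, cumulative_fraction in PHASES:
            target = max(1, int(len(switches) * cumulative_fraction))
            candidates = [s for s in switches if s not in upgraded]
            batch_size = min(max(target - len(upgraded), 0), len(candidates))
            for switch in random.sample(candidates, batch_size):
                upgrade(switch)  # disruptive changes would be drained first
                upgraded.add(switch)
            if not fleet_healthy():
                raise RuntimeError(f"push halted after {phase}; roll back")
        return upgraded

The idea is that a bad agent or config should trip the health check somewhere in P1-P5, long before the global P6 push, which is why yesterday's blast radius is so surprising.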
"Post hoc ergo propter hoc"
If your business "relies on Facebook," it's already fucked. You should see this as a wake-up call to GTFO.
“We also have no evidence that user data was compromised as a result of this downtime.”
Well that happened already. No worries.
who. cares.
May the outage be longer. And may Mark be removed as its snakehead.
Facebook just can't admit it went like this https://www.youtube.com/watch?v=uRGljemfwUE
Remember, remember, the 4th of October.
Yes. It is true. If you enter Facebook into Facebook. It will break the internet.
Could you please be more clear about "no evidence that user data was compromised"?
Do you think DLT/blockchain could make this less likely to happen again in the future?
Reading this statement all I can think of is this scene https://www.youtube.com/watch?v=15HTd4Um1m4
One of the things they restored was annoying sounds in the app every time I tap anything. Who knew that was DNS related!
TL;DR
We YOLO'd our BGP experiment to prod. It failed.
https://web.archive.org/web/20210626191032/https://engineeri...
I thought DARPA designed the internet to survive nuclear war - no single point of failure - clearly Facebook's network breaks that rule. They need a DNS of last resort that doesn't update fast.
Such BS. FB imagines that it is its own Internet, yet it failed in the most miserable way because it needs the actual Internet to communicate.
The best course of action is to split FB into separate companies. It is already neatly divided between Instagram, WhatsApp, and legacy Facebook. That would be the best move for the government to avoid disruptions like this.
> We also have no evidence that user data was compromised as a result of this downtime.
I am not sure why they had to mention this specifically. This makes it sound like an external attack.
The Facebook mafia has painfully admitted that they know they are the internet, farming the data of an entire civilisation; further evidence that this deep integration of their services needs to be broken up.
After all the scandals, leaks, whistleblowers etc it would take more than a DNS record wipe to take down the Facebook mafia.
It just occurred to me to wonder if Facebook has a Twitter account and if they used it to update people about the outage. It turns out they do, and they did, which makes sense. Boy, it must have been galling to have to use a competing communication network to tell people that your network is down.
It looks like Zuckerberg doesn't have a personal Twitter though, nor does Jack Dorsey have a public Facebook page (or they're set not to show up in search).