> Rogers staff relied on the company’s own mobile and Internet services for connectivity to communicate among themselves. When both the wireless and wireline networks failed, Rogers staff, especially critical incident management staff, were not able to communicate effectively during the early hours of the outage. Rogers had to send Subscriber Identity Module (SIM) cards from other mobile network operators to its remote sites to enable its staff with wireless connectivity to communicate with each other
Ooops. That's actually quite funny.
In case you didn't know, Rogers is one of Canada's "big 3" telecom providers. The 2022 outage basically crippled our economy for a couple of days (most ATMs and Interac didn't work).
The "lessons learned" are all very basic. Things like "separate the network management layer from the data network", "provide the network operation center with backup connectivity", etc.
This is networking 101. Heck, this is engineering 101. The real question is how a network provider as large as Rogers managed to be so poorly engineered in the first place.
Outages are inevitable, but the Rogers outage in 2022 had some devastating consequences.
Route leaks can happen to anyone, but the fact that it brought down their entire network, including voice and internet services, across all provinces, was unacceptable.
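For context on why filtering matters: the whole point of a route filter is that you only accept and redistribute prefixes you expect, so one bad update can't flood every table downstream. A toy sketch of the idea (Python, made-up prefixes, nothing like a real BGP implementation):

    import ipaddress

    # Prefixes we expect to learn from a given peer (a "prefix list").
    ALLOWED = [ipaddress.ip_network(p) for p in ("192.0.2.0/24", "198.51.100.0/24")]

    def accept(announced: str) -> bool:
        """Accept an announced prefix only if it sits inside the allow list."""
        net = ipaddress.ip_network(announced)
        return any(net.subnet_of(allowed) for allowed in ALLOWED)

    print(accept("192.0.2.128/25"))  # True  - an expected, more-specific route
    print(accept("10.0.0.0/8"))      # False - filtered out
    # Remove the filter (accept everything) and every leaked or bogus route
    # gets installed downstream -- the blast radius becomes your whole table.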
What's even more concerning is that they had no out-of-band access, which meant no management access to their network. This explains why the outage lasted a whopping 24 hours.
In my opinion, the lack of OOB was the most critical failure and also the most preventable. Proper OOB is a must; I wouldn't operate a network without it, and I don't understand why Rogers thought that was acceptable.
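To make "proper OOB" concrete: at a minimum, every core device should be reachable over a management path that shares no fate with the production network (console servers on another carrier's LTE, a separate circuit, whatever), plus something that continuously verifies that path actually works. A rough sketch of that check (Python; hostnames and ports are made up):

    import socket

    # Hypothetical out-of-band console servers, reached over a path that does
    # NOT depend on the production network (e.g. another carrier's LTE).
    OOB_CONSOLES = {
        "core-router-1": ("oob-con1.example.net", 22),
        "core-router-2": ("oob-con2.example.net", 22),
    }

    def oob_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
        """Return True if we can open a TCP connection to the OOB console."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    # Run this from a machine that itself sits outside the production network,
    # otherwise you're only testing the thing you're supposed to be independent of.
    for device, (host, port) in OOB_CONSOLES.items():
        status = "ok" if oob_reachable(host, port) else "UNREACHABLE"
        print(f"{device}: OOB console {host}:{port} {status}")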
In my opinion, from what I'm reading, there is a root cause analysis in there under the heading "Reliability of Rogers network architecture". I've removed the redundant and contradictory parts from the quote below.
> [...] both the wireless and wireline networks sharing a common IP core network, the scope of the outage was extreme in that it resulted in a catastrophic loss of all services. [...] It is a design choice by [...] Rogers, that seeks to balance cost with performance.
Based on this write-up and the details I've gathered, I believe the root cause, the fundamental reason for the failure, is an incorrect cost-to-performance balance struck at the senior management level.
I have heard the deployment process at Rogers is something along the lines of: 1) get approval for your changes, 2) ssh into a server, 3) make a backup copy of the current jar file, 4) copy the new one over, 5) run some command to refresh.
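If that's accurate, it's barely a process at all; the rumoured steps 3-5 fit in a few lines of script (hypothetical host, paths and service name, obviously):

    import subprocess
    from datetime import datetime

    # Hypothetical host, paths and service name -- just the rumoured manual
    # process written down, not anything Rogers is actually known to run.
    HOST = "app-server-01.example.net"
    JAR = "/opt/app/service.jar"
    NEW_JAR = "./build/service.jar"

    def run(*cmd: str) -> None:
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")

    run("ssh", HOST, f"cp {JAR} {JAR}.bak.{stamp}")         # 3) back up the current jar
    run("scp", NEW_JAR, f"{HOST}:{JAR}")                    # 4) copy the new one over
    run("ssh", HOST, "sudo systemctl restart app.service")  # 5) "refresh"
    # No canary, no health check, no automated rollback anywhere in this.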
Can we for a second appreciate how freaking cool it is that a government agency published this?
Hilarious. Rogers runs the internet and phone lines, so when it went down, the devs couldn't:
1) Remote into the boxes to see what's happening.
2) Talk to other devs, because their phones are on the Rogers network.
My stress couldn't.
Kye Prigg left Rogers shortly thereafter and ended up at Lumen (the merger of CenturyLink and Level3), which is another company facing some issues: https://ir.lumen.com/news/news-details/2023/Lumen-Strengthen... "Kye most recently served as Senior Vice President of Access Networks and Operations for Rogers Communications."
> The July 2022 outage was not the result of a design flaw in the Rogers core network architecture.
Talk about some grade-A gaslighting here. Reading the post-mortem, they first tell you it wasn't a design flaw, then say they routed all their data through one common core network (with no separate management network). Then they say they are going to fix things by separating out the wireless and wired traffic.
Why would you fix things if it wasn't a design flaw?
Out-of-band access is like resilient architecture 101. Hell, even homelabs generally have some way to do it. It's appalling that Rogers didn't have a way to access the core IP routers out of band. Yes, it might mean having to use a competitor's infrastructure, but they ended up having to do that anyway. And with the failure of the service, all the infrastructure providers are now under additional scrutiny. Rogers should be striking agreements with other providers to carry core traffic in case of an outage, exactly this kind of DR situation. For example, Visa, MC, and Amex all have agreements in place to process each other's auth data in case the other party goes down, the thinking being that an outage for credit cards makes everyone look bad.
For fellow Canadians, I’m quite happy using Beanfield in Toronto. Cheap and decent. Their backbone is probably still Bell or Rogers, but at least I pay a reasonable $60 for 2 Gbps down.
Note that this failure didn’t prevent Rogers from buying Shaw and securing an even larger position in the national telecom system. In BC, for instance, there are only two mobile networks now (Telus and Rogers). That places an extremely high burden on two lazy oligopolists to maintain exceptional reliability.
It’s a shame on Canada and on Canadians that foreign competitors are still prohibited from coming into the market.