> This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.
I remember the first time I realized that the client retry logic we had implemented was making our lives way worse. Not sure if it's heartening or disheartening that this was part of the issue here.
Our mistake was resetting the exponential backoff delay whenever a client successfully connected and received a response. At the time, some but not all responses were degraded and extremely slow, while the request that checked the connection was not. So a client would time out, retry for a while backing off exponentially, eventually reconnect successfully, and then after the next failure start aggressively retrying again. System dynamics are hard.
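For anyone curious, a minimal sketch of the fix we ended up with (hypothetical names, not our actual client): only reset the backoff after a run of consecutive successes, never after a single lucky response, and add jitter so clients don't retry in lockstep.

```python
import random

class Backoff:
    """Exponential backoff with full jitter that only resets after
    several consecutive successes, not after a single lucky one."""

    def __init__(self, base=0.5, cap=60.0, reset_after=5):
        self.base = base                # initial delay in seconds
        self.cap = cap                  # maximum delay
        self.reset_after = reset_after  # consecutive successes needed to reset
        self.failures = 0
        self.successes = 0

    def on_failure(self):
        self.failures += 1
        self.successes = 0

    def on_success(self):
        # The buggy version did `self.failures = 0` right here, so one
        # degraded-but-successful response put the client straight back
        # into aggressive-retry mode.
        self.successes += 1
        if self.successes >= self.reset_after:
            self.failures = 0

    def next_delay(self):
        if self.failures == 0:
            return 0.0
        exp = min(self.cap, self.base * (2 ** (self.failures - 1)))
        return random.uniform(0, exp)   # full jitter spreads out the herd
```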
> Customers accessing Amazon S3 and DynamoDB were not impacted by this event.
We saw plenty of S3 errors during that period. Kind of undermines the credibility of this report.
"Amazon Secure Token Service (STS) experienced elevated latencies"
I was getting 503 "service unavailable" from STS during the outage most of the time I tried calling it.
I guess by "elevated latency" they mean latency as seen by anyone with retry logic that kept trying through many consecutive failed attempts?
> This congestion immediately impacted the availability of real-time monitoring data for our internal operations teams, which impaired their ability to find the source of congestion and resolve it.
Disruption of the standard incident response mechanism seems to be a common element of longer-lasting incidents.
I wish it contained actual detail and wasn’t couched in generalities.
Does anyone know how often an AZ experiences an issue as compared to an entire region? AWS sells the redundancy of AZs pretty heavily, but it seems like a lot of the issues that happen end up being region-wide. I'm struggling to understand whether I should be replicating our service across regions or whether the AZ redundancy within a region is sufficient.
I've been running platform teams on AWS for 10 years now, and working with AWS for 13. For anyone looking for guidance on how to avoid this, here's the advice I give the startups I advise.
First, if you can, avoid us-east-1. Yes, you’ll miss new features, but it’s also the least stable region.
Second, go multi-AZ for production workloads. The safety of your customers' data is your ethical responsibility. Protect it, back it up, and keep it as generally available as is reasonable.
Third, you're gonna go down when the cloud goes down. Not much use getting overly bent out of shape. You can reduce your exposure by just using their core systems (EC2, S3, SQS, LBs, CloudFront, RDS, ElastiCache). The more systems you use, the less reliable things will be. However, running your own key-value store, API gateway, event bus, etc., can also be way less reliable than using theirs. So, realize it's an operational trade-off.
Degradation of your app / platform is more likely to come from you than from AWS. You're gonna roll out bad code, break your own infra, and overload your own system way more often than Amazon is gonna go down. If reliability matters to you, start by examining your own practices before thinking about things like multi-region or super-durable, highly replicated systems.
This stuff is hard. It's hard for Amazon engineers. Hard for platform folks at small and mega companies. It's just hard. When your app goes down, and so does Disney Plus, take some solace that Disney, with all their buckets of cash, also couldn't avoid the issue.
And, finally, hold cloud providers accountable. If they're unstable and not providing the service you expect, leave. We've got tons of great options these days, especially if you don't care about proprietary solutions.
Good luck y’all!
> The AWS container services, including Fargate, ECS and EKS, experienced increased API error rates and latencies during the event. While existing container instances (tasks or pods) continued to operate normally during the event, if a container instance was terminated or experienced a failure, it could not be restarted because of the impact to the EC2 control plane APIs described above.
This seems pretty obviously false to me. My company has several EKS clusters in us-east-1 with most of our workloads running on Fargate. All of our Fargate pods were killed and were unable to be restarted during this event.
Still doesn't explain the cause of all the IAM permission-denied errors we saw against policies which are now working fine again without any intervention.
Obviously networking issues can cause any number of symptoms, but to me it seems like an unusual detail to leave out. Unless it was another ongoing outage happening at the same time.
There are a lot of comments in here that boil down to "could you do infrastructure better?"
No, absolutely not. That's why I'm on AWS.
But what we are all ACTUALLY complaining about is ongoing lack of transparent and honest communications during outages and, clearly, in their postmortems.
Honest communications? Yeah, I'm pretty sure I could do that much better than AWS.
Something they didn't mention is AWS Billing alarms. These rely on metrics systems that were affected by this (and are missing some data). Crucially, billing alarms only exist in the us-east-1 region, so if you're using them, you're impacted no matter where your infrastructure is deployed. (That's just my reading of it.)
> Customers accessing Amazon S3 and DynamoDB were not impacted by this event. However, access to Amazon S3 buckets and DynamoDB tables via VPC Endpoints was impaired during this event.
What does this even mean? I bet most people use DynamoDB from inside a VPC, in a Lambda or on EC2.
I am not a fan of AWS due to their substantial market share on cloud computing. But as a software engineer I do appreciate their ability to provide fast turnarounds on root cause analyses and make them public.
I am grateful to AWS for this report.
Not sure if any AWS support staff are monitoring this thread, but the article said:
> Customers also experienced login failures to the AWS Console in the impacted region during the event.
All our AWS instances / resources are in EU/UK availability zones, and yet we couldn't access our console either.
Thankfully none of our instances were affected by the outage, but our inability to access the console was quite worrying.
Any idea why this was the case?
Any suggestions to mitigate this risk in the event of a future outage would be appreciated.
I wonder if they could've designed better circuit breakers for situations like this. They're very common in electrical engineering, but I don't think they're as common in software design. It's something we should try to design in deliberately, precisely for situations like this.
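For what it's worth, the software version of the pattern is pretty simple to sketch (toy example with made-up thresholds, nothing AWS-specific): trip open after repeated failures, fail fast during a cooldown, then let a single trial call through before closing again.

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: closed -> open after N failures,
    half-open after a cooldown, closed again on a successful trial."""

    def __init__(self, failure_threshold=5, cooldown=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open; failing fast")
            # cooldown elapsed: half-open, allow one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.opened_at = None
            return result
```

The hard part at AWS scale isn't the breaker itself; it's deciding what counts as "failure" and where to put the breakers so they don't trip the whole system on a transient blip.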
Their service board is always as green as you have to be to trust it
>At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network.
Just curious, is this scaling an AWS job or a client job? Looks like an AWS one from the context. I'm wondering if they are deploying additional data centers or something else?
Cue the armchair infrastructure engineers.
The reality is that there’s a handful of people in the world that can operate systems at this sheer scale and complexity and I have mad respect for those in that camp.
Between this and Log4j, I'm just glad it's Friday.
My company uses AWS. We had significant degradation for many of their APIs for over six hours, having a substantive impact on our business. The entire time their outage board was solid green. We were in touch with their support people and knew it was bad but were under NDA not to discuss it with anyone.
Of course problems and outages are going to happen, but saying they have five nines (99.999) uptime as measured by their "green board" is meaningless. During the event they were late and reluctant to report it and its significance. My point is that they are wrongly incentivized to keep the board green at all costs.
The problem is that I have to defend our own infrastructure's real availability numbers against the cloud's fictional "five nines". It's a losing game.
Did this outage only impact the us-east-1 region? I think I saw other regions mentioned as affected in some HN comments, but this summary doesn't suggest more than one region was impacted.
"At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network. This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries."
So was this in service to something like DynamoDB or some other service?
As in, did some of those extra services that AWS offers for lock-in (and that undermine open source projects with embrace-and-extend) bomb the mainline EC2 service?
Because this kind of smacks of the "Microsoft hidden APIs" that Office got to use against competitors. Does AWS use "special hardware capabilities" to compete against other companies offering roughly the same service?
Idea: network devices should be configured to automatically prioritize the same packet flows for the same clients as they served yesterday.
So many overload issues seem to be caused by a single client, in cases where the right prioritization or rate-limit rule could have contained the outage, but such a rule either wasn't in place or wasn't the right one, because it's hard to know how to prioritize hundreds of clients.
Using more bandwidth or making more requests than yesterday would then be handled as capacity allows, possibly with a manually configured priority list, cap, or ratio. But "what I used yesterday" should always be served first. That way, any outage is contained to the clients acting differently from yesterday, even if the config isn't perfect (see the sketch below).
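Rough sketch of what that could look like (purely illustrative; `yesterday_bytes` is an assumed per-client aggregate, e.g. from the previous day's flow logs):

```python
from collections import defaultdict

class BaselineScheduler:
    """Sketch: traffic up to each client's usage from yesterday is
    served at high priority; anything above that baseline is
    best-effort and only competes for leftover capacity."""

    def __init__(self, yesterday_bytes):
        self.baseline = dict(yesterday_bytes)   # client -> bytes served yesterday
        self.used_today = defaultdict(int)

    def classify(self, client, packet_bytes):
        used = self.used_today[client]
        self.used_today[client] = used + packet_bytes
        if used + packet_bytes <= self.baseline.get(client, 0):
            return "high"         # within yesterday's footprint
        return "best_effort"      # new or expanded behavior waits its turn
```

That way a client that suddenly changes behavior can degrade only its own new traffic, not everyone else's steady state.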
My favorite sentence: "Our networking clients have well tested request back-off behaviors that are designed to allow our systems to recover from these sorts of congestion events, but, a latent issue prevented these clients from adequately backing off during this event."
The complexity that AWS has to deal with is astounding. Sure, having your main production network and a separate management network is common. But making sure all of it scales and one doesn't bring down the other is what I think they are dealing with here.
It must have been crazy hard to troubleshoot when you are flying blind because all your monitoring is unresponsive. Clearly more isolation, with well-delineated information exchange points, is needed.
Hm. This post does not seem to acknowledge what I saw. Multiple hours of rate-limiting kicking in when trying to talk to S3 (eu-west-1). After the incident everything works fine without any remediations done on our end.
Broadcast storm. Never easy to isolate; as a matter of fact, it's nightmarish...
"Our networking clients have well tested request back-off behaviors that are designed to allow our systems to recover from these sorts of congestion events, but, a latent issue prevented these clients from adequately backing off during this event. "
That is an interesting way to phrase that. A 'well-tested' method, but 'latent issues'. That would imply the 'well-tested' part was not as well-tested as it needed to be. I guess 'latent issue' is the new 'bug'.
Obviously one hopes these things don’t happen, but that’s an impressive and transparent write up that came out quickly.
I'm glad they published something, and so quickly. Ultimately these guys are running a business. There are other market alternatives, multibillion-dollar contracts at play, SLAs, etc. It's not as simple as people think.
> This congestion immediately impacted the availability of real-time monitoring data for our internal operations teams
I guess this is why it took ages for the status page to update. They didn't know which things to turn red.
"At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network. "
Very detailed.
Has anyone been credited by AWS for violations of their SLAs?
Most rate limiter systems just drop invalid requests, which isn't optimal as I see it.
A better way would be to have two queues: one for valid messages and one for invalid messages (rough sketch below).
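Purely illustrative, and the names are made up, but something like this: drain the valid queue first, keep the invalid/suspect queue bounded, and only serve it with leftover capacity.

```python
from collections import deque

class TwoQueueLimiter:
    """Sketch of the two-queue idea: requests judged valid go to a
    priority queue, suspect/invalid ones go to a second, bounded queue
    that is only drained with leftover capacity."""

    def __init__(self, capacity_per_tick=100, max_backlog=1000):
        self.valid = deque(maxlen=max_backlog)
        self.invalid = deque(maxlen=max_backlog)  # bounded, so stale entries fall off
        self.capacity_per_tick = capacity_per_tick

    def submit(self, request, is_valid):
        (self.valid if is_valid else self.invalid).append(request)

    def drain(self):
        served = []
        budget = self.capacity_per_tick
        while budget and self.valid:
            served.append(self.valid.popleft())
            budget -= 1
        while budget and self.invalid:   # leftover capacity only
            served.append(self.invalid.popleft())
            budget -= 1
        return served
```

The important part is that both queues stay bounded; unbounded queuing during a congestion event just trades dropped requests for ever-growing latency.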
Noob question, but why does network infrastructure need DNS? Why don't the full IPv6 addresses of the various components suffice to do business?
A packet storm outage? Now that brings back memories. Last time I saw that it was rendezvous misbehaving.
Umm... just one thing: S3 was not available for at least 20 minutes.
"impact" occurs 27 times on this page.
What was wrong with "affect"?
In a nutshell: thundering herd.
A "service event"?!
House of cards
Exceeded the character limit on the title so I couldn't include this detail there, but this is the post-mortem of the event on December 7, 2021.
> Our networking clients have well tested request back-off behaviors that are designed to allow our systems to recover from these sorts of congestion events, but, a latent issue prevented these clients from adequately backing off during this event.
Sentences like this are confusing. If they are well-tested, wouldn't this issue have been covered?
Stop using AWS. I can't wait till Amazon is hit so hard every day that they can't retain customers.
I was alive; however, because I could not breathe, I died. Bob was fine himself, but someone shot him, so he is dead (but remember, Bob was fine). What a joke.
DNS?
Of course it was DNS.
It is always* DNS.
Yeah, the CloudWatch APIs went down the drain. Good for them for publishing this, at least.
"... the networking congestion impaired our Service Health Dashboard tooling from appropriately failing over to our standby region. By 8:22 AM PST, we were successfully updating the Service Health Dashboard."
Sounds like they lost the ability to update the dashboard. HN comments at the time were theorizing it wasn't being updated due to bad policies (need CEO approval) etc. Didn't even occur to me that it might be stuck in green mode.
Having an internal network like this, which everything on the main AWS network depends on so heavily, is just bad design. One does not build a stable high-tech spacecraft and then fuel it with coal.
> Operators instead relied on logs to understand what was happening and initially identified elevated internal DNS errors. Because internal DNS is foundational for all services and this traffic was believed to be contributing to the congestion, the teams focused on moving the internal DNS traffic away from the congested network paths. At 9:28 AM PST, the team completed this work and DNS resolution errors fully recovered.
Having DNS problems sounds a lot like the Facebook outage of 2021-10-04. https://en.wikipedia.org/wiki/2021_Facebook_outage
Complex systems are really, really hard. I'm not a big fan of seeing all these folks bash AWS for this without really understanding the complexity or nastiness of situations like this. Running the kind of services they do, for the kind of customers they have, this is a VERY hard problem.
We ran into a very similar issue, but at the database layer, in our company literally two weeks ago: connections to our MySQL exploded and completely took down our data tier, causing a multi-hour outage compounded by retries and thundering herds. Understanding this kind of problem under a stressful scenario is extremely difficult and a harrowing experience. Anticipating this kind of issue is very, very tricky.
Naive responses to this include "better testing", "we should be able to do this", "why is there no observability", etc. The problem isn't testing. Complex systems behave in complex ways, and it's difficult to model and predict them, especially when the inputs to the system aren't entirely under your control. Individual components are easy to understand, but once they're integrated, things get out of whack. I can't stress how difficult it is to model or even think about these systems; they're very, very hard. Combined with this knowledge being distributed among many people, you're dealing not only with distributed systems but also with distributed people, which makes it even harder to wrap your head around.
Outrage is the easy response. Empathy and learning is the valuable one. Hugs to the AWS team, and good learnings for everyone.