AWS Operational issue – Multiple services in us-east-1

  • My guess is this is all due to CloudWatch Logs PutLogEvents failures.

    By default, a Docker container configured with the awslogs driver runs in "blocking" mode: as the application writes logs, Docker buffers them and pushes them to CloudWatch Logs frequently. If logs are produced faster than the buffer can drain, writes to stdout/stderr block and the container freezes on the logging write call. So if PutLogEvents is failing, buffers are probably filling up and freezing containers. I assume much of AWS uses its own logging system internally, which could explain these large, intermittent failures.

    If you're okay with dropping logs when the buffer fills, add something like this to the container's logging options:

      "max-buffer-size": "25m"
      "mode": "non-blocking"

  • It seems to have cascaded from AWS Kinesis...

    [03:59 PM PDT] We can confirm increased error rates and latencies for Kinesis APIs within the US-EAST-1 Region. We have identified the root cause and are actively working to resolve the issue. As a result of this issue, other services, such as CloudWatch, are also experiencing increased error rates and delayed CloudWatch log delivery. We will continue to keep you updated as we make progress in resolving the issue.

    39 affected services listed:

    AWS Application Migration Service
    AWS Cloud9
    AWS CloudShell
    AWS CloudTrail
    AWS CodeBuild
    AWS DataSync
    AWS Elemental
    AWS Glue
    AWS IAM Identity Center
    AWS Identity and Access Management
    AWS IoT Analytics
    AWS IoT Device Defender
    AWS IoT Device Management
    AWS IoT Events
    AWS IoT SiteWise
    AWS IoT TwinMaker
    AWS License Manager
    AWS Organizations
    AWS Step Functions
    AWS Transfer Family
    Amazon API Gateway
    Amazon AppStream 2.0
    Amazon CloudSearch
    Amazon CloudWatch
    Amazon Connect
    Amazon EMR Serverless
    Amazon Elastic Container Service
    Amazon Kinesis Analytics
    Amazon Kinesis Data Streams
    Amazon Kinesis Firehose
    Amazon Location Service
    Amazon Managed Grafana
    Amazon Managed Service for Prometheus
    Amazon Managed Workflows for Apache Airflow
    Amazon OpenSearch Service
    Amazon Redshift
    Amazon Simple Queue Service
    Amazon Simple Storage Service
    Amazon WorkSpaces

  • This is a bigger deal than the 'degraded' label implies. SQS has basically ground to a halt for reads, which is causing massive slowdowns for us, and the logging issues are causing task timeouts.

  • The us-east-1 curse strikes again! Elastic Container Service is down for us completely.

  • This is just starting to affect us; it looks like SQS is the biggest loser right now.

  • Our accounting system, Xero, is down, and its status page references AWS. Related to this, I assume.

    https://status.xero.com/

  • Though AWS Storage Gateway isn't in the list of affected services, we are seeing an issue communicating with S3 via a Storage Gateway.

  • Managed CloudFormation StackSets aren’t showing up for me. I assume this is related to Organizations.