AWS Operational issue – Multiple services in us-east-1

  • My guess is this is all due to CloudWatch Logs PutLogEvents failures.

    By default, a Docker container configured with the awslogs driver runs in "blocking" mode: as the application writes logs, Docker buffers them and pushes them to CloudWatch Logs frequently. If logs are produced faster than the buffer can drain, writes to stdout/stderr block and the container freezes on the logging write call. So if PutLogEvents is failing, buffers are probably filling up and freezing containers. I assume much of AWS uses its own logging system internally, which could explain these large, intermittent failures.

    If you're okay with dropping logs when the buffer fills, add something like this to the container's logging options:

      "max-buffer-size": "25m"
      "mode": "non-blocking"

  • It seems to have cascaded from AWS Kinesis...

    [03:59 PM PDT] We can confirm increased error rates and latencies for Kinesis APIs within the US-EAST-1 Region. We have identified the root cause and are actively working to resolve the issue. As a result of this issue, other services, such as CloudWatch, are also experiencing increased error rates and delayed CloudWatch log delivery. We will continue to keep you updated as we make progress in resolving the issue.

    39 affected services listed:

    AWS Application Migration Service
    AWS Cloud9
    AWS CloudShell
    AWS CloudTrail
    AWS CodeBuild
    AWS DataSync
    AWS Elemental
    AWS Glue
    AWS IAM Identity Center
    AWS Identity and Access Management
    AWS IoT Analytics
    AWS IoT Device Defender
    AWS IoT Device Management
    AWS IoT Events
    AWS IoT SiteWise
    AWS IoT TwinMaker
    AWS License Manager
    AWS Organizations
    AWS Step Functions
    AWS Transfer Family
    Amazon API Gateway
    Amazon AppStream 2.0
    Amazon CloudSearch
    Amazon CloudWatch
    Amazon Connect
    Amazon EMR Serverless
    Amazon Elastic Container Service
    Amazon Kinesis Analytics
    Amazon Kinesis Data Streams
    Amazon Kinesis Firehose
    Amazon Location Service
    Amazon Managed Grafana
    Amazon Managed Service for Prometheus
    Amazon Managed Workflows for Apache Airflow
    Amazon OpenSearch Service
    Amazon Redshift
    Amazon Simple Queue Service
    Amazon Simple Storage Service
    Amazon WorkSpaces

  • This is a bigger deal than the 'degraded' label implies. SQS has basically ground to a halt for reads, which is causing massive slowdowns for us, and the logging issues are causing task timeouts.

  • The us-east-1 curse strikes again! Elastic Container Service is down for us completely.

  • This is just starting to affect us; it looks like SQS is the biggest loser right now.

  • Our accounting system, Xero, is down, and its status page references AWS. Related to this, I assume.

    https://status.xero.com/

  • Though AWS Storage Gateway isn't in the list of affected services, we are seeing an issue communicating with S3 via a Storage Gateway.

  • Managed CloudFormation StackSets aren’t showing up for me. I assume this is related to Organizations.