AWS outage disrupts Facebook, Snapchat and Amazon before recovery begins

Amazon Web Services (AWS) confirmed its systems were recovering after a major outage that disrupted dozens of websites and applications, including Facebook, Snapchat, Amazon, Coinbase and Robinhood, according to The Wall Street Journal, which was itself among several media organisations affected.

The outage, which began around 3:00 AM Eastern Time in the AWS US-EAST-1 region centred on Northern Virginia, affected major retailers, airlines, social media applications, financial services companies and productivity tools. Sites including Slack, United Airlines, the AI tool Perplexity and the video games Fortnite and Roblox experienced disruptions.

AWS traced the problem to DynamoDB, its managed database service. AWS as a whole provides websites with database storage and computing power and has more than one million customers across the retail, financial services, media and entertainment sectors, with clients including Disney+, Zoom, Airbnb, Lyft, Dropbox and Nike.

At 2:01 AM Pacific Daylight Time (5:01 AM Eastern), the company identified the root cause as a DNS resolution issue affecting the DynamoDB API endpoint in US-EAST-1. AWS said it was “working on multiple parallel paths to accelerate recovery” and that the issue was also affecting other services in the region.
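
For readers who want to see what a DNS resolution failure looks like from a client's side, the following is a minimal sketch rather than anything AWS published. The helper name `resolve_endpoint` and the injectable `resolver` parameter are illustrative assumptions; the hostname in the usage note is the public regional DynamoDB endpoint.

```python
import socket


def resolve_endpoint(hostname, port=443, resolver=socket.getaddrinfo):
    """Return the sorted IP addresses a hostname resolves to, or [] on failure.

    `resolver` is a hypothetical injection point so the lookup can be
    replaced in tests; by default it is the standard library resolver.
    """
    try:
        results = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name could not be resolved: the client has no address to
        # connect to, so every request to this endpoint fails.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address.
    return sorted({entry[4][0] for entry in results})
```

Calling `resolve_endpoint("dynamodb.us-east-1.amazonaws.com")` returns a list of addresses when resolution works and an empty list when it does not, which is the condition a client would see during a DNS failure of this kind.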

Early signs of recovery

Engineers applied initial mitigations at 2:22 AM PDT with early signs of recovery appearing for some impacted services. By 3:35 AM PDT, AWS confirmed “the underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now.”

However, requests to launch new EC2 instances, and services that launch EC2 instances, such as ECS, continued to experience increased error rates. AWS recommended that customers launch EC2 instances without targeting a specific Availability Zone and configure Auto Scaling Groups to use multiple zones.
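
As a rough illustration of that recommendation, the sketch below builds request parameters that could be passed to the EC2 `RunInstances` and Auto Scaling `CreateAutoScalingGroup` operations, for example through an SDK client. The helper names, the example identifiers and the assumption that each subnet sits in a different Availability Zone are hypothetical; the parameter keys follow the public API names. It makes no AWS calls itself.

```python
def build_run_instances_params(image_id, instance_type, count=1):
    """Build RunInstances parameters that do not pin an Availability Zone."""
    # No "Placement" or "SubnetId" key is set, so the zone is left to EC2
    # (this assumes the account has a default VPC to launch into).
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
    }


def build_asg_params(name, launch_template_name, subnet_ids, min_size, max_size):
    """Build CreateAutoScalingGroup parameters spanning several subnets."""
    unique_subnets = list(dict.fromkeys(subnet_ids))  # de-duplicate, keep order
    if len(unique_subnets) < 2:
        # Assumption: each subnet is in a different zone, so at least two
        # subnets are needed for the group to span multiple zones.
        raise ValueError("provide subnets in at least two Availability Zones")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        "MinSize": min_size,
        "MaxSize": max_size,
        # The API expects a comma-separated string of subnet IDs.
        "VPCZoneIdentifier": ",".join(unique_subnets),
    }
```

With a group spread over several zones, a launch failure in one zone need not prevent capacity from being added in another, which is the reasoning behind the guidance.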

Some services, including CloudTrail and Lambda, continued working through backlogs of events after the initial recovery. AWS reported elevated polling delays for Lambda Event Source Mappings for SQS, which affected features that depend on Lambda’s SQS polling, including AWS Organizations policy updates.

Global services and features relying on US-EAST-1 endpoints, including IAM updates and DynamoDB Global Tables, also experienced issues during the outage before recovering at 3:03 AM PDT.

AWS infrastructure underpins millions of websites and platforms, providing cloud computing services such as servers and storage to major companies globally. AWS is the largest cloud computing provider in the United States.