A Race Condition That Started in DynamoDB
Nearly 15 Hours That Brought Down EC2, Networking, Security, and Customer Call Centers

From 11:48 PM PDT on October 19, 2025 to 2:20 PM on October 20, 2025, a large-scale service outage hit the AWS Northern Virginia region (us-east-1). The impact fell into three areas:

(1) DynamoDB outage (10/19 11:48 PM to 10/20 2:40 AM). The DNS for the DynamoDB endpoint was incorrectly configured, so API calls failed and new connections could not be established. This did not take down DynamoDB alone; it simultaneously destabilized the many internal AWS services that depend on DynamoDB.

(2) EC2 new-instance failures (10/20 2:25 AM to 10:36 AM, with follow-on impact until 1:50 PM). Existing instances stayed up, but new EC2 instances failed to launch or came up without their network attached. This blocked scale-out, auto-healing, and new deployment rollouts.

(3) NLB (Network Load Balancer) connection errors (10/20 5:30 AM to 2:09 PM). Load balancers incorrectly assessed healthy instances as "unhealthy", which disrupted traffic routing; some applications became completely unreachable from outside.

Affected at the same time: Lambda, ECS/EKS/Fargate, STS (Security Token Service), Redshift, Amazon Connect (call centers), IAM login, and other core services.

Root cause: a DNS race condition. A DNS configuration change during a DynamoDB maintenance window created a race condition, in which two processes compete to modify the same resource at the same time. One code path correctly updated DNS; another code path incorrectly reverted it. The outcome was non-deterministic: sometimes DNS worked, sometimes it did not. This intermittent failure was harder to diagnose than a complete failure because automated monitoring showed partial success.

Cascade failure mechanism: the DynamoDB failure propagated because internal AWS services use DynamoDB as a distributed coordination and state storage system. Once DynamoDB became unreliable, the services that depend on it for health checks, configuration, and state could not operate correctly, and failures cascaded across seemingly unrelated services.
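
The root-cause pattern described above (one code path applies an update, another reverts it from a stale view) can be illustrated with a small sketch. This is not AWS's DNS automation; `DnsRecord`, `RecordStore`, and the endpoint names are hypothetical, and the interleaving is scripted rather than threaded so the result is reproducible. It contrasts an unchecked "last writer wins" store with a versioned compare-and-set write, one common way to reject stale writers.

```python
from dataclasses import dataclass


@dataclass
class DnsRecord:
    """Hypothetical DNS record carrying a version that grows on every write."""
    target: str
    version: int


class RecordStore:
    """Toy in-memory store standing in for whatever holds the live DNS state."""

    def __init__(self, target: str) -> None:
        self.record = DnsRecord(target=target, version=1)

    def read(self) -> DnsRecord:
        # Return a copy so the caller holds a snapshot, not a live reference.
        return DnsRecord(self.record.target, self.record.version)

    def write_unchecked(self, target: str) -> None:
        # Last writer wins, no matter how old the writer's view is.
        self.record = DnsRecord(target, self.record.version + 1)

    def write_if_version(self, target: str, expected_version: int) -> bool:
        # Compare-and-set: apply only if nobody has written since the snapshot.
        if self.record.version != expected_version:
            return False
        self.record = DnsRecord(target, self.record.version + 1)
        return True


def run_unchecked() -> str:
    store = RecordStore("old-endpoint")
    stale = store.read()                   # slow path takes its snapshot
    store.write_unchecked("new-endpoint")  # fast path applies the update
    store.write_unchecked(stale.target)    # slow path reverts it from stale data
    return store.record.target


def run_checked() -> str:
    store = RecordStore("old-endpoint")
    stale = store.read()                   # slow path snapshot (version 1)
    fresh = store.read()                   # fast path snapshot (version 1)
    store.write_if_version("new-endpoint", fresh.version)  # accepted
    store.write_if_version(stale.target, stale.version)    # rejected as stale
    return store.record.target


if __name__ == "__main__":
    print(run_unchecked())  # old-endpoint
    print(run_checked())    # new-endpoint
```

In the unchecked run the correct update is silently undone; in the checked run the stale writer is refused and has to re-read before trying again. In a real system the order of the two writers depends on timing, which is why the failure shows up only intermittently.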

Lessons:

(1) DNS changes require race condition analysis, even for "simple" configuration updates.

(2) Internal service dependency graphs must be maintained and tested.

(3) Cascading failure modes need explicit circuit breakers that stop one service's failure from propagating across the system.

(4) Degraded-mode operation (what the system does when DynamoDB is unreliable) has to be designed in advance; failure prevention alone is not enough.

(5) Recovery speed matters as much as preventing outages. The roughly 15-hour duration was driven partly by the difficulty of diagnosing intermittent DNS failures in the middle of a cascade.
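
Lessons (3) and (4) can be sketched together as a minimal circuit breaker with a degraded-mode fallback. Everything here is illustrative: `CircuitBreaker`, `read_state`, `cached`, the thresholds, and the injected clock are assumptions made for the example, not part of any AWS service or SDK. After repeated failures the breaker stops calling the unreliable dependency and serves a fallback value, then allows a trial call once a cooldown has elapsed.

```python
import time
from typing import Callable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after repeated failures; allows a trial call after a cooldown."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the cooldown has elapsed, let one trial call through (half-open).
        return self.clock() - self.opened_at < self.reset_after

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()          # degraded mode: skip the dependency
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        self.failures = 0              # success closes the breaker again
        self.opened_at = None
        return result


def demo() -> Tuple[List[str], int]:
    now = [0.0]            # fake clock so the example does not sleep
    healthy = [False]
    primary_calls = [0]

    def read_state() -> str:
        # Hypothetical read from a state store that is currently unreliable.
        primary_calls[0] += 1
        if not healthy[0]:
            raise ConnectionError("state store unreachable")
        return "live-config"

    def cached() -> str:
        # Degraded mode: serve the last known good value.
        return "cached-config"

    breaker = CircuitBreaker(max_failures=2, reset_after=10.0,
                             clock=lambda: now[0])
    results = [breaker.call(read_state, cached) for _ in range(4)]
    now[0] = 10.0          # cooldown elapses and the dependency recovers
    healthy[0] = True
    results.append(breaker.call(read_state, cached))
    return results, primary_calls[0]


if __name__ == "__main__":
    print(demo())
```

In the demo, the first two calls hit the failing dependency and fall back, the next two are short-circuited without touching it, and the fifth call succeeds after the cooldown, so the dependency is called three times for five requests. Whether a cached value is an acceptable degraded answer depends on the service; the point of the sketch is that the fallback path is chosen deliberately in advance.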