What the Huge AWS Outage Reveals About the Internet


A massive cloud outage stemming from Amazon Web Services' key US-EAST-1 region, its hub near the United States capital in northern Virginia, caused widespread disruptions of websites and platforms around the world on Monday morning. Amazon's main e-commerce platform and other properties including Ring doorbells and the Alexa smart assistant suffered interruptions and outages throughout the morning, as did Meta's communication platform WhatsApp, OpenAI's ChatGPT, PayPal's Venmo payment platform, multiple web services from Epic Games, multiple British government sites, and many others.

The outages stemmed from Amazon's “DynamoDB” database application programming interfaces in US-EAST-1, and AWS said in status updates that the problem was specifically related to DNS resolution issues. The “Domain Name System” is a foundational internet service that essentially acts as an automatic phonebook lookup to translate web URLs like “www.wired.com” into numeric server IP addresses so web browsers show users the right content. DNS “resolution” issues occur when DNS servers aren't accurately connecting these dots and, to keep with the phonebook analogy, are providing the wrong numbers for a given name, or vice versa.
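To make the phonebook analogy concrete, here is a minimal Python sketch of a DNS lookup using the standard library's `socket.getaddrinfo`. The helper name `resolve_ipv4` and its injectable `lookup` parameter are illustrative assumptions, not anything AWS or any affected service is known to use. An empty result stands in for the kind of resolution failure described above.

```python
import socket
from typing import Callable, List


def resolve_ipv4(hostname: str, lookup: Callable = socket.getaddrinfo) -> List[str]:
    """Return the sorted, de-duplicated IPv4 addresses for a hostname.

    `lookup` defaults to the system resolver; it is a parameter only so the
    function can be exercised without network access (an assumption made
    for this sketch).
    """
    try:
        # Each record is (family, type, proto, canonname, sockaddr);
        # for IPv4, sockaddr is an (address, port) pair.
        records = lookup(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        # The name could not be resolved: the "phonebook" had no usable entry.
        return []
    return sorted({sockaddr[0] for *_, sockaddr in records})


# Example (requires network access): resolve_ipv4("www.wired.com")
```

A client that gets an empty list, or a wrong address, has no way to reach the service even if the servers behind that name are healthy.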

“Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1,” AWS wrote in status updates on Monday. Shortly after, the company added: “If you are still experiencing an issue resolving the DynamoDB service endpoints in US-EAST-1, we recommend flushing your DNS caches.”
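Why would flushing a cache help? Resolvers and clients typically remember DNS answers for a while, so a bad answer can outlive the fault that produced it. The toy cache below is a rough sketch of that behavior under simplifying assumptions: the `DnsCache` class, its fixed time-to-live, and the hostname in the usage note are all hypothetical, and real operating-system and resolver caches work differently in detail.

```python
import time
from typing import Callable, Dict, List, Tuple


class DnsCache:
    """Toy in-process DNS cache; illustrative only, not a real resolver."""

    def __init__(
        self,
        lookup: Callable[[str], List[str]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup      # function that performs the real lookup
        self._ttl = ttl_seconds    # how long an answer is remembered
        self._clock = clock        # injectable so the sketch is testable
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def resolve(self, hostname: str) -> List[str]:
        now = self._clock()
        cached = self._entries.get(hostname)
        if cached is not None and cached[0] > now:
            # Entry has not expired yet: reuse it, even if it is now wrong.
            return cached[1]
        addresses = self._lookup(hostname)
        self._entries[hostname] = (now + self._ttl, addresses)
        return addresses

    def flush(self) -> None:
        """Forget every cached answer so the next resolve() asks again."""
        self._entries.clear()
```

In this sketch, if a lookup for a hypothetical name such as `db.example` returns a bad answer during an incident, `resolve` keeps returning it until the entry expires or `flush` is called, which is the situation the AWS guidance addresses.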

An AWS spokesperson did not immediately respond when asked for details about the nature of the failure. DNS resolution issues can be malicious—known as DNS hijacking—but there is no indication that Monday's AWS outages were nefarious.

“When the system couldn't correctly resolve which server to connect to, cascading failures took down services across the internet,” says Davi Ottenheimer, a longtime security operations and compliance manager and a vice president at the data infrastructure company Inrupt. “Today's AWS outage is a classic availability problem, and we need to start seeing it more as data integrity failure.”

Problems began around 3 am ET. By 5:22 am ET AWS had applied “initial mitigations” that were starting to take effect. At 6:35 am ET, Amazon said that it had fully addressed the underlying technical issues but that “some services will have a backlog of work to work through, which may take additional time to fully process.”
