Tuesday, December 07, 2021

Amazon Web Services outage disrupts dozens of sites



AWS outage: The Amazon Web Services logo displayed on a phone screen and a laptop keyboard are seen in this illustration photo taken in Krakow, Poland on Dec. 1, 2021. A major AWS outage disrupted access to numerous popular sites for several hours on Tuesday, Dec. 7, 2021, including Prime, Disney+, Netflix and Ring, among others.

(Jakub Porzycki/NurPhoto via Getty Images)

December 07, 2021 at 5:15 pm PST

By Kelli Dugan, Cox Media Group National Content Desk

A major Amazon Web Services outage disrupted access to numerous popular sites for several hours on Tuesday, including Prime, Disney+, Netflix and Ring, among others.

Some services began coming back online sporadically just before 5 p.m. EST, following the more than six-hour outage.

Update 8:15 p.m. EST Dec. 7: Amazon Web Services issued a notice just before 8 p.m. Tuesday indicating that its “network device issues have been resolved” and that the company is working to recover “any impaired services.”

Update 6:47 p.m. EST Dec. 7: In a statement provided to CNBC, Amazon spokesperson Richard Rocha confirmed that the outage impacted the company’s warehouse and delivery operations, noting that officials are “working to resolve the issue as quickly as possible.”

The company did not immediately specify how many warehouses and delivery stations were affected by the outage.

According to CNBC, a notice sent to delivery drivers via Amazon Chime, an internal chat app, stated that the company was “currently monitoring a network-wide technical outage” impacting delivery operations.

“Should drivers be unable to continue delivering due to the outage, go to a nearby safe location and stand by,” the message continued.

Meanwhile, Samuel Caceres, an Amazon driver in Washington state, told the network that his delivery facility had been “at a standstill” since 8 a.m. PST and that drivers and warehouse workers had been on standby since then.

Update 6:35 p.m. EST Dec. 7: Doug Madory, director of internet analysis at Kentik Inc, a network intelligence firm, confirmed to The Associated Press that the issue arose midmorning on the U.S. East Coast at AWS’ largest data center.

Both Atlanta-based Delta Air Lines and Dallas-based Southwest Airlines reported AWS-related interruptions, with the latter switching to West Coast servers as a workaround and avoiding major disruptions to flights.

According to the AP, airlines American, United, Alaska and JetBlue were unaffected by the outage, but Downdetector indicated services such as Instacart, Venmo, Kindle and Roku, as well as the McDonald’s app, were not as fortunate.

“More and more these outages end up being the product of automation and centralization of administration,” Madory told the AP, adding, “This ends up leading to outages that are hard to completely avoid due to operational complexity but are very impactful when they happen.”

Meanwhile, Kentik recorded a 26% drop in traffic to Netflix, among other major web-based services affected by the outage, he said.

In an emailed response to questions from the AP, the U.S. Cybersecurity and Infrastructure Security Agency stated that it was working with Amazon “to understand any potential impacts this outage may have for federal agencies or other partners.”

Update 6:09 p.m. EST Dec. 7: By 6 p.m. the company reported “significant recovery” from the outage but continued to “closely monitor the health” of the affected network devices. Amazon did not disclose any additional details about the cause and did not provide a timeline for full recovery of services.

AWS provides cloud computing services to myriad governments, universities and private companies.

The news came too late, however, for many travelers temporarily stranded by the major outage.

Original report: A notice on Amazon Web Services’ status page identified the suspected root of the issue as a problem with its application programming interface, or API, as well as with the AWS Management Console, CNBC reported.

The issues impacted AWS’ main US-East-1 region hosted in Northern Virginia, meaning not all users experienced interruptions, the company confirmed.

The company later stated that an increase in traffic between specific internal services was causing network congestion between those devices, and that it was working to resolve the bottleneck.

According to Reuters, outage tracker Downdetector showed more than 24,000 incidents of people reporting issues with Amazon.

Other affected sites included, but were not limited to, Prime Video, messaging service Slack, mobile banking app Chime, robot vacuum cleaner maker iRobot, stock trading app Robinhood and Coinbase, the largest cryptocurrency exchange in the U.S., as well as numerous in-house Amazon Warehouse tools, such as the Flex and AtoZ apps, CNBC reported.

According to the network, the warehouse tool outage made it impossible to scan packages or access delivery routes.

In July, Amazon experienced a disruption in its online stores’ service that affected more than 38,000 users in only about two hours. Meanwhile, users have endured 27 Amazon-related outages during the past 12 months, Reuters reported, citing web tool reviewing website ToolTester.

-- The Associated Press contributed to this report.

Furious customers blast Amazon as an outage knocks Ring doorbells, baby monitors, and Alexa products offline

A Ring doorbell in 2019. Chip Somodevilla/Getty Images


Amazon Ring users on social media are furious they cannot access home monitoring services.
Amazon's AWS cloud servers were down for most of the day Tuesday as the company investigated an outage.
"How the hell are we supposed to disarm our alarms. Or monitor anything at all," one user tweeted.

Amazon Ring users online are fuming as they lost access to home monitoring services during the company's major outage.

Amazon's web-hosting subsidiary, Amazon Web Services, suffered a major outage on Tuesday, impacting a number of services that rely on the company's servers.

The outage extended to popular Amazon services, like the Ring smart home system and Alexa speakers.

"Can't listen to Amazon Music. My Ring Doorbell doesn't work. Can't control my lights with Alexa," one user tweeted.

Ring users said issues with the home monitoring service have left them unable to disable alarms, monitor children, and watch out for intruders.

"I'm unable to access any of my cameras, is there a nationwide outage? I'm literally relying on my cameras to keep me safe because I'm on bed rest," one user said in a tweet.

"How the hell are we supposed to disarm our alarms. Or monitor anything at all," another tweeted.

The outage began around 11:30 a.m. ET. As of about 2 p.m. ET, Amazon said its technical teams had identified the root cause of the outage and were working on a solution, according to its website. Amazon added that some services had begun experiencing partial recovery.

"Most pathetic service," one Twitter user said. "How can a doorbell or a security camera be down for several hours, and still no signs of resumption!"

"Eek, just in time for my newly-in-beds triplet toddlers' nap," another user said. "Now I have no idea what they're doing up there."

Many Ring doorbell users said they were concerned their packages would get stolen during the systems outage.

"What's going on ring? App down, can't login, can't see what's going on outside my door," a user tweeted. "I got packages coming and I'm not about to get caught lacking."

"Great the one day we had a package taken from our porch...seriously," tweeted another Ring customer. "Hopefully the camera caught it."

"Do we have an update? [My Ring] has been down for a few hours now and I'm expecting packages. Some of which need to be signed for today," another Twitter user said.

"We are aware of a service interruption impacting Ring," the company said in a statement. "We apologize for the inconvenience and appreciate your patience and understanding."

Amazon was not immediately available for additional comment.

Problems With AWS Network Devices Caused Widespread Cloud Outage

BY RICH MILLER - DECEMBER 7, 2021 


Amazon Web Services data centers in Loudoun County, Virginia. (Photo: Rich Miller)

Problems with several network devices in Northern Virginia caused a major outage at Amazon Web Services, with the ripples spreading across the Internet to interrupt service for many popular web services that run their infrastructure on the AWS cloud.

The lengthy outage highlighted the essential role played by cloud platforms like AWS, which support the web operations of at least 1 million enterprise customers. The problems at AWS were blamed for performance issues at Netflix, Disney+, Ring, Ticketmaster, Venmo, Roku, Fidelity Investments, Hootsuite, and many others. The outage interrupted online finals for students using the Canvas Learning Management platform, and even deliveries at Amazon warehouses, as the outage impacted apps required to scan packages and plan delivery routes.

The AWS outage was focused on US-East-1, a service region based in Northern Virginia which houses the largest concentration of Amazon data center infrastructure. The problems began at around 12:30 p.m. Eastern, when users began to experience problems accessing AWS services. Approximately 5 hours later, at 5:47 p.m., AWS reported that it had “mitigated the underlying issue” and services were beginning to be restored.

“The root cause of this issue is an impairment of several network devices in the US-EAST-1 Region,” AWS said on its status page. As of 7:30 p.m. Eastern, AWS said the network device issues had been resolved, and it was “now working towards recovery of any impaired services.”

Large-scale IT service outages can be expensive. A 2021 survey from The Uptime Institute found that data center outages cost companies an average of $100,000 per incident, with about a third of respondents citing costs of $1 million or more.

The stakes could be even higher for Amazon Web Services, which is the largest cloud computing platform. AWS had revenue of $16.4 billion in the third quarter of 2021, which works out to about $7.4 million per hour. Cloud workloads running outside the US-East-1 region apparently were unaffected, but an outage lasting more than six hours in the largest cloud region would add up quickly, although such “losses” at service providers are often accounted for through customer credits.
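For readers who want to check the per-hour figure, here is a rough back-of-the-envelope sketch in Python. It assumes revenue is spread evenly across every hour of the quarter and across all regions, which is a simplification; an outage in one region would not translate into lost revenue this directly. The function names are illustrative only.

```python
# Back-of-the-envelope check of the per-hour revenue figure.
# Assumption: revenue accrues evenly over every hour of the quarter.

Q3_2021_REVENUE_USD = 16.4e9       # AWS third-quarter 2021 revenue, as reported above
DAYS_IN_Q3 = 31 + 31 + 30          # July, August, September
HOURS_IN_Q3 = DAYS_IN_Q3 * 24      # 2,208 hours


def revenue_per_hour(quarterly_revenue=Q3_2021_REVENUE_USD, hours=HOURS_IN_Q3):
    """Average revenue per hour over the quarter."""
    return quarterly_revenue / hours


def revenue_during_outage(outage_hours, quarterly_revenue=Q3_2021_REVENUE_USD):
    """Average revenue that accrues over a window of `outage_hours` hours."""
    return revenue_per_hour(quarterly_revenue) * outage_hours


if __name__ == "__main__":
    print(f"Per hour:      ${revenue_per_hour() / 1e6:.1f} million")
    print(f"Over 6 hours:  ${revenue_during_outage(6) / 1e6:.1f} million")
```

On these assumptions, a six-hour window corresponds to roughly $44.6 million in average platform-wide revenue, which is an upper bound of sorts rather than an estimate of actual losses.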

Why Networks Are So Important

The rise of cloud computing underscores the importance of networks and how they are configured. Networking and software issues are surpassing power outages as the most common causes of data center downtime, according to 2021 outage data from Uptime Institute. This trend reflects the growing role of cloud computing and SaaS (software as a service) applications, which often use architectures that can route around physical failures of electrical components like UPS systems, transfer switches and generators.

When Amazon Web Services experiences reliability problems, they often involve US-East-1, which is not surprising because it is the largest AWS region and also the oldest, as Amazon has had data centers in Virginia since 2004. AWS has spent $35 billion on its cloud computing infrastructure in Northern Virginia over the past 10 years, and operates about 50 data centers in the region. It’s the largest single concentration of corporate data centers on earth, positioned near a strategic Internet intersection in Ashburn, which serves as a global crossroads for data traffic.

Network problems are complicated by the highly automated nature of cloud platforms. These data traffic flows are designed to be large and fast and work without human intervention – which makes them hard to tame when humans intervene. Some of the largest outages impacting cloud platforms and social networks have been tied to network problems.

Some examples:
On October 5, a configuration error broke Facebook’s connection to a key network backbone, disconnecting all of its data centers from the Internet and leaving its DNS servers unreachable, the company said.

A lengthy 2019 Google outage was caused by unusual network congestion in its operations in the Eastern U.S. In an incident report, Google said that YouTube measured a 10 percent drop in global views during the incident, while Google Cloud Storage measured a 30 percent reduction in traffic.

Resiliency is Still A Challenge

At DCF we have often noted how cloud computing is bringing change to how companies approach uptime, introducing architectures that create resiliency using software and network connectivity (See “Rethinking Redundancy”). This strategy, pioneered by cloud providers, is creating new ways of designing applications. Data center uptime has historically been achieved through layers of redundant electrical infrastructure, including uninterruptible power supply (UPS) systems and emergency backup generators.

Cloud providers like Google have been leaders in creating failover scenarios that shift workloads across data centers, spreading applications and backup systems across multiple data centers, and using sophisticated software to detect outages and redirect data traffic to route around hardware failures and utility power outages.

Amazon Web Services has been a pioneer in this effort by popularizing the use of availability zones (AZs), clusters of data centers within a region that allow customers to run instances of an application in several isolated locations to avoid a single point of failure. These architectures enable sophisticated approaches to failover and backup of applications. But even a distributed uptime plan can break down if the network fails, breaking the flow of data across cloud infrastructure.
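The failover idea described in this section can be illustrated with a minimal Python sketch. It is not how AWS or any particular customer implements failover; the function name `pick_region`, the ordered preference list, and the injected health check are all hypothetical, and the region identifiers are used only as example labels. The sketch simply walks a list of candidate locations and returns the first one whose health check passes.

```python
from typing import Callable, Iterable, Optional


def pick_region(regions: Iterable[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first region whose health check passes, or None if none do.

    `is_healthy` is a caller-supplied probe (hypothetical here); a real one
    might issue a request to a status endpoint with a short timeout.
    """
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A probe that raises is treated the same as an unhealthy region.
            continue
    return None


if __name__ == "__main__":
    # Simulate the primary region being impaired.
    impaired = {"us-east-1"}
    preferred_order = ["us-east-1", "us-west-2"]
    chosen = pick_region(preferred_order, lambda r: r not in impaired)
    print(chosen)  # prints "us-west-2"
```

As the paragraph above notes, a scheme like this only helps if the health check and the traffic redirection do not themselves depend on the part of the network that has failed.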
