April 30, 2011

Amazon Apologizes For Server Disaster

Amazon issued a 6,000-word technical statement on Friday apologizing for the outage that occurred a week earlier, when its web hosting service, Elastic Compute Cloud or EC2, crashed and knocked a number of companies, including Foursquare, Reddit and Quora, offline, some for several days.

"We want to apologize," the statement says.

Amazon has acknowledged that some of its customers lost data as a result of the technical problems, and has offered a 10-day credit to customers whose websites were affected.

"We know how critical our services are to our customers' businesses and we will do everything we can to learn from this event and use it to drive improvement across our services," Amazon says.

The Seattle-based company did not disclose how much the credit would cost it.

Best known as an online retailer, Amazon is also a major provider of cloud-computing services, renting out space on its powerful servers to customers around the world.

Although Amazon Web Services (AWS) accounts for only a few percent of Amazon's total revenue, the company sees high potential for the service, which rents out computer time by the hour.

The outage at the data center near Dulles Airport, just outside Washington, was a major setback for EC2, with Amazon still trying to restore some of the computers that were brought down on April 21.

"As with any complicated operational issue, this one was caused by several root causes interacting with one another," Amazon wrote.

According to Amazon, human error set off the outage, causing an automated error-recovery mechanism to spin out of control -- many computers became "stuck" in recovery mode.

AWS tried to upgrade capacity in one storage section, or "availability zone," of its regional network in Northern Virginia on April 21. These availability zones exist in each region, with information spread across several zones in an effort to protect against data loss or downtime, reports CNN Money.

Instead of redirecting the traffic within its primary network, which was required for the upgrade, Amazon accidentally sent it to a backup network. This secondary network was not designed to handle the flood of traffic, which overwhelmed it and cut a large number of storage nodes off from the network.

"The traffic shift was executed incorrectly," Amazon said.

A failsafe was triggered when Amazon fixed the traffic flow, sending the storage volumes searching all at once for a place to back up their data. That kicked off a "re-mirroring storm" that filled up all available storage space.

The storage volumes got "stuck" when they couldn't find any way to back themselves up. About 13% of the availability zone's volumes were stuck at the peak of the problem.
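
The mechanism described above can be illustrated with a toy model. This is only a sketch, not Amazon's actual recovery logic: the function name and all of the input figures (1,000 volumes, 300 that lost their mirror, 170 free slots) are hypothetical, chosen so that the stuck share matches the 13% reported for the peak of the problem.

```python
def simulate_remirror_storm(total_volumes, volumes_lost_mirror, free_slots):
    """Toy model: each volume that lost its mirror claims one free slot.

    Slots are handed out first come, first served. Returns the number of
    stuck volumes and their share of all volumes in the zone.
    """
    stuck = 0
    for _ in range(volumes_lost_mirror):
        if free_slots > 0:
            free_slots -= 1  # volume finds space and re-mirrors
        else:
            stuck += 1       # no space left: volume stays "stuck"
    return stuck, stuck / total_volumes


# Hypothetical figures, not taken from Amazon's statement.
stuck, fraction = simulate_remirror_storm(1000, 300, 170)
print(f"{stuck} stuck volumes ({fraction:.0%} of the zone)")
```

The point of the sketch is that once spare capacity runs out, every remaining volume that needs a new mirror has nowhere to go, no matter how often it retries.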

AP reports that the service is set up in a way that's supposed to provide redundancy, by letting computers in a different "availability zone" take over when one fails.

According to Amazon, customers that were properly set up to run their computing tasks over multiple zones were largely unaffected. However, the error made it difficult to switch zones on the fly. The company is making changes to prevent this error from recurring.
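
The idea of running across multiple zones can be sketched in a few lines. This is a simplified illustration under stated assumptions, not AWS's API: the zone names, the health map and the `pick_zone` helper are all hypothetical.

```python
def pick_zone(preferred_zones, healthy):
    """Return the first zone, in preference order, that is marked healthy.

    Returns None when no zone is available.
    """
    for zone in preferred_zones:
        if healthy.get(zone, False):
            return zone
    return None


# Hypothetical zone names and health status.
zones = ["zone-a", "zone-b", "zone-c"]
health = {"zone-a": False, "zone-b": True, "zone-c": True}
chosen = pick_zone(zones, health)
print(chosen)
```

A customer whose workload runs in only one zone has no fallback in this model, which is consistent with the report that multi-zone customers were largely unaffected.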

"The trigger for this event was a network configuration change... We will audit our change process and increase the automation to prevent this mistake from happening in the future," Amazon states.

The company promised to be more forthcoming in the future.

"In addition to the technical insights and improvements that will result from this event, we also identified improvements that need to be made in our customer communications," it continued.

Knowing about and repairing those weaknesses will make EC2 even stronger, Amazon says. Several fixes and adjustments have already been made by the company, with plans to deploy additional ones over the next few weeks.

The mistake presented "many opportunities to protect the service against any similar event reoccurring," Amazon says.

