January 2, 2013
Netflix Blames Amazon For Outage, Retailer Explains Hiccup
Michel Harper for redOrbit.com — Your Universe Online
It was an outage made sinister by its timing. Similar outages have also topped the headlines of many a technology-themed Web site, yet the fact that Netflix became inaccessible on Christmas Eve of all days will likely be remembered for years to come.
When Netflix finally gave an official statement concerning the outage, they were not shy in pointing the finger of blame towards Amazon's web services, or AWS. Many other popular web services, such as Foursquare or Instagram, use AWS and have suffered their own outages due to mistakes or hiccups on their Amazon-owned back-end.
Now, Amazon is busy explaining how such an outage occurred. Being able to stream thousands of movies to a television or tablet is a feat only the future could bring us, made possible by countless advanced technological processes. Yet, as it turns out, such a service can still be brought down by just one human who made just one unfortunate mistake.
“We want to apologize,” wrote a representative for the AWS team in a recent note. “We know how critical our services are to our customers' businesses, and we know this disruption came at an inopportune time for some of our customers.”
According to the AWS team, the problem occurred with the Amazon Elastic Load Balancing (ELB) service, and while they claim this outage only affected those services which make use of the ELB, they have said that the services affected by the outage felt the impact for a “prolonged period of time.”
Many Netflix users, but not all, began to notice some issues connecting to the service during the afternoon of Christmas Eve. Not only were these users unable to demo the potential of their new smartphones and tablets to their family, they were also unable to use Netflix as a curtain with which to cover their unease from spending time with their family.
According to AWS, the afternoon of Christmas Eve is right about the time one developer accidentally deleted a portion of the ELB state data.
“The data was deleted by a maintenance process that was inadvertently run against the production ELB state data,” explains the AWS team. “This process was run by one of a very small number of developers who have access to this production environment. Unfortunately, the developer did not realize the mistake at the time.”
Once they noticed that something had gone wrong, the AWS team set about restoring service as quickly as possible.
Since they were still unaware that the problem lay with the ELB state data, the team first looked for issues with the APIs.
“The team was puzzled as many APIs were succeeding (customers were able to create and manage new load balancers but not manage existing load balancers) and others were failing. As this continued, some customers began to experience performance issues with their running load balancers. These issues only occurred after the ELB control plane attempted to make changes to a running load balancer.”
Had the AWS team noticed the missing ELB data in the beginning, they would have been able to restore service much more quickly. However, as noted by the team themselves, they spent several hours focused on APIs. All told, it took the AWS team over 22 hours to restore operation to all affected load balancers.
The AWS team is now saying they've learned from their mistakes and have taken steps to ensure this sort of issue doesn't occur again in the future.
“We will do everything we can to learn from this event and use it to drive further improvement in the ELB service.”