March 4, 2013
Massive Server Crash Knocks Out CloudFlare, Leaving 785K Websites Down
Michael Harper for redOrbit.com — Your Universe Online
All those looking to share pictures and Internet memes or check out top government secrets yesterday were left wanting after the website accelerator CloudFlare took a brief nosedive. After a change to CloudFlare’s routers was rolled out early Sunday morning, the routers crashed, causing an outage of nearly an hour for the 785,000 websites the company serves.
Though the company constantly monitors its routers to avoid such an outage, this crash made it necessary to reboot the routers manually. This meant that staff had to access each of the company’s 23 global data centers.
“This is a completely unacceptable event to us,” explained co-founder and CEO Matthew Prince in a phone interview with TechCrunch. “In our four years of life, this is our third significant outage.”
CloudFlare offers services to protect websites and users by placing an extra layer of digital defense between visitors and websites. Its service can also cache sites, meaning visitors can have their web content loaded more quickly. The service also protects websites from the Distributed Denial of Service (DDoS) attacks which are so popular amongst hacktivists and other cyber scoundrels. While the service is helpful and popular, CloudFlare’s customers are also at its mercy. As seen this weekend, when CloudFlare goes down, so too do its customers.
“At around 9:47 UTC (1:47 AM in California), a change got pushed out. It caused the edge routers in our network to crash,” said Prince, explaining the outage to Romain Dillet of TechCrunch.
“I don’t want to throw the routers’ vendor under the bus, but it caused them to crash. If you sent a packet to one of our IP addresses, you would get back a response that there was no router.”
Prince said the first of the downed routers began coming online within 30 minutes, but it took about an hour to return to normal traffic levels.
“CloudFlare’s ops and network teams were aware of the incident immediately because of both internal and external monitors we run on our network,” said Prince, giving further details in a blog post. Ironically, this blog post explaining the downtime is inaccessible at the time of this writing.
“While it wasn’t initially clear the reason the routers had crashed, it was clear that it was an issue caused by an inability for packets to find a route to our network. We were able to access some routers and see that they were crashing when they encountered this bad rule. We removed the rule and then called the network operations teams in the data centers where our routers were unresponsive to ask them to physically access the routers and perform a hard reboot.”
According to TechCrunch, CloudFlare is so popular and delivers so many page views a month that, were it an actual website, it would be the tenth largest in the world. The company has boasted in the past that it is responsible for delivering 70 billion page views a month to some 600 million unique visitors.
4chan and WikiLeaks were but two of the 785,000 sites affected by the downed routers.
Prince also made brief mention of compensation to customers who had their websites knocked offline as a result of this router issue, saying CloudFlare will “definitely be honoring our paying customers.”