FAA Outage Highlights Need For Modernization

The turmoil from a software glitch at the Federal Aviation Administration (FAA) on Tuesday, which caused widespread flight disruptions, shed light on the agency’s antiquated system and its urgent need for modernization.

Through its use of practices that would be considered inadequate in other critical sectors, the FAA allowed its systems to be vulnerable to the glitch, which occurred as new software was loaded at a flight plan distribution center in Atlanta.

Since the agency depends on just two such centers, one each in Salt Lake City and Atlanta, to manage flight plans for the country, the software glitch all but shut down the entire system. And although the Salt Lake center remained operational and served as a backup, it became overloaded, resulting in more than 600 flights being delayed throughout the eastern United States.
 
A failure at the same facility in June 2007 also caused significant flight delays across the East Coast.

Glitches such as the one that occurred this week can often be avoided with sufficient system redundancy, meaning alternate systems and communication channels are in place to handle the workload should one system fail. 

In fact, proper redundancy is so critical for utility companies that those found insufficiently prepared can be penalized with daily fines of up to thousands of dollars, and $1 million per day if they are found deliberately negligent.

“In the industries I work in, if you have something that critical, you generally build more redundancy,” Jason Larsen, a security researcher with consultancy IOActive Inc., told the Associated Press. 

Larsen previously spent five years at the Idaho National Laboratory monitoring the control systems of electrical plants.

“If this (FAA outage) happened at a power plant, I’d be telling them to open up their checkbook and expect to be fined.”

Tammy Jones, an FAA spokeswoman, emphasized that these types of issues “don’t happen on a mass scale or a regular basis,” pointing out that the agency manages 50,000 to 60,000 daily flights and that flying on U.S. airlines has never been safer.

“The system is working.”

“We are making sure people are getting from one place to another,” she said.

Basil Barimo, vice president of operations and safety for the Air Transport Association of America, said the basic problem is the FAA’s dependence on older technology, such as a radar-based control system designed in the 1940s and ’50s. But he is nevertheless optimistic that the agency’s NextGen modernization program will make more efficient use of the nation’s airspace while safely allowing more planes to fly. The program includes a $15 billion upgrade to satellite-based technology that will take nearly two decades to complete.

The National Airspace Data Interchange Network computer, located at the Atlanta facility where this week’s glitch occurred, has been owned and operated by the FAA since the 1980s, after the Netherlands-based firm that developed it went out of business. The network is being upgraded with additional memory and faster data processing, and to be more “fault-tolerant.”

“We should see significant improvements by the end of September…which should prevent the type of problem we had on Tuesday,” said FAA spokeswoman Laura Brown.

The FAA is also looking at installing a third backup system at a technology center in New Jersey, but final decisions have yet to be made, she said.

National Air Traffic Controllers Association spokesman Doug Church claims the FAA has tried to focus on future technology to distract from its deficiency in maintaining current systems. Church said the FAA lacks a “safety net of redundancy,” and cited the agency’s “fix-on-fail” policy of addressing an issue only after it has become a problem.

To Church’s point, in December the agency exempted its computer maintenance staff from having to perform some periodic certification checks as mandated by government handbooks for technical equipment. 

The FAA defended its decision, saying it would eliminate needless certifications that had little or no effect on safety or system performance. A 2006 Government Accountability Office (GAO) report supported the practice in some instances. However, industry experts say they often advise against such an approach.

“It’s common, you see it in retail too – it’s the whole ‘don’t fix it if it ain’t broke’ thing,” Branden Williams, director of a unit of VeriSign Inc. that reviews the security of retailers’ payment systems, told the AP.

“It’s unfortunate because it’s very reactive, and it typically winds up costing you more. If you do fix-on-fail, it usually costs you more.”

However, an outage at a private company that may delay a retail order is much different from one at the Federal Aviation Administration. And outages such as Tuesday’s have happened multiple times at the FAA.

For instance, earlier this month communications between an unknown number of planes and a Memphis, TN, air traffic control center that directs planes passing within a 250-mile radius of the city were disrupted after a car hit a utility pole and cut a fiber-optic cable.

And last fall, the same center lost all its communications, requiring some air traffic controllers to use their personal cell phones to route planes out of the area. The FAA said the outage was a result of a failure of one of AT&T’s major communications links.

In May, the FAA system that distributes preflight notices to pilots about equipment, runway and security issues shut down for about a day when a server failed and the backup was ineffective.  Although the database was unable to issue updates or new notices, pilots continued to receive information from local air traffic controllers and through alternate systems.

Referring to this week’s outage, Paul Proctor, a Gartner Inc. analyst focused on security and regulatory compliance for large companies, said it seemed the FAA didn’t install the flight-plan systems with the same amount of redundancy as big companies generally have in their critical systems.

“You need to do a good analysis about whether this is acceptable risk,” Proctor told the AP.

“One of the things the government is betting on is the fact that if there’s…a failure, it’s not a safety issue.”

Sid McGuirk, associate professor and coordinator of the air traffic management program at Embry-Riddle Aeronautical University, and a former air traffic controller and FAA manager for 35 years, said the agency has maintained a good balance given its budget constraints.

“It keeps the system running efficiently without compromising safety,” he said.

“From time to time, we are going to have a glitch, but it’s a tradeoff.”

“Would I like to see more modern equipment in the system? Sure. But most folks would not want to see their taxes tripled to pay for new technology every two years.”