Bug Repellant For Supercomputers
November 15, 2012

Finding Bugs In Supercomputers Just Became Easier

Michael Harper for redOrbit.com — Your Universe Online

For many computer users, finding a bug or an error in their system can be more than difficult; it can be a chore with no clear beginning or end. Personal computers of all types and platforms have become increasingly sophisticated in just the last five years, yet they cannot perform nearly as well as today's supercomputers. These shelves and racks of processors and other components run so many calculations that their speed is measured with a special unit: the petaflop, or one quadrillion floating-point calculations per second.

Finding and dealing with a “bug” in one of these systems makes the same process on a PC look like child's play, like looking for the red circle in a pile of blue squares.

Now, the team at Lawrence Livermore National Laboratory (LLNL) says it has finally found a tool to help discover and diagnose bugs in its supercomputer. Called the Stack Trace Analysis Tool (STAT), this sophisticated piece of software is said to be lightweight, scalable and capable of finding bugs in a system while it's running more than 1 million MPI processes.

STAT is now being used to find bugs on the IBM BlueGene/Q-based supercomputer Sequoia, which recently ranked number 2 in the Top 500 Supercomputers list.

Finding bugs while Sequoia was running fewer processes was difficult enough, but according to the STAT research team, the bugs became harder to find and manifested themselves in some very perplexing ways once the computer began running more processes. As the LLNL team scaled Sequoia up, they noticed software defects as well as application failures. They turned to STAT and have since been able to diagnose and solve many of these issues.

"STAT has been indispensable in this capacity, helping the multi-disciplined integration team keep pace with the aggressive system scale-up schedule," explained Greg Lee, computer scientist at LLNL.

"While testing a subsystem of BlueGene/Q, my test program consistently failed only when scaled to 1,179,648 MPI processes. Although the test program was simple, the sheer scale at which this program ran made debugging efforts highly challenging. But when I applied STAT, it quickly revealed that one particular rank process was consistently stuck in a system call," said Dong Ahn, a computer scientist in Livermore Computing. After discovering what was causing the bottleneck in the system, one system expert began to look at the specific core running this process and noticed the problem lay in the hardware itself. Ahn said replacing this piece of hardware solved the problem, and suddenly Sequoia was back up and churning through millions of processes once more.

"Putting this exercise into perspective, this error was due to a defect in a tiny hardware unit, the decrementor, of a single hardware thread out of a total of 4.7 million hardware threads. I felt it was like finding a needle in a haystack over a coffee break."

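The article does not explain how STAT works internally. As a rough illustration of the idea its name suggests, the Python sketch below groups processes by the stack trace each one reports and then surfaces the smallest group, which is how a single stuck rank could stand out among many. The function names, frame names and rank number are invented for this example and are not taken from STAT.

```python
from collections import defaultdict


def group_ranks_by_trace(traces):
    """Map each distinct stack trace to the sorted list of ranks that share it."""
    groups = defaultdict(list)
    for rank, frames in traces.items():
        # Tuples are hashable, so a whole call path can serve as a dictionary key.
        groups[tuple(frames)].append(rank)
    return {trace: sorted(ranks) for trace, ranks in groups.items()}


def smallest_groups(groups, limit=1):
    """Return the (trace, ranks) pairs shared by the fewest ranks, smallest first."""
    ordered = sorted(groups.items(), key=lambda item: (len(item[1]), item[1]))
    return ordered[:limit]


# Hypothetical data: 1,000 ranks wait in a barrier while one sits in a system call.
traces = {rank: ["main", "run_test", "MPI_Barrier"] for rank in range(1000)}
traces[417] = ["main", "run_test", "write_log", "sys_write"]

groups = group_ranks_by_trace(traces)
outliers = smallest_groups(groups)
for trace, ranks in outliers:
    print(f"{len(ranks)} rank(s) {ranks}: {' > '.join(trace)}")
```

Grouping first means a person only has to read one line per distinct call path rather than one per process, which matters far more at a million processes than at the thousand simulated here.
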
STAT will prove very helpful to the LLNL team. Having debugging software that can run in step with the computer will allow the team to actively monitor the system while still using it to run through its calculations. STAT has also been used on other supercomputer platforms, such as Linux and Cray. Cray is the platform used by Titan, the new number 1 supercomputer at Oak Ridge National Laboratory.