The first exascale supercomputer has a hardware failure every day

Frontier, the world’s most powerful supercomputer, is online but still far from operational. Its director has confirmed reports that it is experiencing a system failure every few hours, but insists that’s par for the course.

Frontier is in a class of its own. It has 9,408 HPE Cray EX235a nodes, each powered by a 64-core AMD Trento 7A53 Epyc CPU with 512 GB of DDR4 and four AMD Instinct MI250X GPU accelerators with 128 GB of HBM2e apiece. In total, the system has 602,112 CPU cores, 8,138,240 GPU cores, and 4.6 PB each of DDR4 and HBM2e.
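The CPU and memory totals follow directly from the per-node figures. A minimal sketch of that arithmetic in Python is below. The constant and function names are illustrative, not from any official source. Note that the 4.6 PB figure only matches if the capacities are read in binary units (1 PB as 1,024 × 1,024 GB), which is an assumption here; in decimal units the same totals come to about 4.8 PB.

```python
# Per-node figures quoted in the article.
NODES = 9_408
CPU_CORES_PER_NODE = 64
GPUS_PER_NODE = 4
DDR4_GB_PER_NODE = 512
HBM2E_GB_PER_GPU = 128

# Assumption: capacities are binary, so 1 PB here means 1024 * 1024 GB.
GB_PER_PB_BINARY = 1024 ** 2


def frontier_totals():
    """Return system-wide totals derived from the per-node figures."""
    gpus = NODES * GPUS_PER_NODE
    ddr4_gb = NODES * DDR4_GB_PER_NODE
    hbm2e_gb = gpus * HBM2E_GB_PER_GPU
    return {
        "cpu_cores": NODES * CPU_CORES_PER_NODE,
        "gpus": gpus,
        "ddr4_gb": ddr4_gb,
        "hbm2e_gb": hbm2e_gb,
        "ddr4_pb": ddr4_gb / GB_PER_PB_BINARY,
        "hbm2e_pb": hbm2e_gb / GB_PER_PB_BINARY,
    }


if __name__ == "__main__":
    for name, value in frontier_totals().items():
        print(f"{name}: {value:,.2f}" if isinstance(value, float) else f"{name}: {value:,}")
```

The DDR4 and HBM2e totals come out identical because each node pairs 512 GB of DDR4 with 4 × 128 GB of HBM2e.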

In May, Frontier joined the TOP500 as the first supercomputer to break the exascale barrier after it completed the HPL benchmark with a score of 1.102 exaflops. Since then, the Oak Ridge National Laboratory in Tennessee, which manages the supercomputer, has been readying it for scientific research scheduled to start in January.

However, there have been reports that the launch of Frontier could be delayed by excessive hardware failures. Seeking answers, InsideHPC interviewed Justin Whitt, the Program Director at Oak Ridge. In the interview, he confirmed that Frontier was experiencing daily system failures but asserted that this was inevitable in such a large system.

“Mean time between failure on a system this size is hours, it’s not days,” he said. “So you need to make sure you understand what those failures are and that there’s no patterns to those failures that you need to be concerned with.” Whitt added that going a day without a failure “would be outstanding.”
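Whitt's point about scale can be illustrated with a back-of-the-envelope model. The sketch below assumes that nodes fail independently at a constant rate, in which case the failure rates add up and the system-wide mean time between failures is the per-node figure divided by the node count. The five-year per-node figure is purely hypothetical, chosen for illustration, and is not a number reported for Frontier.

```python
HOURS_PER_YEAR = 8_760


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """System MTBF under the assumption of independent, constant-rate failures.

    With N identical nodes, the system failure rate is N times the node
    failure rate, so the system MTBF is the node MTBF divided by N.
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return node_mtbf_hours / node_count


if __name__ == "__main__":
    # Hypothetical: each node fails once every five years on average.
    node_mtbf = 5 * HOURS_PER_YEAR
    print(f"{system_mtbf_hours(node_mtbf, 9_408):.1f} hours between failures")
```

Under these assumptions, even nodes that individually run for years between failures would give a 9,408-node system a failure somewhere every few hours.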

There were rumors that the hardware problems were being caused by the new AMD Instinct MI250X, but Whitt dismissed them. The MI250X is AMD’s most powerful GPU accelerator, and the company sells it only to select partners. It has 220 CUs containing 14,080 cores clocked at 1700 MHz in a 500 W package.

“The issues span a lot of different categories, the GPUs are just one,” Whitt remarked. “It’s been a pretty good spread among common culprits of parts failures that have been a big part of it. I don’t think that at this point that we have a lot of concern over the AMD products,” he added.

“We’re dealing with a lot of the early-life kind of things we’ve seen with other machines that we’ve deployed, so it’s nothing too out of the ordinary.”

Whitt conceded that the unprecedented scale of Frontier had made fine-tuning it “a little bit harder” but said the team was still following the schedule set back in 2018-19 despite delays caused by the pandemic.