GPU Behavior on a Large HPC Cluster

Authors:
  • DeBardeleben, Nathan
  • Blanchard, Sean
  • Monroe, Laura
  • Romero, Phil
  • Grunau, Daryl
  • Idler, Craig
  • Wright, Cornwell
6th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, in conjunction with the 19th International European Conference on Parallel and Distributed Computing (Euro-Par 2013), Aachen, Germany, August 26-30, 2013
We discuss observed characteristics of GPUs deployed as accelerators in an HPC cluster at Los Alamos National Laboratory. GPUs have a very good theoretical FLOPS rate, and are reasonably inexpensive and available, but they are relatively new to HPC, which demands both consistently high performance across nodes and consistently low error rate. We modified a standard acceptance procedure to test GPU performance, error rate and reliability characteristics, and ran the test suite on a Fermi HPC cluster at LANL. We discuss here our methodology for this testing, and present results relevant to the deployment of GPUs in an HPC environment. In this paper we show performance variability, power usage variability (possibly related), and some reliability concerns on the GPUs tested. We argue for rigorous testing of these devices in deployment as a way of characterizing their behavior.
