Inter-Agency Workshop on HPC Resilience at Extreme Scale

Authors:
  • Daly, John
  • Harrod, Bill
  • Hoang, Thuc
  • Nowell, Lucy
  • Adolf, Bob
  • Borkar, Shekhar
  • DeBardeleben, Nathan
  • Elnozahy, Mootaz
  • Heroux, Mike
  • Rogers, David
  • Ross, Rob
  • Sarkar, Vivek
  • Schulz, Martin
  • Snir, Mark
  • Woodward, Paul
  • Aulwes, Rob
  • Bancroft, Marti
  • Bronevetsky, Greg
  • Carlson, Bill
  • Geist, Al
  • Hall, Mary
  • Hollingsworth, Jeff
  • Lucas, Bob
  • Lumsdaine, Andrew
  • Macaluso, Tina
  • Quinlan, Dan
  • Sachs, Sonia
  • Shalf, John
  • Smith, Tom
  • Stearley, Jon
  • Still, Bert
  • Wu, John
National Security Agency, Advanced Computing Systems, February 21-24, 2012
The following report summarizes the proceedings of a three-and-a-half-day inter-agency workshop focused on the technical challenges of HPC resilience in the 2020 Exascale timeframe. The resilience problem is not specific to any particular program or agency; coordinated resilience solutions will be challenging because of the need for a truly integrated approach. The inter-agency workshop therefore focused on articulating practical, synergetic R&D goals by assembling a small but diverse group of experts representing system hardware, system software, application developers and users, algorithms and libraries, file systems, I/O and storage, and visualization and data analytics for a collective deep dive on the problem of resilience. The workshop format was highly interactive, built around problem-solving teams of no more than ten people each. Participants were tasked to collaboratively develop a plan and roadmap for implementing resilience at extreme scale, resulting in “proof of concept” strategies for resilience on future, general purpose HPC systems in the application domains of “predictive science” and “not predictive science”. Those strategies were analyzed in the context of future Exascale requirements relative to power, performance, reliability, usability, dependability and time-to-solution. That analysis consisted of an assessment of current capabilities, gaps and dependencies, culminating in a strawman R&D roadmap for an integrated resilience framework. These outcomes demonstrate both the need for and the existence of practical resilience strategies that address the future needs of applications within the constraints of future Exascale technology.
