REDfish - REsilient Dynamic dIstributed Scalable System Services for Exascale

Authors:
  • Greenberg, Hugh
  • Lang, Michael
  • Ionkov, Latchesar
  • Blanchard, Sean P.
Supercomputers are continually advancing in order to solve some of the most challenging scientific problems. The petaflop (10^15 floating-point operations per second) performance milestone has been reached, and researchers are now challenged with raising that rate to 10^18 floating-point operations per second, also known as exascale. Exascale-class systems are expected to contain millions of nodes consisting of low-power processor cores connected through multiple interconnects. System software used today was never designed to scale to these types of systems; therefore, a dramatic change is needed for system services to address the challenges of exascale. Services must be resilient, dynamic, distributed, and scalable to operate on systems of this size. To address these requirements for future system services, we describe a novel path to creating exascale-ready services by focusing on the key tenets of resilience, dynamic adaptation, fully distributed processes, and scalability. We then present a DHCP (Dynamic Host Configuration Protocol) replacement based on this design and compare it to an existing DHCP implementation. We show that the dynamic allocation of services and the ability to absorb errors make our approach superior to standard services.

© 2018 New Mexico Consortium