Evaluating the Fault-Tolerance of the CLAMR Hydrodynamics Mini-App with the F-SEFI Fault Injector

William M. Jones, Nathan DeBardeleben, Qiang Guan, Robert Robey, Brian Atkinson

This talk outlines:

  • The need for reliable supercomputing
  • Overview of the F-SEFI fault injection framework
  • Overview of the CLAMR hydrodynamics mini-app
  • Fault injection experiments
  • Examples of fault injection, visualizations, movies
  • Conclusions and future work

The PDF can be found here.
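As a rough illustration of the class of fault F-SEFI models, the C sketch below flips a single bit of a double-precision value, the sort of silent data corruption a soft error can cause in a CLAMR cell quantity. It is only a hand-rolled approximation: F-SEFI itself injects faults at the binary level, inside a virtual machine, without modifying application source, and the names here (flip_bit, rho) are invented for illustration.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Flip one bit of a double, mimicking the silent data corruption
     * a soft error can cause. Illustrative only: F-SEFI injects at the
     * binary level without touching application source. */
    static double flip_bit(double value, int bit)   /* bit in 0..63 */
    {
        uint64_t raw;
        memcpy(&raw, &value, sizeof raw);  /* reinterpret the bits safely */
        raw ^= (uint64_t)1 << bit;         /* flip the chosen bit */
        memcpy(&value, &raw, sizeof value);
        return value;
    }

    int main(void)
    {
        double rho = 1.0;  /* stand-in for a cell density in CLAMR */
        for (int bit = 50; bit < 64; bit++)  /* higher bits, larger errors */
            printf("bit %2d: %.17g -> %.17g\n", bit, rho, flip_bit(rho, bit));
        return 0;
    }

A flip in the sign or exponent bits (52-63) can change a value by many orders of magnitude, while a low-order mantissa flip may vanish in rounding; that spread in outcomes is what fault injection experiments and their visualizations explore.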

Integrated Optimization of Performance, Power and Resilience for Extreme Scale Systems

This talk was given by Nathan DeBardeleben of the USRC at SC'14. The PDF can be found here.

FATE and DESTINI: A Framework for Recovery Testing of Large-Scale Systems

Haryadi Gunawi, University of California, Berkeley, Nov 4, 2010

Large-scale computing and data-storage systems are composed of thousands of low-end machines and thus require sophisticated, complex distributed software to mask poor hardware reliability. A critical factor in the availability, reliability, and performance of large-scale systems is how they react to failure. Unfortunately, failure recovery has proven challenging in these systems, and practitioners continue to bemoan their inability to adequately address these recovery problems.

To address this issue, Dr. Gunawi will present advances in the state of the art of recovery testing: FATE and DESTINI, a failure testing service and a set of declarative testing specifications. FATE is designed to systematically push large-scale systems into thousands of possible failure scenarios. As FATE injects failures, DESTINI checks that the target system still behaves correctly by comparing the specifications against the system's actual behavior. Dr. Gunawi and his collaborators have deployed the framework in three large-scale systems (HDFS, ZooKeeper, and Cassandra), explored over 40,000 failure scenarios, written 74 specifications (5 lines per spec), found 16 new bugs, and reproduced 51 old bugs. In the second part of his talk, Dr. Gunawi will briefly present the BOOM project (Berkeley Orders Of Magnitude).
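To make the division of labor concrete, the following is a minimal C sketch of the same idea on a toy replicated-write system: an exploration loop (the FATE role) pushes the system into every (node, step) crash scenario, and a specification predicate (the DESTINI role) checks the surviving state after each run. The toy system and the names run_workload and spec_holds are assumptions made for illustration; the real framework targets distributed systems such as HDFS and expresses its specifications declaratively.

    #include <stdio.h>
    #include <stdbool.h>

    #define NODES 3
    #define STEPS 4

    /* Toy replicated write: at step n+1 the block is copied to node n,
     * unless the injected failure has already crashed that node. */
    static void run_workload(int crash_node, int crash_step,
                             bool replicated[NODES])
    {
        for (int step = 0; step < STEPS; step++)
            for (int node = 0; node < NODES; node++) {
                if (node == crash_node && step >= crash_step)
                    continue;                 /* node is down: skip it */
                if (step == node + 1)
                    replicated[node] = true;  /* replica lands on node */
            }
    }

    /* DESTINI-style specification: after the run, the block must
     * survive on at least one node that did not crash. */
    static bool spec_holds(int crash_node, const bool replicated[NODES])
    {
        for (int node = 0; node < NODES; node++)
            if (node != crash_node && replicated[node])
                return true;
        return false;
    }

    int main(void)
    {
        /* FATE-style exploration: every (node, step) failure scenario. */
        for (int node = 0; node < NODES; node++)
            for (int step = 0; step < STEPS; step++) {
                bool replicated[NODES] = { false };
                run_workload(node, step, replicated);
                printf("crash node %d at step %d: %s\n", node, step,
                       spec_holds(node, replicated) ? "ok" : "SPEC VIOLATED");
            }
        return 0;
    }

In this toy, every scenario happens to satisfy the specification; the point of the real framework is that, across tens of thousands of machine-generated scenarios, some do not, and each violation pinpoints a recovery bug.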

The Value of Supercomputing Field Data

This talk by Nathan DeBardeleben of the USRC goes over:

  • Why we need field data
  • What are LANL's/DOE's goals
  • What have we learned from our relationship with AMD so far
  • Where do we go from here

See the PDF of the talk here

Cache Injection for Parallel Applications

Edgar Leon, IBM Austin Research Laboratory, Dec 15, 2010

For two decades, the memory wall has limited the ability of many applications to benefit from improvements in processor speed. Cache injection addresses this disparity for I/O by writing data into a processor's cache directly from the I/O bus. This technique reduces data latency and, unlike data prefetching, improves memory bandwidth utilization. These improvements are significant for data-intensive applications whose performance is dominated by compulsory cache misses. In this talk, Dr. Leon presents a detailed evaluation of three injection policies and their effect on the performance of two parallel applications and several collective micro-benchmarks. He will demonstrate that the effect of cache injection on performance is a function of the communication characteristics of the application, the injection policy, the target cache, and the severity of the memory wall. For example, he will show that injecting message payloads into the L3 cache can improve the performance of network-bandwidth-limited applications. In addition, Dr. Leon will show that cache injection improves the performance of several collective operations, though not of all-to-all operations (this is implementation dependent). The study also shows negligible pollution of the target caches.
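Cache injection itself requires support in the I/O and memory subsystems and cannot be reproduced in portable user code, but the memory wall it attacks can be made concrete. The hedged C sketch below, with buffer sizes and names chosen purely for illustration, performs the same number of additions twice: once streaming over a buffer far larger than the last-level cache, so nearly every access pays a DRAM miss (like I/O data that was not injected), and once reusing a chunk small enough to stay cache-resident after its first repetition (like payloads already placed in the cache).

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define BIG   (1u << 26)   /* 64M ints = 256 MiB: beyond typical LLCs */
    #define SMALL (1u << 21)   /* 2M ints = 8 MiB: fits in many L3 caches */

    static double seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static long long sum_range(const int *buf, size_t n, int reps)
    {
        long long s = 0;
        for (int r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++)
                s += buf[i];
        return s;
    }

    int main(void)
    {
        int *buf = malloc((size_t)BIG * sizeof *buf);
        if (!buf) return 1;
        for (size_t i = 0; i < BIG; i++)
            buf[i] = 1;                    /* fault all pages in up front */

        /* Identical arithmetic in both passes; only locality differs. */
        double t0 = seconds();
        long long a = sum_range(buf, BIG, 1);             /* DRAM-bound  */
        double t1 = seconds();
        long long b = sum_range(buf, SMALL, BIG / SMALL); /* cache-warm  */
        double t2 = seconds();

        printf("DRAM-bound pass:     sum=%lld  %.3f s\n", a, t1 - t0);
        printf("cache-resident pass: sum=%lld  %.3f s\n", b, t2 - t1);
        free(buf);
        return 0;
    }

The gap between the two timings is the latency and bandwidth cost that injecting incoming payloads directly into the target cache is designed to avoid, which is also why the benefit grows with the severity of the memory wall on a given machine.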

 
