Year: 2015

  • Guan Q, DeBardeleben N, Blanchard S, Fu S. 2015. Empirical Studies of the Soft Error Susceptibility of Sorting Algorithms. 5th Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop with HPDC 2015.
  • Wang K, Zhou X, Qiao K, Lang M, McClelland B, Raicu I. 2015. Towards Scalable Distributed Workload Manager with Monitoring-Based Weakly Consistent Resource Stealing. ACM HPDC.
  • Wang K, Qiao K, Sadooghi I, Zhou X, Li T, Lang M, Raicu I. 2015. Load-balanced and locality-aware scheduling for data-intensive workloads at extreme scales. Concurrency and Computation: Practice and Experience (00): 1-29.
  • Sridharan V, DeBardeleben N, Blanchard S, Ferreira K, Stearley J, Shalf J, Gurumurthi S. 2015. Memory Errors in Modern Systems: The Good, The Bad, and the Ugly. Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems.
  • Tiwari D, Gupta S, Rogers J, Maxwell D, Rech P, Vazhkudai S, Oliveira D, Londo D, DeBardeleben N, Navaux P and others. 2015. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. IEEE 21st International Symposium on High Performance Computer Architecture (HPCA): 331-342.
  • Huang S, Fu S, DeBardeleben N, Guan Q, Xu C. 2015. Differentiated Failure Remediation with Action Selection for Resilient Computing. IEEE Pacific Rim International Symposium on Dependable Computing (PRDC).
  • Guan Q, DeBardeleben N, Blanchard S, Fu S. 2015. Addressing Statistical Significance of Fault Injection: Empirical Studies of the Soft Error Susceptibility. IEEE Pacific Rim International Symposium on Dependable Computing (PRDC).
  • Guan Q, DeBardeleben N, Atkinson B, Robey R, Jones W. 2015. Towards Building Resilient Scientific Applications: Resilience Analysis on the Impact of Soft Error and Transient Error Tolerance with CLAMR Hydrodynamics Mini-App. IEEE Cluster 2015.

Year: 2014

  • Snir M, Wisniewski R, Abraham J, Adve S, Bagchi S, Balaji P, Belak J, Bose P, Cappello F, Carlson B and others. 2014. Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications.
  • DeBardeleben N, Blanchard S, Sridharan V, Gurumurthi S, Stearley J, Ferreira K, Shalf J. 2014. Extra Bits on SRAM and DRAM Errors - More Data From the Field. Silicon Errors in Logic - System Effects (SELSE-10), Stanford University.
  • Bautista Gomez L, Cappello F, Carro L, DeBardeleben N, Fang B, Gurumurthi S, Pattabiraman K, Rech P, Sonza Reorda M. 2014. GPGPUs: How to Combine High Computational Power with High Reliability. Design, Automation & Test in Europe (DATE14), Dresden, Germany.
  • Guan Q. 2014. F-SEFI: A Fine-grained Soft Error Fault Injector for Profiling Application Vulnerability. Poster presentation: LANL Predictive Science Panel Review, Los Alamos, NM.
  • DeBardeleben N. 2014. Reliability Requirements for GPUs in HPC. HiPEAC 2014, Vienna, Austria.
  • DeBardeleben N. 2014. Reliability Requirements for GPUs in HPC. Design, Automation & Test in Europe (DATE14), as part of "Embedded Tutorial: GPGPUs: how to combine high computational power with high reliability".
  • Atkinson B, DeBardeleben N, Guan Q, Robey R, Jones WM. 2014. Fault Injection Experiments with the CLAMR Hydrodynamics Mini-App. Software Reliability Engineering Workshops (ISSREW), 2014 IEEE International Symposium: 6-9.

Year: 2013

  • Ionkov L, Lang M, Maltzahn C. 2013. DRepl: Optimizing Access to Application Data for Analysis and Visualization.
  • Yuan X, Mahapatra S, Lang M, Pakin S. 2013. RRR: A Load Balanced Routing Scheme for Slimmed Fat-trees.
  • Pakin S, Lang M. 2013. Understanding the Performance of Two Production Supercomputers.
  • Akkan H, Lang M, Liebrock L. 2013. Understanding and isolating the noise in the Linux kernel. International Journal of High Performance Computing Applications.
  • Soltero P, Bridges P, Arnold D, Lang M. 2013. A Gossip-based Approach to Exascale System Services.
  • Akkan H, Ionkov L, Lang M. 2013. Transparently Consistent Asynchronous Shared Memory.
  • Pakin S, Lang M. 2013. Energy Modeling of Supercomputers and Large-Scale Scientific Applications. IEEE.
  • Wang K, Kulkarni A, Lang M, Arnold D, Raicu I. 2013. Using Simulation to Explore Distributed Key-Value Stores for Extreme-Scale System Services.
  • Yuan X, Mahapatra S, Nienaber W, Pakin S, Lang M. 2013. A New Routing Scheme for Jellyfish and its Performance with HPC Workloads. Supercomputing Conference.
  • Akkan H, Lang M, Ionkov L. 2013. HPC Runtime Support for Fast and Power Efficient Locking and Synchronization. IEEE.
  • Pakin S, Luang X, Lang M. 2013. Predicting the performance of extreme-scale supercomputer networks. The Next Wave (http://www.nsa.gov/research/tnw/). 20(2).
  • Huang B, Sass R, DeBardeleben N, Blanchard S. 2013. PyDac: A Resilient Run-time Framework for Divide-and-Conquer Applications on a Heterogeneous Many-core Architecture. Proceedings of the 6th Workshop on UnConventional High Performance Computing 2013 (UCHPC 2013).
  • DeBardeleben N, Blanchard S, Monroe L, Romero P, Grunau D, Idler C, Wright C. 2013. GPU Behavior on a Large HPC Cluster. 6th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids in conjunction with the 19th International European Conference on Parallel and Distributed Computing (Euro-Par 2013), Aachen, Germany.
  • Jian X, Blanchard S, DeBardeleben N, Sridharan V, Kumar R. 2013. Reliability Models for Double Chipkill Detect/Correct Memory Systems. SELSE (Silicon Errors in Logic, System Effects): 6.
  • Snir M, Wisniewski RW, Abraham JA, Adve SV, Bagchi S, Balaji P, Belak J, Bose P, Cappello F, Carlson B and others. 2013. Addressing Failures in Exascale Computing. Argonne National Laboratory Technical Report.
  • Jian X, DeBardeleben N, Blanchard S, Sridharan V, Kumar R. 2013. Analyzing Reliability of Memory Subsystems with Double Chipkill Detect/Correct. The 19th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2013). Vancouver, BC, Canada.
  • Sridharan V, Stearley J, DeBardeleben N, Blanchard S, Gurumurthi S. 2013. Feng Shui of Supercomputer Memory: Positional Effects in DRAM and SRAM Faults. SC13, Denver, Colorado.

Year: 2012

  • Kulkarni A, Wang K, Lang M. 2012. Exploring the Design Tradeoffs for Exascale System Services Through Simulation.
  • Kulkarni A, Lumsdaine A, Lang M, Ionkov L. 2012. Optimizing Latency and Throughput for Spawning Processes on Massively Multicore Processors.
  • Akkan H, Lang M, Liebrock LM. 2012. Stepping Towards Noiseless Linux Environment.
  • Kulkarni A, Manzanares A, Ionkov L, Lang M, Lumsdaine A. 2012. The Design and Implementation of a Multi-level Content-Addressable Checkpoint File System.
  • Jones WM, Daly JT, DeBardeleben N. 2012. Application monitoring and checkpointing in HPC: looking towards exascale systems. Proceedings of the 50th Annual Southeast Regional Conference: 262-267.
  • DeBardeleben N, Blanchard S, Guan Q, Zhang Z, Fu S. 2012. Experimental Framework for Injecting Logic Errors in a Virtual Machine to Profile Applications for Soft Error Resilience. Euro-Par 2011: Parallel Processing Workshops Lecture Notes in Computer Science. 7156: 282-291.
  • Geist A, Snir M, Roman E, Still B, Clay R, Engelmann C, Ross R, Schulz M, Krishnamoorthy S, Vishnu A and others. 2012. US Department of Energy Fault Management Workshop Report.
  • Daly J, Harrod B, Hoang T, Nowell L, Adolf B, Borkar S, DeBardeleben N, Elnozahy M, Heroux M, Rogers D and others. 2012. Inter-Agency Workshop on HPC Resilience at Extreme Scale.

Year: 2011

  • DeBardeleben N, Blanchard SP, Fu S, Guan Q, Zhang Z. 2011. Experimental Framework for Injecting Logic Errors in a Virtual Machine to Profile Applications for Soft Error Resilience.
  • Kulkarni A, Lang M, Lumsdaine A. 2011. GoDEL: A multidirectional dataflow execution model for large-scale computing.
  • Ionkov L. 2011. Gostor: Storage beyond POSIX.
  • Greenberg H, Lang M, Ionkov L, Blanchard SP. 2011. REDfish - REsilient Dynamic dIstributed Scalable System Services for Exascale.