## Year: 2017

- Tan L, DeBardeleben N, Guan Q, Blanchard S, Lang M. 2017. RSVP: Soft Error Resilient Power Savings at Near-Threshold Voltage using Register Vulnerability. The 3rd International Workshop on Recent Advances in the DependabIlity AssessmeNt of Complex systEms (RADIANCE).
- Tan L, DeBardeleben N, Guan Q, Blanchard S, Lang M. 2017. Using Virtualization to Quantify Power Conservation via Near-Threshold Voltage Reduction for Inherently Resilient Applications. Parallel Computing.
- Otstott D, Ionkov L, Lang M, Zhao M. 2017. TCASM: An asynchronous shared memory interface for high-performance application composition. Parallel Computing. 63: 61-78.
- Wu P, DeBardeleben N, Guan Q, Blanchard S, Chen J, Tao D, Liang X, Ouyang K, Chen Z. 2017. Silent Data Corruption Resilient Two-sided Matrix Factorizations. Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 415–427, ACM, Austin, Texas, USA, 2017, ISBN: 978-1-4503-4493-7.

## Year: 2016

- Baseman E, Blanchard S, Li Z, Fu S. 2016. Relational Synthesis of Text and Numeric Data for Anomaly Detection on Computing System Logs. 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 882-885.
- Baseman E, DeBardeleben N, Ferreira K, Levy S, Raasch S, Sridharan V, Siddiqua T, Guan Q. 2016. Improving DRAM Fault Characterization through Machine Learning. 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop (DSN-W), pp. 250-253.
- Fang B, Wu P, Guan Q, DeBardeleben N, Monroe L, Blanchard S, Chen Z, Pattabiraman K, Ripeanu M. 2016. SDC is in the Eye of the Beholder: A Survey and Preliminary Study. 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop (DSN-W), pp. 72-76.
- DeBardeleben N. 2016. Extreme scale and bleeding edge technology lead to a need for resilient high performance computing systems. 2016 IEEE International Reliability Physics Symposium (IRPS), pp. 3B-1-1-3B-1-8.
- Wu P, Guan Q, DeBardeleben N, Blanchard S, Tao D, Liang X, Chen J, Chen Z. 2016. Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra. Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, pp. 31–42, ACM, Kyoto, Japan, 2016, ISBN: 978-1-4503-4314-5.
- Fang B, Wu P, Guan Q, DeBardeleben N, Monroe L, Blanchard S, Chen Z, Pattabiraman K, Ripeanu M. 2016. SDC is in the Eye of the Beholder: A Survey and Preliminary Study. 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, DSN Workshops 2016, Toulouse, France, June 28 - July 1, 2016, pp. 72–76.
- Monroe L, Jones WM, Lavigne SR, IV CD, Guan Q, DeBardeleben N. 2016. On the Inherent Resilience of Integer Operations. Euro-Par 2016: Parallel Processing Workshops - Euro-Par 2016 International Workshops, Grenoble, France, August 24-26, 2016, Revised Selected Papers, pp. 648–659.

## Year: 2015

- Guan Q, DeBardeleben N, Blanchard S, Fu S. 2015. Empirical Studies of the Soft Error Susceptibility of Sorting Algorithms. 5th Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop with HPDC 2015.
- Wang K, Zhou X, Qiao K, Lang M, McClelland B, Raicu I. 2015. Towards Scalable Distributed Workload Manager with Monitoring-Based Weakly Consistent Resource Stealing. ACM HPDC.
- Wang K, Qiao K, Sadooghi I, Zhou X, Li T, Lang M, Raicu I. 2015. Load-balanced and locality-aware scheduling for data-intensive workloads at extreme scales. Concurrency and Computation: Practice and Experience (00): 1-29.
- Sridharan V, DeBardeleben N, Blanchard S, Ferreira K, Stearley J, Shalf J, Gurumurthi S. 2015. Memory Errors in Modern Systems: The Good, The Bad, and the Ugly. Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems.
- Tiwari D, Gupta S, Rogers J, Maxwell D, Rech P, Vazhkudai S, Oliveira D, Londo D, DeBardeleben N, Navaux P and others. 2015. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. IEEE 21st International Symposium on High Performance Computer Architecture (HPCA): 331-342.
- Huang S, Fu S, DeBardeleben N, Guan Q, Xu C. 2015. Differentiated Failure Remediation with Action Selection for Resilient Computing. IEEE Pacific Rim International Symposium on Dependable Computing (PRDC).
- Guan Q, DeBardeleben N, Blanchard S, Fu S. 2015. Addressing Statistical Significance of Fault Injection: Empirical Studies of the Soft Error Susceptibility. IEEE Pacific Rim International Symposium on Dependable Computing (PRDC).
- Guan Q, DeBardeleben N, Atkinson B, Robey R, Jones W. 2015. Towards Building Resilient Scientific Applications: Resilience Analysis on the Impact of Soft Error and Transient Error Tolerance with CLAMR Hydrodynamics Mini-App. IEEE Cluster 2015.
- DeBardeleben N, Blanchard S, Kaeli D, Rech P. 2015. Field, experimental, and analytical data on large-scale HPC systems and evaluation of the implications for exascale system design. 2015 IEEE 33rd VLSI Test Symposium (VTS), pp. 1-2, 2015, ISSN: 1093-0167.

## Year: 2014

- Snir M, Wisniewski R, Abraham J, Adve S, Bagchi S, Balaji P, Belak J, Bose P, Cappello F, Carlson B and others. 2014. Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications.
- DeBardeleben N, Blanchard S, Sridharan V, Gurumurthi S, Stearley J, Ferreira K, Shalf J. 2014. Extra Bits on SRAM and DRAM Errors - More Data From the Field. Silicon Errors in Logic - System Effects (SELSE-10), Stanford University.
- Bautista Gomez L, Cappello F, Carro L, DeBardeleben N, Fang B, Gurumurthi S, Pattabiraman K, Rech P, Sonza Reorda M. 2014. GPGPUs: How to Combine High Computational Power with High Reliability. Design, Automation & Test in Europe (DATE14), Dresden, Germany.
- Guan Q. 2014. F-SEFI: A Fine-grained Soft Error Fault Injector for Profiling Application Vulnerability. Poster presentation: LANL Predictive Science Panel Review, Los Alamos, NM.
- DeBardeleben N. 2014. Reliability Requirements for GPUs in HPC. HiPEAC 2014, Vienna, Austria.
- DeBardeleben N. 2014. Reliability Requirements for GPUs in HPC. Design, Automation & Test in Europe (DATE14), as part of "Embedded Tutorial: GPGPUs: how to combine high computational power with high reliability".
- Atkinson B, DeBardeleben N, Guan Q, Robey R, Jones WM. 2014. Fault Injection Experiments with the CLAMR Hydrodynamics Mini-App. Software Reliability Engineering Workshops (ISSREW), 2014 IEEE International Symposium: 6-9.

## Year: 2013

- Ionkov L, Lang M, Maltzahn C. 2013. DRepl: Optimizing Access to Application Data for Analysis and Visualization.
- Yuan X, Mahapatra S, Lang M, Pakin S. 2013. RRR: A Load Balanced Routing Scheme for Slimmed Fat-trees.
- Pakin S, Lang M. 2013. Understanding the Performance of Two Production Supercomputers.
- Akkan H, Lang M, Liebrock L. 2013. Understanding and isolating the noise in the Linux kernel. International Journal of High Performance Computing Applications.
- Soltero P, Bridges P, Arnold D, Lang M. 2013. A Gossip-based Approach to Exascale System Services.
- Akkan H, Ionkov L, Lang M. 2013. Transparently Consistent Asynchronous Shared Memory.
- Pakin S, Lang M. 2013. Energy Modeling of Supercomputers and Large-Scale Scientific Applications. IEEE.
- Wang K, Kulkarni A, Lang M, Arnold D, Raicu I. 2013. Using Simulation to Explore Distributed Key-Value Stores for Extreme-Scale System Services.
- Yuan X, Mahapatra S, Nienaber W, Pakin S, Lang M. 2013. A New Routing Scheme for Jellyfish and its Performance with HPC Workloads. Supercomputing Conference.
- Akkan H, Lang M, Ionkov L. 2013. HPC Runtime Support for Fast and Power Efficient Locking and Synchronization. IEEE.
- Pakin S, Luang X, Lang M. 2013. Predicting the performance of extreme-scale supercomputer networks. The Next Wave (http://www.nsa.gov/research/tnw/). 20(2).
- Huang B, Sass R, DeBardeleben N, Blanchard S. 2013. PyDac: A Resilient Run-time Framework for Divide-and-Conquer Applications on a Heterogeneous Many-core Architecture. Proceedings of the 6th Workshop on UnConventional High Performance Computing 2013 (UCHPC 2013).
- DeBardeleben N, Blanchard S, Monroe L, Romero P, Grunau D, Idler C, Wright C. 2013. GPU Behavior on a Large HPC Cluster. 6th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids in conjunction with the 19th International European Conference on Parallel and Distributed Computing (Euro-Par 2013), Aachen, Germany.
- Jian X, Blanchard S, DeBardeleben N, Sridharan V, Kumar R. 2013. Reliability Models for Double Chipkill Detect/Correct Memory Systems. SELSE (Silicon Errors in Logic, System Effects): 6.
- Snir M, Wisniewski RW, Abraham JA, Adve SV, Bagchi S, Balaji P, Belak J, Bose P, Cappello F, Carlson B and others. 2013. Addressing Failures in Exascale Computing. Argonne National Laboratory Technical Report.
- Jian X, DeBardeleben N, Blanchard S, Sridharan V, Kumar R. 2013. Analyzing Reliability of Memory Subsystems with Double Chipkill Detect/Correct. The 19th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2013). Vancouver, BC, Canada.
- Sridharan V, Stearley J, DeBardeleben N, Blanchard S, Gurumurthi S. 2013. Feng Shui of Supercomputer Memory: Positional Effects in DRAM and SRAM Faults. SC13, Denver Colorado.

## Year: 2012

- Kulkarni A, Wang K, Lang M. 2012. Exploring the Design Tradeoffs for Exascale System Services Through Simulation.
- Kulkarni A, Lumsdaine A, Lang M, Ionkov L. 2012. Optimizing Latency and Throughput for Spawning Processes on Massively Multicore Processors.
- Akkan H, Lang M, Liebrock LM. 2012. Stepping Towards Noiseless Linux Environment.
- Kulkarni A, Manzanares A, Ionkov L, Lang M, Lumsdaine A. 2012. The Design and Implementation of a Multi-level Content-Addressable Checkpoint File System.
- Jones WM, Daly JT, DeBardeleben N. 2012. Application monitoring and checkpointing in HPC: looking towards exascale systems. Proceedings of the 50th Annual Southeast Regional Conference: 262-267.
- DeBardeleben N, Blanchard S, Guan Q, Zhang Z, Fu S. 2012. Experimental Framework for Injecting Logic Errors in a Virtual Machine to Profile Applications for Soft Error Resilience. Euro-Par 2011: Parallel Processing Workshops Lecture Notes in Computer Science. 7156: 282-291.
- Geist A, Snir M, Roman E, Still B, Clay R, Engelmann C, Ross R, Schulz M, Krishnamoorthy S, Vishnu A and others. 2012. US Department of Energy Fault Management Workshop Report.
- Daly J, Harrod B, Hoang T, Nowell L, Adolf B, Borkar S, DeBardeleben N, Elnozahy M, Heroux M, Rogers D and others. 2012. Inter-Agency Workshop on HPC Resilience at Extreme Scale.

## Year: 2011

- DeBardeleben N, Blanchard SP, Fu S, Guan Q, Zhang Z. 2011. Experimental Framework for Injecting Logic Errors in a Virtual Machine to Profile Applications for Soft Error Resilience.
- Kulkarni A, Lang M, Lumsdaine A. 2011. GoDEL: A multidirectional dataflow execution model for large-scale computing.
- Ionkov L. 2011. Gostor: Storage beyond POSIX.
- Greenberg H, Lang M, Ionkov L, Blanchard SP. 2011. REDfish - REsilient Dynamic dIstributed Scalable System Services for Exascale.
