FATE and DESTINI: A Framework for Recovery Testing of Large-Scale Systems
Haryadi Gunawi, University of California - Berkeley, Nov 4, 2010
Large-scale computing and data storage systems comprise thousands of low-end machines and thus require sophisticated and complex distributed software to mask poor hardware reliability. A critical factor in the availability, reliability, and performance of large-scale systems is how they react to failure. Unfortunately, failure recovery has proven to be challenging in these systems, and practitioners continue to bemoan their inability to adequately address these recovery problems.
To address this issue, Dr. Gunawi will present advances in the state of the art in testing: FATE (a failure testing service) and DESTINI (declarative testing specifications). FATE is designed to systematically push large-scale systems into thousands of possible failure scenarios. As FATE injects failures, DESTINI ensures that the target system still behaves correctly by comparing its actual behavior against the specifications. Dr. Gunawi and his collaborators have deployed their framework in three large-scale systems (HDFS, ZooKeeper, and Cassandra), explored over 40,000 failure scenarios, written 74 specifications (5 lines per spec), found 16 new bugs, and reproduced 51 old bugs. In the second part of his talk, Dr. Gunawi will briefly present the BOOM project (Berkeley Orders Of Magnitude).
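The testing loop described above (systematically inject failures, then check behavior against a specification) can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the toy write pipeline, the node names, and the functions write_block, spec_fully_replicated, and explore are not FATE's or DESTINI's actual interfaces, and the specification is written as a plain Python predicate rather than in DESTINI's declarative form. The toy recovery code contains a deliberate bug so that the exploration has something to find.

```python
from itertools import combinations

# Hypothetical toy cluster: three primary targets plus two spare nodes.
NODES = ["n1", "n2", "n3", "n4", "n5"]
REPLICATION = 3


def write_block(crashed):
    """Toy system under test: store one block on REPLICATION nodes.

    `crashed` is the set of nodes that fail during the write.
    Returns the set of nodes holding the block afterwards.
    """
    targets = NODES[:REPLICATION]
    spares = NODES[REPLICATION:]  # slicing yields a fresh list each call
    replicas = set()
    for node in targets:
        if node not in crashed:
            replicas.add(node)
        elif spares:
            spare = spares.pop(0)
            # Deliberate recovery bug: if the spare has crashed too,
            # the write gives up instead of trying the next spare.
            if spare not in crashed:
                replicas.add(spare)
    return replicas


def spec_fully_replicated(replicas):
    """Specification: a completed write leaves REPLICATION live copies."""
    return len(replicas) == REPLICATION


def explore(max_failures):
    """Enumerate every combination of up to `max_failures` crashed nodes
    and return the scenarios in which the specification is violated."""
    violations = []
    for k in range(max_failures + 1):
        for crashed in combinations(NODES, k):
            replicas = write_block(set(crashed))
            if not spec_fully_replicated(replicas):
                violations.append(crashed)
    return violations


if __name__ == "__main__":
    for scenario in explore(max_failures=2):
        print("specification violated when these nodes crash:", scenario)
```

In this toy, every single-failure scenario passes, and the bug appears only when a primary target and the first spare crash together. That is the kind of multiple-failure recovery problem that exhaustive scenario exploration is meant to expose, although a real system has far more failure points than this sketch enumerates.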