Conventional distributed systems wisdom is to treat slow nodes the same as failed nodes, through the use of leases and timeouts, handling merely slow nodes by effectively rebooting or restarting them.
While this wisdom has led to simpler systems that can effectively handle high churn while also continuing to function in the face of high network delays, it is ill suited to more modern managed environments, where many processes are co-located in a small geographic region such as a campus or data center. In these environments, delays are commonly just that—delays. Moreover, a failed process is likely to be restarted quite quickly, so that if state is effectively maintained, it need not be erased as it would by conventional wisdom.
In our paper, we propose that conventional wisdom be rebooted for managed distributed systems: that we should instead treat failed nodes as slow nodes, allowing the system to more gracefully handle common scenarios such as power failures.
We present Ken protocol, a lightweight distributed reliability model that is well suited to developing highly-survivable applications for these environments and allows the developer to focus on the crash-free behavior of applications.
We also demonstrate how this model can be effortlessly integrated with Mace toolkits for building large-scale distributed systems, yielding MaceKen, to support survivable application development. Moreover, this model allows multiple, independently developed application components to be seamlessly composed, preserving the reliability benefits across systems.
Our work will be released as an open source in soon. Stay tuned.
Sunghwan Yoo, Charles Killian, Terence Kelly, Hyoun Kyu Cho, and Steven Plite. Composable Reliability for Asynchronous Systems: Treating Failures as Slow Processes, In proceedings of 2012 USENIX Annual Technical Conference (USENIX ATC ’12). Boston, MA. (to appear) 13-15 June, 2012.
