This paper describes our work on Distalyzer, a tool for automatically diagnosing performance problems in distributed systems. It was accepted for publication at NSDI 2012 and is joint work by Karthik Nagaraj, Charles Killian and Jennifer Neville.
Diagnosing and correcting performance issues in modern, large-scale distributed systems can be a daunting task: a single developer is unlikely to be familiar with the entire system, and it is hard to characterize the behavior of a software system without fully understanding its internal components. Moreover, distributed systems are extremely complex, both because of the code itself and because the network can introduce unpredictable delays and orderings.
Distalyzer is an automated tool that supports developers investigating performance issues in distributed systems. We aim to leverage the vast log data available from large-scale systems while reducing the level of knowledge a developer needs to use our tool. Specifically, given two sets of logs, one with good performance and one with bad, Distalyzer uses machine learning techniques to compare the system behaviors extracted from the logs and automatically infer the strongest associations between system components and performance.
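To make the idea of comparing a good and a bad set of logs concrete, here is a minimal sketch. It is not Distalyzer's implementation: the text above says only that machine learning techniques are used, so this sketch assumes a simple stand-in, Welch's t statistic, to rank features by how strongly their values differ between the two sets. The feature names, the input layout (one numeric value per log instance) and the function names are all hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    diff = mean(a) - mean(b)
    std_err = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if std_err == 0:
        # Both samples are constant: no difference, or an unbounded one.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / std_err


def rank_features(good, bad):
    """Rank features by how strongly they differ between the two log sets.

    `good` and `bad` map a feature name to a list of numeric values,
    one value per log instance (assumed layout, at least two values each).
    Returns (name, t) pairs, largest |t| first; positive t means the
    feature takes larger values in the bad set.
    """
    shared = good.keys() & bad.keys()
    scores = {name: welch_t(bad[name], good[name]) for name in shared}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical features extracted from two sets of runs.
good_logs = {
    "rpc_latency_ms": [10, 12, 11, 9, 10],
    "disk_flush_ms": [5, 6, 5, 7, 6],
}
bad_logs = {
    "rpc_latency_ms": [11, 10, 12, 10, 11],
    "disk_flush_ms": [25, 30, 28, 27, 31],
}

for name, t in rank_features(good_logs, bad_logs):
    print(f"{name}: t = {t:.2f}")
```

With this made-up data, `disk_flush_ms` ranks first because its values shift far more between the two sets than those of `rpc_latency_ms`, which is the kind of pointer a developer could use to decide where to look first.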
We’ve released the source code and the logs used in the paper, and posted the final version of the paper on Distalyzer’s webpage.