Finding Latent Performance Bugs in Systems Implementations (FSE 2010)

Performance is one of the later but important component in building efficient Distributed systems. Precise performance guarantees are essential for high throughput and time sensitive distributed applications, and hence developers have to chase performance. Most of the time, performance problems are uncovered in non-ideal conditions (such as delayed or dropped packets, node failures, network partitions, etc.) making it hard to identify problems especially in complex interleaved distributed executions. Moreover, common development approaches adopt periodic recovery mechanisms that eventually bring protocol correctness and node consistency. But this strategy is detrimental to performance debugging as it conceals the inability (of the protocol) to attain consistency in their absence. We refer to such problems as latent performance anomalies, which have serious design and implementation problems with big implications to responsiveness, availability and end-to-end behavior of the system.

In our FSE paper, we present MacePC – that automates the discovery of latent performance problems in distributed systems through a systematic analysis of the system. Broadly, MacePC first identifies representative execution performance of the distributed system, following which it sets out to explore the state space to identify executions that lead to anomalous run times. Any sequence of events that leads to a very large (or small) run time is flagged for developer inspection, along with a point in the execution that led to the bad performance. We implemented MacePC over Mace and evaluated it across 5 mature implementations of popular distributed protocols. In this exercise, we identified (previously unknown) bugs in the protocol/implementation that cause poor performance.

Firstly, the system is executed in a simulated environment (that imitates the intended experiment scenario) to obtain probability distributions indicative of common executions. We refer to these as Event Duration Distributions (EDDs), gathered for each type of event in the execution. The important feature of EDDs is that an execution (sequence of events) can be deterministically pinned to a total run time using just a seeded random function. This is achieved by picking a time for each event in the execution from the EDD for that event, and computing the sum of event times. Hence we can generate any number of executions of the system in the simulated environment and compute their associated run times. MacePC generates a sketch of the distributed system performance as a distribution of the run times. An execution is deemed anomalous if it falls on either side of the first or third quartile of this distribution. The next step is to identify such executions that lead to bad performance and also describe a sequence of events as evidence.

The final step in identifying problematic executions is through random state space exploration. With a large number of nodes and complex distributed protocol, the number of possible intermediate states in the system is exponential and is hence infeasible to systematically explore all valid executions. During random exploration, the simulation considers random choices for event interleaving and arrives at the end of an execution (the safety and liveness properties satisfied). If such an execution path is considered to be anomalous, we use an exponential technique to identify the exact event on the path that is problematic.

Evaluation: We have an extensive evaluation of MacePC on five distributed system implementations: Bullet’, Pastry, Bamboo, Chord and RandTree (a random tree protocol), which have been tested for correctness using MaceMC. As a sample, we describe our experiences with Bamboo here, which is an overlay ring construction protocol based on Pastry. As a standard for evaluating the performance bugs we used an experiment similar to the one described in the Bamboo paper, where the lookup consistency is measured in the presence of node churn. We identified two bugs in Bamboo, both at the time of a node join dealing with the consistency of state maintained about this node at other nodes. Both the bugs were masked by maintenance timers that eventually caught up and fixed the inconsistencies. After fixing the bugs found by MacePC, Bamboo was able to deliver better consistency, eliminating roughly 15-30% of inconsistent lookups, while also reducing latency by up to 15% under high degrees of churn. These results substantiate the importance of these performance bugs in real life situations, not just arcane nuances tickled by rare execution paths.

In conclusion, we presented MacePC – an automated tool for systematically searching and identifying latent performance bugs in system implementations. Random state space exploration allows for a feasible and quick search of the execution space to find execution paths of interest to the developer. We have shown the use of MacePC by applying it to five distributed systems, finding new performance bugs in all of them. If you like this work, we definitely want to hear from you. The code will be available from the Mace distribution very soon.

Charles Killian, Karthik Nagaraj, Salman Pervez, Ryan Braud, James W. Anderson and Ranjit Jhala. Finding Latent Performance Bugs in Systems Implementations at 18th International Symposium on the Foundations of Software Engineering (FSE 2010). PDF ACM Portal

This entry was posted in Papers. Bookmark the permalink.

2 Responses to Finding Latent Performance Bugs in Systems Implementations (FSE 2010)

  1. Pingback: Chip's Technical Blog

  2. Pingback: MacePC publicly available | MaceSystems

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>