Fault Localization in Large-Scale Computing Systems

Video Link: https://www.youtube.com/watch?v=jOxZfDC-fYM



Duration: 1:02:56


We describe a new fault localization technique for software bugs in large-scale computing systems. Our technique continuously collects per-process function call traces from a target system and uses them to derive a concise execution model that reflects the system's normal function-calling behavior. To find the cause of a failure, we compare the derived model with the traces collected when the system failed, and compute a suspect score that quantifies how likely a particular part of the call traces is to explain the failure. The execution model consists of a call probability for each function in the system, estimated from the normal traces. Functions with low probabilities in the model yield high suspect scores when they are called during a failure. Conversely, functions that are called frequently in the model also yield high scores when they are not called. Finally, we report the function call sequences ranked by suspect score to help the human analyst narrow the fault down to a small part of the overall system. We have applied the method to localizing a known non-deterministic bug in a distributed parallel job manager. Experimental results on a three-site, 78-node distributed environment show that the method quickly locates an anomalous event that is highly correlated with the bug, which indicates the effectiveness of the approach.
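The scoring idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no exact formula, so the sketch assumes that a function's call probability is the fraction of normal traces in which it appears, and that its suspect score is 1 - p when it is called in the failure trace and p when it is not. It also ranks individual functions rather than call sequences, which is a simplification. The names `build_model` and `suspect_scores` and the sample traces are hypothetical.

```python
from collections import Counter


def build_model(normal_traces):
    """Estimate, for each function, the fraction of normal traces that call it.

    Assumption: one trace is a list of function names from one normal run.
    """
    if not normal_traces:
        raise ValueError("at least one normal trace is required")
    counts = Counter()
    for trace in normal_traces:
        counts.update(set(trace))  # count each function once per trace
    total = len(normal_traces)
    return {fn: n / total for fn, n in counts.items()}


def suspect_scores(model, failure_trace):
    """Rank functions by how unexpected their presence or absence is.

    Assumed scoring rule (not taken from the source):
      called in the failure trace     -> score = 1 - p  (rare calls score high)
      not called in the failure trace -> score = p      (missing common calls score high)
    """
    called = set(failure_trace)
    scores = {}
    for fn in set(model) | called:
        p = model.get(fn, 0.0)  # functions never seen in normal runs get p = 0
        scores[fn] = 1.0 - p if fn in called else p
    # Highest score first; ties are broken by function name for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical traces, for illustration only.
normal = [
    ["init", "connect", "send", "close"],
    ["init", "connect", "send", "close"],
    ["init", "connect", "send", "retry", "close"],
    ["init", "connect", "send", "close"],
]
failed = ["init", "connect", "retry", "abort"]

ranking = suspect_scores(build_model(normal), failed)
for fn, score in ranking:
    print(f"{fn:8s} {score:.2f}")
```

In this toy data, `abort` (never seen in normal runs) and the missing `send` and `close` calls rank at the top, followed by the rarely called `retry`, while `init` and `connect` score zero because they behave as the model expects.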







Tags:
microsoft research