Co-designing the failure analysis and monitoring of large-scale systems

Abhishek Chandra; Rohini Prinja; Sourabh Jain; Zhi Li Zhang

doi:10.1145/1453175.1453178

Computer Science and Engineering

Research output: Chapter in Book/Report/Conference proceeding › Chapter

9 Scopus citations

Abstract

Large-scale distributed systems provide the backbone for numerous distributed applications and online services. These systems span a multitude of computing nodes located at different geographical locations and connected via wide-area networks and overlays. A major concern with such systems is their susceptibility to failures, which lead to service downtime and hence high monetary/business costs. In this paper, we argue that to understand failures in such a system, we need to co-design the monitoring system with the failure analysis system. Unlike existing monitoring systems, which are not designed specifically for failure analysis, we advocate a new way to design a monitoring system with the goal of uncovering the causes of failures. Similarly, the failure analysis techniques themselves need to go beyond simple statistical analysis of failure events in isolation to serve as an effective tool. Towards this end, we discuss some guiding principles for the co-design of monitoring and failure analysis systems for planetary-scale systems.

Original language	English (US)
Title of host publication	Performance Evaluation Review
Publisher	Association for Computing Machinery
Pages	10-15
Number of pages	6
Volume	36
Edition	2
DOIs	https://doi.org/10.1145/1453175.1453178
State	Published - 2008

@inbook{d6ca84b7535a49188e764b26c01e21f6,

title = "Co-designing the failure analysis and monitoring of large-scale systems",

abstract = "Large-scale distributed systems provide the backbone for numerous distributed applications and online services. These systems span over a multitude of computing nodes located at different geographical locations connected together via wide-area networks and overlays. A major concern with such systems is their susceptibility to failures leading to downtime of services and hence high monetary/business costs. In this paper, we argue that to understand failures in such a system, we need to co-design monitoring system with the failure analysis system. Unlike existing monitoring systems which are not designed specifically for failure analysis, we advocate a new way to design a monitoring system with the goal of uncovering causes of failures. Similarly the failure analysis techniques themselves need to go beyond simple statistical analysis of failure events in isolation to serve as an effective tool. Towards this end, we provide a discussion of some guiding principles for the co-design of monitoring and failure analysis systems for planetary scale systems.",

author = "Abhishek Chandra and Rohini Prinja and Sourabh Jain and Zhang, {Zhi Li}",

year = "2008",

doi = "10.1145/1453175.1453178",

language = "English (US)",

volume = "36",

pages = "10--15",

booktitle = "Performance Evaluation Review",

publisher = "Association for Computing Machinery",

edition = "2",

}

TY - CHAP

T1 - Co-designing the failure analysis and monitoring of large-scale systems

AU - Chandra, Abhishek

AU - Prinja, Rohini

AU - Jain, Sourabh

AU - Zhang, Zhi Li

PY - 2008

Y1 - 2008

N2 - Large-scale distributed systems provide the backbone for numerous distributed applications and online services. These systems span over a multitude of computing nodes located at different geographical locations connected together via wide-area networks and overlays. A major concern with such systems is their susceptibility to failures leading to downtime of services and hence high monetary/business costs. In this paper, we argue that to understand failures in such a system, we need to co-design monitoring system with the failure analysis system. Unlike existing monitoring systems which are not designed specifically for failure analysis, we advocate a new way to design a monitoring system with the goal of uncovering causes of failures. Similarly the failure analysis techniques themselves need to go beyond simple statistical analysis of failure events in isolation to serve as an effective tool. Towards this end, we provide a discussion of some guiding principles for the co-design of monitoring and failure analysis systems for planetary scale systems.

AB - Large-scale distributed systems provide the backbone for numerous distributed applications and online services. These systems span over a multitude of computing nodes located at different geographical locations connected together via wide-area networks and overlays. A major concern with such systems is their susceptibility to failures leading to downtime of services and hence high monetary/business costs. In this paper, we argue that to understand failures in such a system, we need to co-design monitoring system with the failure analysis system. Unlike existing monitoring systems which are not designed specifically for failure analysis, we advocate a new way to design a monitoring system with the goal of uncovering causes of failures. Similarly the failure analysis techniques themselves need to go beyond simple statistical analysis of failure events in isolation to serve as an effective tool. Towards this end, we provide a discussion of some guiding principles for the co-design of monitoring and failure analysis systems for planetary scale systems.

UR - http://www.scopus.com/inward/record.url?scp=77956459832&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77956459832&partnerID=8YFLogxK

U2 - 10.1145/1453175.1453178

DO - 10.1145/1453175.1453178

M3 - Chapter

AN - SCOPUS:77956459832

VL - 36

SP - 10

EP - 15

BT - Performance Evaluation Review

PB - Association for Computing Machinery

ER -
