NetPilot: Automating datacenter network failure mitigation
The soaring demand for always-on, fast-response online services has driven modern datacenter networks to undergo tremendous growth. These networks often rely on scale-out designs with large numbers of commodity switches to reach immense capacity while keeping capital expenses in check. The downside is that more devices mean more failures, which raises a formidable challenge for network operators: handling these failures promptly with minimal disruption to the hosted services. Recent research efforts have focused on automatic failure localization. Yet resolving failures still requires significant human intervention, resulting in prolonged failure recovery time. Unlike previous work, NetPilot aims to quickly mitigate rather than resolve failures. NetPilot mitigates failures in much the same way operators do - by deactivating or restarting suspected offending components. NetPilot circumvents the need to know the exact root cause of a failure by taking an intelligent trial-and-error approach. The core of NetPilot comprises an Impact Estimator that helps guard against overly disruptive mitigation actions and a failure-specific mitigation planner that minimizes the number of trials. We demonstrate that NetPilot can effectively mitigate several types of critical failures commonly encountered in production datacenter networks. Copyright 2012 ACM.
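The trial-and-error idea in the abstract can be illustrated with a minimal sketch. This is not NetPilot's implementation: the `Action` type, the `mitigate` function, all callback names, the `max_impact` threshold, and the rollback step are hypothetical, chosen only to show how an impact check might gate each trial while an ordered candidate list (standing in for the planner's output) limits the number of trials.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Action:
    """A hypothetical mitigation action on one suspected component."""
    kind: str        # e.g. "deactivate" or "restart"
    component: str   # illustrative component identifier


def mitigate(
    candidates: Iterable[Action],
    estimate_impact: Callable[[Action], float],
    apply: Callable[[Action], None],
    rollback: Callable[[Action], None],
    failure_persists: Callable[[], bool],
    max_impact: float,
) -> Optional[Action]:
    """Try candidate actions in the given order until the failure clears.

    `candidates` is assumed to be ordered already (most likely culprit
    first), so fewer trials are needed. Actions whose estimated impact
    exceeds `max_impact` are skipped without being tried.
    """
    for action in candidates:
        # Guard: skip actions judged too disruptive (assumed threshold).
        if estimate_impact(action) > max_impact:
            continue
        apply(action)
        if not failure_persists():
            return action      # failure mitigated; keep the action in place
        rollback(action)       # assumption: undo a trial that did not help
    return None                # no acceptable action mitigated the failure
```

In this sketch the caller supplies every callback, so the same loop can be exercised against a simulated network before anything touches real devices.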
Published In
DOI
EISSN
ISSN
Publication Date
Volume
Issue
Start / End Page
Related Subject Headings
- Networking & Telecommunications
- 4606 Distributed computing and systems software
- 4006 Communications engineering
- 1005 Communications Technologies
- 0805 Distributed Computing
- 0803 Computer Software