NetPilot: Automating datacenter network failure mitigation

Publication, Journal Article
Wu, X; Turner, D; Chen, CC; Maltz, DA; Yang, X; Yuan, L; Zhang, M
Published in: SIGCOMM'12 - Proceedings of the ACM SIGCOMM 2012 Conference Applications, Technologies, Architectures, and Protocols for Computer Communication
September 26, 2012

Driven by the soaring demand for always-on, fast-response online services, modern datacenter networks have grown tremendously in recent years. These networks often rely on commodity hardware to reach immense scale while keeping capital expenses in check. The downside is that commodity devices are prone to failures, which poses a formidable challenge for network operators: handling these failures promptly with minimal disruption to the hosted services. Recent research efforts have focused on automatic failure localization. Yet resolving failures still requires significant human intervention, resulting in prolonged failure recovery time. Unlike previous work, NetPilot aims to quickly mitigate rather than resolve failures. NetPilot mitigates failures in much the same way operators do: by deactivating or restarting suspected offending components. NetPilot circumvents the need to know the exact root cause of a failure by taking an intelligent trial-and-error approach. The core of NetPilot comprises an Impact Estimator, which helps guard against overly disruptive mitigation actions, and a failure-specific mitigation planner, which minimizes the number of trials. We demonstrate that NetPilot can effectively mitigate several types of critical failures commonly encountered in production datacenter networks. © 2012 ACM.
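
The trial-and-error approach described in the abstract can be illustrated with a minimal Python sketch. It is not NetPilot's implementation. The `Action` type, the `mitigate` function, the callback parameters, the numeric `max_impact` threshold, and the roll-back step are all hypothetical names and assumptions introduced here for illustration. The abstract states only that candidate actions (deactivating or restarting suspected components) are tried, that an impact estimate guards against overly disruptive actions, and that a planner keeps the number of trials small.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Action:
    """A hypothetical mitigation action on a suspected component."""
    kind: str       # "deactivate" or "restart"
    component: str  # name of the suspected offending component


def mitigate(
    ranked_actions: Iterable[Action],
    estimated_impact: Callable[[Action], float],
    apply: Callable[[Action], None],
    roll_back: Callable[[Action], None],
    failure_persists: Callable[[], bool],
    max_impact: float,
) -> Optional[Action]:
    """Try actions in the given order; return the one that worked, or None.

    The order of ranked_actions stands in for the planner's output, and
    estimated_impact stands in for the impact estimator (both assumed).
    """
    for action in ranked_actions:
        # Skip any action whose estimated impact exceeds the allowed budget.
        if estimated_impact(action) > max_impact:
            continue
        apply(action)
        if not failure_persists():
            return action
        # Assumption: an unsuccessful trial is undone before the next one.
        roll_back(action)
    return None
```

In this sketch the root cause is never identified: the loop only observes whether the failure symptoms disappear after each trial, which is the sense in which mitigation differs from resolution.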

Published In

SIGCOMM'12 - Proceedings of the ACM SIGCOMM 2012 Conference Applications, Technologies, Architectures, and Protocols for Computer Communication

DOI

10.1145/2342356.2342438

Publication Date

September 26, 2012

Start / End Page

419 / 430
 

Citation

APA
Wu, X., Turner, D., Chen, C. C., Maltz, D. A., Yang, X., Yuan, L., & Zhang, M. (2012). NetPilot: Automating datacenter network failure mitigation. SIGCOMM’12 - Proceedings of the ACM SIGCOMM 2012 Conference Applications, Technologies, Architectures, and Protocols for Computer Communication, 419–430. https://doi.org/10.1145/2342356.2342438

Chicago
Wu, X., D. Turner, C. C. Chen, D. A. Maltz, X. Yang, L. Yuan, and M. Zhang. “NetPilot: Automating datacenter network failure mitigation.” SIGCOMM’12 - Proceedings of the ACM SIGCOMM 2012 Conference Applications, Technologies, Architectures, and Protocols for Computer Communication, September 26, 2012, 419–30. https://doi.org/10.1145/2342356.2342438.

ICMJE
Wu X, Turner D, Chen CC, Maltz DA, Yang X, Yuan L, et al. NetPilot: Automating datacenter network failure mitigation. SIGCOMM’12 - Proceedings of the ACM SIGCOMM 2012 Conference Applications, Technologies, Architectures, and Protocols for Computer Communication. 2012 Sep 26;419–30.

MLA
Wu, X., et al. “NetPilot: Automating datacenter network failure mitigation.” SIGCOMM’12 - Proceedings of the ACM SIGCOMM 2012 Conference Applications, Technologies, Architectures, and Protocols for Computer Communication, Sept. 2012, pp. 419–30. Scopus, doi:10.1145/2342356.2342438.

NLM
Wu X, Turner D, Chen CC, Maltz DA, Yang X, Yuan L, Zhang M. NetPilot: Automating datacenter network failure mitigation. SIGCOMM’12 - Proceedings of the ACM SIGCOMM 2012 Conference Applications, Technologies, Architectures, and Protocols for Computer Communication. 2012 Sep 26;419–430.
