Scholars@Duke publication: Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer

Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer

Publication, Conference

Zu, Y; Ghaffarkhah, A; Dang, HV; Towles, B; Hand, S; Huda, S; Bello, A; Kolbasov, A; Rezaei, A; Du, D; Lacy, S; Wang, H; Wisner, A; Lewis, C ...

Published in: Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024

January 1, 2024

TPUv4 (Tensor Processing Unit) is Google’s 3rd generation accelerator for machine learning training, deployed as a 4096-node supercomputer with a custom 3D torus interconnect. In this paper, we describe our experience designing and operating the software infrastructure that allows TPUv4 supercomputers to operate at scale, including features for automatic fault resiliency and hardware recovery. We adopt a software-defined networking (SDN) approach to manage TPUv4’s high-bandwidth inter-chip interconnect (ICI) fabric, using optical circuit switching to dynamically configure routes to work around machine, chip and link failures. Our infrastructure detects failures and automatically triggers reconfiguration to minimize disruption to running workloads, as well as initiating remediation and repair workflows for the affected components. Similar techniques interface with maintenance and upgrade workflows for both hardware and software. Our dynamic reconfiguration approach allows our TPUv4 supercomputers to achieve 99.98% system availability, gracefully handling hardware outages experienced by ~1% of the training jobs.

Duke Scholars

Author: Brian Towles, Electrical and Computer Engineering

Published In

Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024

Publication Date

January 1, 2024

Start / End Page

761 / 774

Citation

APA

Zu, Y., Ghaffarkhah, A., Dang, H. V., Towles, B., Hand, S., Huda, S., … Bahini, H. (2024). Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer. In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024 (pp. 761–774).

Chicago

Zu, Y., A. Ghaffarkhah, H. V. Dang, B. Towles, S. Hand, S. Huda, A. Bello, et al. “Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer.” In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024, 761–74, 2024.

ICMJE

Zu Y, Ghaffarkhah A, Dang HV, Towles B, Hand S, Huda S, et al. Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer. In: Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024. 2024. p. 761–74.

MLA

Zu, Y., et al. “Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer.” Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024, 2024, pp. 761–74.

NLM

Zu Y, Ghaffarkhah A, Dang HV, Towles B, Hand S, Huda S, Bello A, Kolbasov A, Rezaei A, Du D, Lacy S, Wang H, Wisner A, Lewis C, Bahini H. Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024. 2024. p. 761–774.
