Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer

Publication, Conference
Zu, Y; Ghaffarkhah, A; Dang, HV; Towles, B; Hand, S; Huda, S; Bello, A; Kolbasov, A; Rezaei, A; Du, D; Lacy, S; Wang, H; Wisner, A; Lewis, C ...
Published in: Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024
January 1, 2024

TPUv4 (Tensor Processing Unit) is Google’s 3rd generation accelerator for machine learning training, deployed as a 4096-node supercomputer with a custom 3D torus interconnect. In this paper, we describe our experience designing and operating the software infrastructure that allows TPUv4 supercomputers to operate at scale, including features for automatic fault resiliency and hardware recovery. We adopt a software-defined networking (SDN) approach to manage TPUv4’s high-bandwidth inter-chip interconnect (ICI) fabric, using optical circuit switching to dynamically configure routes to work around machine, chip and link failures. Our infrastructure detects failures and automatically triggers reconfiguration to minimize disruption to running workloads, as well as initiating remediation and repair workflows for the affected components. Similar techniques interface with maintenance and upgrade workflows for both hardware and software. Our dynamic reconfiguration approach allows our TPUv4 supercomputers to achieve 99.98% system availability, gracefully handling hardware outages experienced by ~1% of the training jobs.
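
As a rough illustration of the topology the abstract mentions, the sketch below models a 3D torus with wraparound links and finds a path that avoids failed nodes. It is not the paper's method: the abstract describes reconfiguring the ICI fabric through optical circuit switching, whereas this toy uses a plain breadth-first search. The 16 x 16 x 16 shape is an assumption chosen only because it yields 4096 nodes, and the names `DIMS`, `neighbors`, and `shortest_path` are hypothetical.

```python
from collections import deque

# Assumption: a 16 x 16 x 16 torus, picked only because 16**3 == 4096.
# The abstract gives the node count but not the torus dimensions.
DIMS = (16, 16, 16)


def neighbors(node, dims=DIMS):
    """Return the six torus neighbors of `node`, wrapping around on each axis."""
    result = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            coords = list(node)
            coords[axis] = (coords[axis] + step) % size
            result.append(tuple(coords))
    return result


def shortest_path(src, dst, failed=frozenset(), dims=DIMS):
    """Breadth-first search from `src` to `dst` that never enters a failed node.

    Returns the path as a list of coordinates, or None if no path exists.
    """
    if src in failed or dst in failed:
        return None
    parent = {src: None}
    queue = deque([src])
    while queue:
        current = queue.popleft()
        if current == dst:
            # Walk the parent links back to the source, then reverse.
            path = []
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for nxt in neighbors(current, dims):
            if nxt not in failed and nxt not in parent:
                parent[nxt] = current
                queue.append(nxt)
    return None
```

Calling `shortest_path((0, 0, 0), (2, 0, 0), failed={(1, 0, 0)})` returns a detour that steps onto a neighboring row instead of crossing the failed node, which is the kind of workaround the abstract describes at a much larger scale.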

Published In

Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024

Publication Date

January 1, 2024

Start / End Page

761 / 774
 

Citation

APA: Zu, Y., Ghaffarkhah, A., Dang, H. V., Towles, B., Hand, S., Huda, S., … Bahini, H. (2024). Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer. In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024 (pp. 761–774).

Chicago: Zu, Y., A. Ghaffarkhah, H. V. Dang, B. Towles, S. Hand, S. Huda, A. Bello, et al. “Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer.” In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024, 761–74, 2024.

ICMJE: Zu Y, Ghaffarkhah A, Dang HV, Towles B, Hand S, Huda S, et al. Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer. In: Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024. 2024. p. 761–74.

MLA: Zu, Y., et al. “Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer.” Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024, 2024, pp. 761–74.

NLM: Zu Y, Ghaffarkhah A, Dang HV, Towles B, Hand S, Huda S, Bello A, Kolbasov A, Rezaei A, Du D, Lacy S, Wang H, Wisner A, Lewis C, Bahini H. Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024. 2024. p. 761–774.
