Scholars@Duke publication: Hoplite: Efficient and fault-tolerant collective communication for task-based distributed systems

Publication, Conference

Zhuang, S; Li, Z; Zhuo, D; Wang, S; Liang, E; Nishihara, R; Moritz, P; Stoica, I

Published in: SIGCOMM 2021 Proceedings of the ACM SIGCOMM 2021 Conference

August 9, 2021

Task-based distributed frameworks (e.g., Ray, Dask, Hydro) have become increasingly popular for distributed applications that contain asynchronous and dynamic workloads, including asynchronous gradient descent, reinforcement learning, and model serving. As more data-intensive applications move to run on top of task-based systems, collective communication efficiency has become an important problem. Unfortunately, traditional collective communication libraries (e.g., MPI, Horovod, NCCL) are an ill fit, because they require the communication schedule to be known before runtime and they do not provide fault tolerance. We design and implement Hoplite, an efficient and fault-tolerant collective communication layer for task-based distributed systems. Our key technique is to compute data transfer schedules on the fly and execute the schedules efficiently through fine-grained pipelining. At the same time, when a task fails, the data transfer schedule adapts quickly to allow other tasks to keep making progress. We apply Hoplite to a popular task-based distributed framework, Ray. We show that Hoplite speeds up asynchronous stochastic gradient descent, reinforcement learning, and serving an ensemble of machine learning models that are difficult to execute efficiently with traditional collective communication by up to 7.8x, 3.9x, and 3.3x, respectively.
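The benefit of fine-grained pipelining mentioned in the abstract can be illustrated with a small back-of-the-envelope cost model. The sketch below is not Hoplite's scheduler or API; it is a textbook store-and-forward model, and the function name `relay_time`, its parameters, and the example numbers are all hypothetical. It assumes every link has the same bandwidth and ignores per-message latency.

```python
def relay_time(size_bytes, bandwidth, hops, chunks=1):
    """Seconds for an object to cross `hops` store-and-forward links.

    Illustrative model only (an assumption, not taken from the paper):
    each link has the same `bandwidth` in bytes per second, there is no
    per-message latency, and a node forwards a chunk as soon as it has
    fully received that chunk. With chunks=1 the whole object must land
    on a node before it is forwarded (no pipelining).
    """
    if hops < 1 or chunks < 1:
        raise ValueError("hops and chunks must be at least 1")
    if size_bytes < 0 or bandwidth <= 0:
        raise ValueError("size must be non-negative and bandwidth positive")
    chunk_time = size_bytes / chunks / bandwidth
    # The first chunk needs `hops` steps to reach the end of the chain;
    # each remaining chunk arrives one step after the previous one.
    return (chunks + hops - 1) * chunk_time


# Hypothetical numbers: a 1 GB object, 10 Gbps links, relayed over 4 hops.
size = 1e9
bandwidth = 1.25e9
whole_object = relay_time(size, bandwidth, hops=4)
pipelined = relay_time(size, bandwidth, hops=4, chunks=1000)
print(f"whole-object relay: {whole_object:.4f} s")
print(f"pipelined relay:    {pipelined:.4f} s")
```

Under these assumptions the unpipelined transfer time grows linearly with the number of hops, while the pipelined time approaches that of a single transfer as the chunks get smaller. In practice, per-chunk overhead limits how small the chunks can usefully be.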

Duke Scholars

Danyang Zhuo (Author), Computer Science

Published In

SIGCOMM 2021 Proceedings of the ACM SIGCOMM 2021 Conference

DOI

10.1145/3452296.3472897

Publication Date

August 9, 2021

Start / End Page

641 / 656

Citation

APA

Zhuang, S., Li, Z., Zhuo, D., Wang, S., Liang, E., Nishihara, R., … Stoica, I. (2021). Hoplite: Efficient and fault-tolerant collective communication for task-based distributed systems. In SIGCOMM 2021 Proceedings of the ACM SIGCOMM 2021 Conference (pp. 641–656). https://doi.org/10.1145/3452296.3472897

Chicago

Zhuang, S., Z. Li, D. Zhuo, S. Wang, E. Liang, R. Nishihara, P. Moritz, and I. Stoica. “Hoplite: Efficient and fault-tolerant collective communication for task-based distributed systems.” In SIGCOMM 2021 Proceedings of the ACM SIGCOMM 2021 Conference, 641–56, 2021. https://doi.org/10.1145/3452296.3472897.

ICMJE

Zhuang S, Li Z, Zhuo D, Wang S, Liang E, Nishihara R, et al. Hoplite: Efficient and fault-tolerant collective communication for task-based distributed systems. In: SIGCOMM 2021 Proceedings of the ACM SIGCOMM 2021 Conference. 2021. p. 641–56.

MLA

Zhuang, S., et al. “Hoplite: Efficient and fault-tolerant collective communication for task-based distributed systems.” SIGCOMM 2021 Proceedings of the ACM SIGCOMM 2021 Conference, 2021, pp. 641–56. Scopus, doi:10.1145/3452296.3472897.

NLM

Zhuang S, Li Z, Zhuo D, Wang S, Liang E, Nishihara R, Moritz P, Stoica I. Hoplite: Efficient and fault-tolerant collective communication for task-based distributed systems. SIGCOMM 2021 Proceedings of the ACM SIGCOMM 2021 Conference. 2021. p. 641–656.
