Scholars@Duke publication: AutoSync: Learning to synchronize for data-parallel distributed deep learning

Publication, Conference

Zhang, H; Li, Y; Deng, Z; Liang, X; Carin, L; Xing, EP

Published in: Advances in Neural Information Processing Systems

January 1, 2020

Synchronization is a key step in data-parallel distributed machine learning (ML). Different synchronization systems and strategies perform differently, and achieving optimal parallel training throughput requires synchronization strategies that adapt to model structures and cluster configurations. Existing synchronization systems often consider only one or a few synchronization aspects, which leaves the burden of choosing the right synchronization strategy to ML practitioners, who may lack the required expertise. In this paper, we develop a model- and resource-dependent representation for synchronization, which unifies multiple synchronization aspects ranging from architecture, message partitioning, and placement scheme to communication topology. Based on this representation, we build an end-to-end pipeline, AutoSync, to automatically optimize synchronization strategies given model structures and resource specifications, lowering the bar for data-parallel distributed ML. By learning from low-shot data collected in only 200 trial runs, AutoSync can discover synchronization strategies up to 1.6x better than manually optimized ones. We develop transfer-learning mechanisms to further reduce the auto-optimization cost: the simulators can transfer among similar model architectures, among similar cluster configurations, or both. We also present a dataset that contains nearly 10,000 strategy and run-time pairs on a diverse set of models and cluster specifications.

Duke Scholars

Author: Lawrence Carin, Electrical and Computer Engineering

Published In

Advances in Neural Information Processing Systems

ISSN

1049-5258

Publication Date

January 1, 2020

Volume

2020-December

Related Subject Headings

4611 Machine learning
1702 Cognitive Sciences
1701 Psychology

Citation

APA

Zhang, H., Li, Y., Deng, Z., Liang, X., Carin, L., & Xing, E. P. (2020). AutoSync: Learning to synchronize for data-parallel distributed deep learning. In Advances in Neural Information Processing Systems (Vol. 2020-December).

Chicago

Zhang, H., Y. Li, Z. Deng, X. Liang, L. Carin, and E. P. Xing. “AutoSync: Learning to synchronize for data-parallel distributed deep learning.” In Advances in Neural Information Processing Systems, Vol. 2020-December, 2020.

ICMJE

Zhang H, Li Y, Deng Z, Liang X, Carin L, Xing EP. AutoSync: Learning to synchronize for data-parallel distributed deep learning. In: Advances in Neural Information Processing Systems. 2020.

MLA

Zhang, H., et al. “AutoSync: Learning to synchronize for data-parallel distributed deep learning.” Advances in Neural Information Processing Systems, vol. 2020-December, 2020.

NLM

Zhang H, Li Y, Deng Z, Liang X, Carin L, Xing EP. AutoSync: Learning to synchronize for data-parallel distributed deep learning. Advances in Neural Information Processing Systems. 2020.
