MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud

Publication, Conference
Wu, Y; Xu, Y; Chen, J; Wang, Z; Zhang, Y; Lentz, M; Zhuo, D
Published in: ACM SIGCOMM 2024 - Proceedings of the 2024 ACM SIGCOMM 2024 Conference
August 4, 2024

Performance of collective communication is critical for distributed systems. Using libraries to implement collective communication algorithms is not a good fit for a multi-tenant cloud environment because the tenant is not aware of the underlying physical network configuration or how other tenants use the shared cloud network. This lack of information prevents the library from selecting an optimal algorithm. In this paper, we explore a new approach for collective communication that more tightly integrates the implementation with the cloud network instead of the applications. We introduce MCCS, or Managed Collective Communication as a Service, which exposes traditional collective communication abstractions to applications while providing control and flexibility to the cloud provider for their implementations. Realizing MCCS involves overcoming several key challenges to integrate collective communication as part of the cloud network, including memory management of tenant GPU buffers, synchronizing changes to collective communication strategies, and supporting policies that involve cross-layer traffic optimization. Our evaluations show that MCCS improves tenant collective communication performance by up to 2.4× compared to one of the state-of-the-art collective communication libraries (NCCL), while adding more management features including dynamic algorithm adjustment, quality of service, and network-aware traffic engineering.

Published In

ACM SIGCOMM 2024 - Proceedings of the 2024 ACM SIGCOMM 2024 Conference

DOI

10.1145/3651890.3672252

Publication Date

August 4, 2024

Start / End Page

679 / 690
 

Citation

APA: Wu, Y., Xu, Y., Chen, J., Wang, Z., Zhang, Y., Lentz, M., & Zhuo, D. (2024). MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud. In ACM SIGCOMM 2024 - Proceedings of the 2024 ACM SIGCOMM 2024 Conference (pp. 679–690). https://doi.org/10.1145/3651890.3672252

Chicago: Wu, Y., Y. Xu, J. Chen, Z. Wang, Y. Zhang, M. Lentz, and D. Zhuo. “MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud.” In ACM SIGCOMM 2024 - Proceedings of the 2024 ACM SIGCOMM 2024 Conference, 679–90, 2024. https://doi.org/10.1145/3651890.3672252.

ICMJE: Wu Y, Xu Y, Chen J, Wang Z, Zhang Y, Lentz M, et al. MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud. In: ACM SIGCOMM 2024 - Proceedings of the 2024 ACM SIGCOMM 2024 Conference. 2024. p. 679–90.

MLA: Wu, Y., et al. “MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud.” ACM SIGCOMM 2024 - Proceedings of the 2024 ACM SIGCOMM 2024 Conference, 2024, pp. 679–90. Scopus, doi:10.1145/3651890.3672252.

NLM: Wu Y, Xu Y, Chen J, Wang Z, Zhang Y, Lentz M, Zhuo D. MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud. ACM SIGCOMM 2024 - Proceedings of the 2024 ACM SIGCOMM 2024 Conference. 2024. p. 679–690.
