Scholars@Duke publication: Fast Convergence to Fairness for Reduced Long Flow Tail Latency in Datacenter Networks

Fast Convergence to Fairness for Reduced Long Flow Tail Latency in Datacenter Networks

Publication, Conference

Snyder, J; Lebeck, AR

Published in: Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

January 1, 2022

Many data-intensive applications, such as distributed deep learning and data analytics, require moving vast amounts of data between compute servers in a distributed system. To meet the demands of these applications, datacenters are adopting Remote Direct Memory Access (RDMA), which has higher bandwidth and lower latency than traditional kernel-based networking. To ensure high performance of RDMA networks, congestion control manages queue depth on switches, and it has historically focused on moderating queue depth so that small flows complete quickly. Unfortunately, one side effect of many common congestion control decisions is that large flows are starved of bandwidth. This negatively impacts the flow completion time (FCT) of large, bandwidth-bound flows, which are integral to the performance of data-intensive applications. The FCT is particularly impacted at the tail, which is increasingly critical for predictable application performance. We identify the root causes of the poor performance for long flows and measure the impact. We then design mechanisms that improve long flow FCT without compromising small flow performance. Our evaluations show that these improvements reduce the 99.9th-percentile tail FCT of long flows by over 2x.

Duke Scholars

Author: Alvin R. Lebeck, Computer Science

Published In

Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

DOI

10.1109/IPDPS53621.2022.00102

Publication Date

January 1, 2022

Start / End Page

1007 / 1017

Citation

APA

Snyder, J., & Lebeck, A. R. (2022). Fast Convergence to Fairness for Reduced Long Flow Tail Latency in Datacenter Networks. In Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022 (pp. 1007–1017). https://doi.org/10.1109/IPDPS53621.2022.00102

Chicago

Snyder, J., and A. R. Lebeck. “Fast Convergence to Fairness for Reduced Long Flow Tail Latency in Datacenter Networks.” In Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022, 1007–17, 2022. https://doi.org/10.1109/IPDPS53621.2022.00102.

ICMJE

Snyder J, Lebeck AR. Fast Convergence to Fairness for Reduced Long Flow Tail Latency in Datacenter Networks. In: Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022. 2022. p. 1007–17.

MLA

Snyder, J., and A. R. Lebeck. “Fast Convergence to Fairness for Reduced Long Flow Tail Latency in Datacenter Networks.” Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022, 2022, pp. 1007–17. Scopus, doi:10.1109/IPDPS53621.2022.00102.

NLM

Snyder J, Lebeck AR. Fast Convergence to Fairness for Reduced Long Flow Tail Latency in Datacenter Networks. Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022. 2022. p. 1007–1017.
