
Collie: Finding Performance Anomalies in RDMA Subsystems

Publication, Conference
Kong, X; Zhu, Y; Zhou, H; Jiang, Z; Ye, J; Guo, C; Zhuo, D
Published in: Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022
January 1, 2022

High-speed RDMA networks are being rapidly adopted in industry for their low latency and reduced CPU overhead. To verify that RDMA can be used in production, system administrators need to understand the set of application workloads that can trigger abnormal performance behaviors (e.g., unexpectedly low throughput, PFC pause frame storms). We design and implement Collie, a tool that lets users systematically uncover performance anomalies in RDMA subsystems without needing access to internal hardware designs. Instead of testing each hardware device (e.g., NIC, memory, PCIe) individually, Collie is holistic: it constructs a comprehensive search space of application workloads. Collie then uses simulated annealing to drive RDMA-related performance and diagnostic counters to extreme value regions, finding workloads that can trigger performance anomalies. We evaluate Collie on combinations of various RDMA NICs, CPUs, and other hardware components. Collie found 15 new performance anomalies, all of which were acknowledged by the hardware vendors; 7 have already been fixed since we reported them. We also present our experience using Collie to avoid performance anomalies in an RDMA RPC library and an RDMA distributed machine learning framework.
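The counter-driven search described in the abstract can be illustrated with a generic simulated annealing loop. The following is a minimal sketch and not Collie's implementation: the abstract does not specify the search dimensions, the counters, or the cooling schedule, so the workload fields (`num_qps`, `msg_bytes`, `sge_per_wqe`), the geometric cooling, and the `toy_pause_counter` function are all assumptions made for illustration. In a real setup the score would come from running the workload and reading a hardware counter.

```python
import math
import random

# Hypothetical search dimensions; the paper's actual search space is not given here.
SPACE = {
    "num_qps": [1, 8, 64, 512, 4096],
    "msg_bytes": [64, 512, 4096, 65536],
    "sge_per_wqe": [1, 2, 4, 8],
}


def neighbor(workload, rng):
    """Return a copy of the workload with one randomly chosen dimension changed."""
    new = dict(workload)
    key = rng.choice(sorted(SPACE))
    new[key] = rng.choice([v for v in SPACE[key] if v != workload[key]])
    return new


def toy_pause_counter(workload):
    """Synthetic stand-in for a measured diagnostic counter (not real hardware behavior)."""
    return (
        math.log2(workload["num_qps"])
        * workload["sge_per_wqe"]
        / math.log2(workload["msg_bytes"])
    )


def anneal(initial, neighbor_fn, score, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Search for a workload that maximizes score() using simulated annealing."""
    rng = random.Random(seed)
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = neighbor_fn(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept regressions with probability exp(delta / T).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


start = {"num_qps": 1, "msg_bytes": 65536, "sge_per_wqe": 1}
best, best_score = anneal(start, neighbor, toy_pause_counter)
print(best, best_score)
```

Accepting occasional regressions early on, while the temperature is high, lets the search leave regions where the counter is flat; as the temperature falls, the loop behaves more like plain hill climbing toward an extreme counter value.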

Published In

Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022

Publication Date

January 1, 2022

Start / End Page

287 / 305
 

Citation

APA: Kong, X., Zhu, Y., Zhou, H., Jiang, Z., Ye, J., Guo, C., & Zhuo, D. (2022). Collie: Finding Performance Anomalies in RDMA Subsystems. In Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022 (pp. 287–305).
Chicago: Kong, X., Y. Zhu, H. Zhou, Z. Jiang, J. Ye, C. Guo, and D. Zhuo. “Collie: Finding Performance Anomalies in RDMA Subsystems.” In Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022, 287–305, 2022.
ICMJE: Kong X, Zhu Y, Zhou H, Jiang Z, Ye J, Guo C, et al. Collie: Finding Performance Anomalies in RDMA Subsystems. In: Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022. 2022. p. 287–305.
MLA: Kong, X., et al. “Collie: Finding Performance Anomalies in RDMA Subsystems.” Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022, 2022, pp. 287–305.
NLM: Kong X, Zhu Y, Zhou H, Jiang Z, Ye J, Guo C, Zhuo D. Collie: Finding Performance Anomalies in RDMA Subsystems. Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022. 2022. p. 287–305.
