Applying machine learning to understand write performance of large-scale parallel filesystems

Publication, Conference
Xie, B; Tan, Z; Carns, P; Chase, J; Harms, K; Lofstead, J; Oral, S; Vazhkudai, SS; Wang, F
Published in: Proceedings of PDSW 2019: IEEE/ACM 4th International Parallel Data Systems Workshop - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis
November 1, 2019

In high-performance computing (HPC), I/O performance prediction offers the potential to improve the efficiency of scientific computing. In particular, accurate prediction can make runtime estimates more precise, guide users toward optimal checkpoint strategies, and better inform facility provisioning and scheduling policies. HPC I/O performance is notoriously difficult to predict and model, however, in large part because of inherent variability and a lack of transparency in the behaviors of constituent storage system components. In this work we seek to advance the state of the art in HPC I/O performance prediction by (1) modeling the mean performance to address high variability, (2) deriving model features from write patterns, system architecture, and system configurations, and (3) employing a Lasso regression model to improve model accuracy. We demonstrate the efficacy of our approach by applying it to a crucial subset of common HPC I/O motifs, namely, file-per-process checkpoint write workloads. We conduct experiments on two distinct production HPC platforms (Titan at the Oak Ridge Leadership Computing Facility and Cetus at the Argonne Leadership Computing Facility) to train and evaluate our models. We find that we can attain ≤ 30% relative error for 92.79% and 99.64% of the samples in our test set on these platforms, respectively.
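To make the evaluation criterion in the abstract concrete, the following is a minimal sketch of fitting a Lasso regression and reporting the share of test samples whose relative error is at most 30%. It is not the authors' implementation: the data is synthetic, the four features (which could stand in for quantities such as process count, bytes written per process, or stripe count) and their coefficients are hypothetical, the target is simply treated as a mean write time, and the coordinate-descent solver and the penalty value `lam=0.01` are assumptions chosen for illustration.

```python
import random


def soft_threshold(rho, lam):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def fit_lasso(X, y, lam, n_sweeps=100):
    """Minimize (1/2n)*||y - b0 - Xw||^2 + lam*||w||_1 by coordinate descent."""
    n, d = len(X), len(X[0])
    # Center features and target so the unpenalized intercept drops out.
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]
    resid = [yi - y_mean for yi in y]  # residual for w = 0
    w = [0.0] * d
    for _ in range(n_sweeps):
        for j in range(d):
            col = [Xc[i][j] for i in range(n)]
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(c * r for c, r in zip(col, resid)) / n + z * w[j]
            w_new = soft_threshold(rho, lam) / z
            # Keep the running residual consistent with the updated weight.
            delta = w_new - w[j]
            resid = [r - delta * c for r, c in zip(resid, col)]
            w[j] = w_new
    b0 = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return b0, w


def predict(b0, w, x):
    return b0 + sum(wj * xj for wj, xj in zip(w, x))


def fraction_within(preds, actual, tol):
    """Share of samples with |pred - actual| / actual <= tol."""
    hits = sum(1 for p, a in zip(preds, actual) if abs(p - a) / a <= tol)
    return hits / len(actual)


def make_samples(n, rng):
    """Synthetic data: three informative features, one irrelevant (hypothetical)."""
    X, y = [], []
    for _ in range(n):
        x = [rng.random() for _ in range(4)]
        t = 10.0 + 5.0 * x[0] + 3.0 * x[1] - 2.0 * x[2] + rng.gauss(0.0, 0.5)
        X.append(x)
        y.append(t)
    return X, y


rng = random.Random(0)
X_train, y_train = make_samples(300, rng)
X_test, y_test = make_samples(200, rng)

b0, w = fit_lasso(X_train, y_train, lam=0.01)
preds = [predict(b0, w, x) for x in X_test]
frac = fraction_within(preds, y_test, tol=0.30)
print("weights:", [round(wj, 3) for wj in w])
print("fraction of test samples within 30% relative error:", frac)
```

The L1 penalty shrinks every weight toward zero and can set weakly informative ones exactly to zero, which is one common reason to choose Lasso when many candidate features are derived from workload and system configuration. The figures printed here describe only the synthetic data and say nothing about Titan or Cetus.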

Published In

Proceedings of PDSW 2019: IEEE/ACM 4th International Parallel Data Systems Workshop - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis

DOI

10.1109/PDSW49588.2019.00008

ISBN

9781728160054

Publication Date

November 1, 2019

Start / End Page

30 / 39
 

Citation

APA: Xie, B., Tan, Z., Carns, P., Chase, J., Harms, K., Lofstead, J., … Wang, F. (2019). Applying machine learning to understand write performance of large-scale parallel filesystems. In Proceedings of PDSW 2019: IEEE/ACM 4th International Parallel Data Systems Workshop - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 30–39). https://doi.org/10.1109/PDSW49588.2019.00008

Chicago: Xie, B., Z. Tan, P. Carns, J. Chase, K. Harms, J. Lofstead, S. Oral, S. S. Vazhkudai, and F. Wang. “Applying machine learning to understand write performance of large-scale parallel filesystems.” In Proceedings of PDSW 2019: IEEE/ACM 4th International Parallel Data Systems Workshop - Held in Conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis, 30–39, 2019. https://doi.org/10.1109/PDSW49588.2019.00008.

ICMJE: Xie B, Tan Z, Carns P, Chase J, Harms K, Lofstead J, et al. Applying machine learning to understand write performance of large-scale parallel filesystems. In: Proceedings of PDSW 2019: IEEE/ACM 4th International Parallel Data Systems Workshop - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis. 2019. p. 30–9.

MLA: Xie, B., et al. “Applying machine learning to understand write performance of large-scale parallel filesystems.” Proceedings of PDSW 2019: IEEE/ACM 4th International Parallel Data Systems Workshop - Held in Conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pp. 30–39. Scopus, doi:10.1109/PDSW49588.2019.00008.

NLM: Xie B, Tan Z, Carns P, Chase J, Harms K, Lofstead J, Oral S, Vazhkudai SS, Wang F. Applying machine learning to understand write performance of large-scale parallel filesystems. Proceedings of PDSW 2019: IEEE/ACM 4th International Parallel Data Systems Workshop - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis. 2019. p. 30–39.
