Classification performance bias between training and test sets in a limited mammography dataset.

Publication, Journal Article
Hou, R; Lo, JY; Marks, JR; Hwang, ES; Grimm, LJ
Published in: PLoS One
2024

OBJECTIVES: To assess the performance bias caused by sampling data into training and test sets in a mammography radiomics study.

METHODS: Mammograms from 700 women were used to study upstaging of ductal carcinoma in situ. The dataset was repeatedly shuffled and split into training (n = 400) and test (n = 300) sets forty times. For each split, cross-validation was used for training, followed by assessment on the test set. Logistic regression with regularization and support vector machines were used as the machine learning classifiers. For each split and classifier type, multiple models were created based on radiomics and/or clinical features.

RESULTS: Area under the curve (AUC) performance varied considerably across the different data splits (e.g., radiomics regression model: train 0.58-0.70, test 0.59-0.73). The regression models showed a tradeoff in which better training performance led to worse test performance and vice versa. Cross-validation over all cases reduced this variability, but samples of 500+ cases were required to yield representative estimates of performance.

CONCLUSIONS: In medical imaging, clinical datasets are often relatively small. Models built from different training sets may not be representative of the whole dataset. Depending on the selected data split and model, performance bias could lead to inappropriate conclusions that might influence the clinical significance of the findings.

ADVANCES IN KNOWLEDGE: Performance bias can result from model testing with limited datasets. Optimal strategies for test set selection should be developed to ensure that study conclusions are appropriate.
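The split-sensitivity effect described in the methods can be illustrated with a minimal sketch: 700 cases are repeatedly shuffled into a 400-case training set and a 300-case test set, and test AUC is measured for each split. The synthetic data, the single-feature "score" used as the classifier output, and the distribution parameters are all hypothetical stand-ins, not the study's radiomics pipeline (which fit regularized logistic regression and SVMs via cross-validation).

```python
# Sketch of the repeated shuffle-and-split protocol described above.
# All data here are synthetic; only the sample sizes (700 cases,
# 400 train / 300 test, 40 splits) follow the study design.
import random

random.seed(0)

# Hypothetical stand-in dataset: (score, label) pairs, where the score
# is a weakly informative feature and label=1 marks an upstaged case.
cases = [(random.gauss(0.6 if i < 175 else 0.4, 0.3), int(i < 175))
         for i in range(700)]

def auc(pairs):
    """Rank-based AUC: probability that a positive outscores a negative."""
    pos = [s for s, y in pairs if y == 1]
    neg = [s for s, y in pairs if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

test_aucs = []
for split in range(40):                      # 40 random splits, as in the study
    random.shuffle(cases)
    train, test = cases[:400], cases[400:]   # n = 400 train, n = 300 test
    # A real pipeline would fit a classifier on `train` with
    # cross-validation; here the raw score itself plays the model's role.
    test_aucs.append(auc(test))

print(f"test AUC range across splits: {min(test_aucs):.2f}-{max(test_aucs):.2f}")
```

Even though every split draws from the same 700 cases, the printed test-AUC range spans several hundredths, which is the performance bias the abstract reports.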


Published In

PLoS One

DOI

10.1371/journal.pone.0282402

EISSN

1932-6203

Publication Date

2024

Volume

19

Issue

2

Start / End Page

e0282402

Location

United States

Related Subject Headings

  • Retrospective Studies
  • Mammography
  • Machine Learning
  • Humans
  • General Science & Technology
  • Female
 

Citation

Hou, R., Lo, J. Y., Marks, J. R., Hwang, E. S., & Grimm, L. J. (2024). Classification performance bias between training and test sets in a limited mammography dataset. PLoS One, 19(2), e0282402. https://doi.org/10.1371/journal.pone.0282402
