Machine learning performance for a small dataset: random oversampling improves data imbalances and fairness.

Publication, Journal Article
Wang, L; Shi, E; Meyers, B; Vlachos, P; Tcheng, J; Denardo, S
Published in: BMC Med Res Methodol
February 4, 2026

BACKGROUND: Percutaneous coronary intervention (PCI) can be complicated by major adverse cardiovascular events (MACE; death, myocardial infarction [MI], and target vessel revascularization). Statistical models facilitated by machine learning (ML) can improve prediction of MACE compared with conventional models. However, two key challenges impair ML performance: (1) class imbalance in outcome distributions; and (2) bias affecting fairness across social subgroups. These challenges are amplified in small datasets. Additionally, no published ML model specifically addresses the influence of social determinants of health (SDoH) on PCI outcomes. We hypothesized that random oversampling would improve sensitivity and fairness when assessing the effect of SDoH on select PCI-associated MACE in a small dataset.

METHODS: We employed multivariable logistic regression to predict 180-day MACE and new MI following urgent PCI in a small dataset (N = 481). Three SDoH were pre-specified variables: race, marital status, and socioeconomic status at extremes of the spectrum (uninsured and Medicaid versus private insurance). Random oversampling was applied to the SDoH and outcomes while maintaining a consistent ratio between positive and negative outcomes.

RESULTS: In the imbalanced dataset, the logistic regression revealed no significant association between SDoH and 180-day MACE or new MI (minimum P-value = 0.47). Although sensitivity for event detection was low (≤ 0.26), other metrics, including positive predictive value (PPV), were commendable (≥ 0.80). ML classifiers trained on oversampled data for race and marital status showed increased sensitivity but decreased PPV as the ratio of adverse to favorable outcomes increased. Other performance metrics remained stable. Equalized odds disparity decreased with increasing oversampling ratio, indicating improved fairness. However, for socioeconomic-extremes status, the low number of adverse events and small subgroup size limited the effectiveness of oversampling. The optimal oversampling ratio cut-point across all outcomes and SDoH was 0.30-0.40 (sensitivity 0.50; PPV 0.52).

CONCLUSIONS: In small, imbalanced datasets, random oversampling can improve sensitivity and fairness of ML classifiers when evaluating PCI outcomes across race and marital status. However, this improvement comes at the expense of decreased PPV. The decrease in equalized odds disparity indicates improved balance in the performance of the classifiers across these two subgroups.
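
The quantities named in the abstract (oversampling ratio, sensitivity, PPV, equalized odds disparity) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names are hypothetical, and it assumes binary labels (1 = adverse outcome, 0 = favorable), oversampling of adverse rows only, and equalized odds disparity defined as the larger of the between-group gaps in true-positive and false-positive rate, which is one common definition and may differ from the one used in the paper.

```python
import random


def oversample_to_ratio(rows, labels, ratio, seed=0):
    """Duplicate random adverse rows until adverse/favorable is about `ratio`.

    Hypothetical helper: all original rows are kept, and only copies of
    adverse-outcome rows (label 1) are appended.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    n_neg = len(labels) - len(pos_idx)
    if not pos_idx or n_neg == 0:
        raise ValueError("need at least one adverse and one favorable row")
    target_pos = round(ratio * n_neg)
    n_extra = max(0, target_pos - len(pos_idx))
    extra = [rng.choice(pos_idx) for _ in range(n_extra)]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def confusion(y_true, y_pred):
    """Return (tp, fn, fp, tn) for binary labels and predictions."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1:
            fn += 1
        elif p == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def sensitivity_and_ppv(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); PPV = TP / (TP + FP)."""
    tp, fn, fp, _ = confusion(y_true, y_pred)
    if tp + fn == 0 or tp + fp == 0:
        raise ValueError("no adverse cases or no positive predictions")
    return tp / (tp + fn), tp / (tp + fp)


def equalized_odds_disparity(y_true, y_pred, groups):
    """Larger of the between-group gaps in TPR and FPR (assumed definition)."""
    tprs, fprs = [], []
    for g in set(groups):
        yt = [t for t, gg in zip(y_true, groups) if gg == g]
        yp = [p for p, gg in zip(y_pred, groups) if gg == g]
        tp, fn, fp, tn = confusion(yt, yp)
        if tp + fn == 0 or fp + tn == 0:
            # Small subgroups with no events make the rates undefined.
            raise ValueError(f"group {g!r} lacks adverse or favorable cases")
        tprs.append(tp / (tp + fn))
        fprs.append(fp / (fp + tn))
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


# Toy data: 10 adverse and 90 favorable rows, oversampled to a 0.5 ratio.
rows = list(range(100))
labels = [1] * 10 + [0] * 90
new_rows, new_labels = oversample_to_ratio(rows, labels, ratio=0.5)
print(sum(new_labels), len(new_labels))  # 45 135
```

In a sketch like this, oversampling would normally be applied to the training split only, with sensitivity, PPV, and the disparity measured on untouched held-out data. Otherwise duplicated rows can appear on both sides of the split and inflate the metrics.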

Published In

BMC Med Res Methodol

DOI

10.1186/s12874-026-02779-3

EISSN

1471-2288

Publication Date

February 4, 2026

Volume

26

Issue

1

Location

England

Related Subject Headings

  • General & Internal Medicine
  • 4206 Public health
  • 4202 Epidemiology
 

Citation

APA: Wang, L., Shi, E., Meyers, B., Vlachos, P., Tcheng, J., & Denardo, S. (2026). Machine learning performance for a small dataset: random oversampling improves data imbalances and fairness. BMC Med Res Methodol, 26(1). https://doi.org/10.1186/s12874-026-02779-3
Chicago: Wang, Lin, Elliott Shi, Brett Meyers, Pavlos Vlachos, James Tcheng, and Scott Denardo. “Machine learning performance for a small dataset: random oversampling improves data imbalances and fairness.” BMC Med Res Methodol 26, no. 1 (February 4, 2026). https://doi.org/10.1186/s12874-026-02779-3.
ICMJE: Wang L, Shi E, Meyers B, Vlachos P, Tcheng J, Denardo S. Machine learning performance for a small dataset: random oversampling improves data imbalances and fairness. BMC Med Res Methodol. 2026 Feb 4;26(1).
MLA: Wang, Lin, et al. “Machine learning performance for a small dataset: random oversampling improves data imbalances and fairness.” BMC Med Res Methodol, vol. 26, no. 1, Feb. 2026. PubMed, doi:10.1186/s12874-026-02779-3.
NLM: Wang L, Shi E, Meyers B, Vlachos P, Tcheng J, Denardo S. Machine learning performance for a small dataset: random oversampling improves data imbalances and fairness. BMC Med Res Methodol. 2026 Feb 4;26(1).