
Tackling the small imbalanced horizontal dataset regressions by Stability Selection and SMOGN: a case study of ventilation-free days prediction in the pediatric intensive care unit and the importance of PRISM.
OBJECTIVE: Regression on small, imbalanced horizontal datasets is an important problem in bioinformatics because rare but vital data points strongly affect model performance. Most clinical datasets are imbalanced in their distribution, which impairs the learning ability of regression and classification models. Combined with a small number of samples, this imbalance further reduces prediction performance. Improving the trainability of models on small imbalanced datasets would greatly strengthen current prediction models that rely on a small set of valuable, expensive samples.
MATERIALS AND METHODS: A method called Stability Selection was used to overcome the high-dimensionality problem, which arises when the sample size is small relative to the number of features. The method was applied to improve the performance of the Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise (SMOGN), an algorithm for removing imbalance. To test the new pipeline, a small imbalanced cohort of pediatric ICU patients admitted for respiratory illness was used to predict the number of Ventilator-Free Days (VFD) a patient may experience over a 28-day admission period.
RESULTS: By applying Stability Selection before SMOGN, our model overcame label imbalance and predicted almost all the non-surviving patients in the test dataset. Our study also highlighted the importance of Pediatric Risk of Mortality (PRISM) as a powerful VFD predictor when combined with other clinical features.
CONCLUSION: This paper shows how a hybrid strategy of Stability Selection, SMOGN, and regression can improve outcomes on highly imbalanced datasets and reduce the probability of highly expensive false-negative detections in severe acute respiratory disease syndrome cases. The proposed modeling pipeline can reduce the overall VFD regression error and can also be extended to other regression targets. We also showed the importance of PRISM as a strong VFD predictor.
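The ordering described in the abstract (feature selection first, imbalance correction second) can be illustrated with a minimal sketch. This is not the authors' implementation, and everything in it is an assumption made for illustration: the data are synthetic, the function names (`stability_selection`, `oversample_rare`) are hypothetical, the base selector is a simple top-k absolute-correlation filter rather than the penalized regression usually paired with Stability Selection, and the oversampler performs only a Gaussian-noise step, omitting the interpolation and undersampling that SMOGN also carries out.

```python
import random
import statistics


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def stability_selection(X, y, top_k=2, n_rounds=200, threshold=0.7, seed=0):
    """Keep features that a base selector picks in >= threshold of subsamples.

    Assumption: the base selector is a top-k |correlation| filter.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_rounds):
        rows = rng.sample(range(n), n // 2)  # half the rows, without replacement
        y_sub = [y[i] for i in rows]
        scores = [abs_correlation([X[i][j] for i in rows], y_sub) for j in range(p)]
        ranked = sorted(range(p), key=scores.__getitem__, reverse=True)
        for j in ranked[:top_k]:
            counts[j] += 1
    freqs = [c / n_rounds for c in counts]
    return [j for j, f in enumerate(freqs) if f >= threshold], freqs


def oversample_rare(X, y, is_rare, n_copies=3, noise_scale=0.05, seed=0):
    """Append Gaussian-perturbed copies of rare samples.

    Simplified stand-in for SMOGN: noise step only, no interpolation.
    """
    rng = random.Random(seed)
    p = len(X[0])
    col_sd = [statistics.pstdev([row[j] for row in X]) for j in range(p)]
    y_sd = statistics.pstdev(y)
    X_new, y_new = [list(row) for row in X], list(y)
    for row, target in zip(X, y):
        if not is_rare(target):
            continue
        for _ in range(n_copies):
            X_new.append([v + rng.gauss(0, noise_scale * sd) for v, sd in zip(row, col_sd)])
            y_new.append(target + rng.gauss(0, noise_scale * y_sd))
    return X_new, y_new


# Synthetic data (hypothetical): 80 samples, 6 features, only features 0 and 1 matter.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(6)] for _ in range(80)]
y = [3.0 * row[0] - 3.0 * row[1] + data_rng.gauss(0, 0.1) for row in X]

# Step 1: select stable features on the original, unaugmented data.
selected, freqs = stability_selection(X, y)
X_sel = [[row[j] for j in selected] for row in X]

# Step 2: oversample the rare low-target region in the reduced feature space.
n_rare = sum(1 for t in y if t < -4.0)
X_bal, y_bal = oversample_rare(X_sel, y, is_rare=lambda t: t < -4.0)
```

Running selection before oversampling means the synthetic samples are generated only in the reduced feature space, so noise features cannot distort the perturbed copies. A regression model would then be fit on `X_bal` and `y_bal`.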
Related Subject Headings
- Respiration, Artificial
- Regression Analysis
- Normal Distribution
- Medical Informatics
- Intensive Care Units, Pediatric
- Infant
- Humans
- Child, Preschool
- Child
- Algorithms