Unsupervised Coverage Sampling to Enhance Clinical Chart Review Coverage for Computable Phenotype Development: Simulation and Empirical Study.
BACKGROUND: Developing computable phenotypes (CPs) based on electronic health record (EHR) data requires "gold-standard" labels for the outcome of interest. To generate these labels, clinicians typically review a subset of patient charts. Charts to be reviewed are most often randomly sampled from the larger set of patients of interest. However, random sampling may fail to capture the diversity of the patient population, particularly if smaller subpopulations exist among those with the condition of interest. This can lead to poorly performing and biased CPs. OBJECTIVE: This study aimed to propose an unsupervised sampling approach designed to better capture a diverse patient cohort and improve the information coverage of chart review samples. METHODS: Our coverage sampling method starts by clustering the patient population of interest. We then perform stratified sampling from each cluster to ensure even representation in the chart review sample. We introduce a novel metric, nearest neighbor distance, to evaluate the coverage of the generated sample. To evaluate our method, we first conducted a simulation study to model and compare the performance of random sampling versus our proposed coverage sampling. We varied the size and number of subpopulations within the larger cohort. Finally, we applied our approach to a real-world data set to develop a CP for hospitalization due to COVID-19. We evaluated the sampling strategies based on their information coverage as well as the performance of the learned CP, measured by the area under the receiver operating characteristic curve. RESULTS: Our simulation studies show that the unsupervised coverage sampling approach provides broader coverage of patient populations compared to random sampling. When there are no underlying subpopulations, random and coverage sampling perform equally well for CP development.
When there are subgroups, coverage sampling achieves area under the receiver operating characteristic curve gains of approximately 0.03-0.05 over random sampling. In the real-world application, the approach also outperformed random sampling, generating both a more representative sample and an area under the receiver operating characteristic curve improvement of 0.02 (95% CI -0.08 to 0.04). CONCLUSIONS: The proposed coverage sampling method is an easy-to-implement approach that produces a chart review sample that is more representative of the source population. This allows one to learn a CP that performs better both for subpopulations and for the overall cohort. Studies that aim to develop CPs should consider sampling strategies other than random sampling of patient charts.
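The two steps described under METHODS (cluster the population, then sample evenly from each cluster) and the coverage metric can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name a clustering algorithm, an allocation rule, or a formula for nearest neighbor distance, so everything here is an assumption: a plain k-means, an equal number of charts per cluster, and coverage measured as the mean distance from each patient to the closest sampled patient (lower meaning better coverage). The function names `kmeans`, `coverage_sample`, and `mean_nn_distance` and the synthetic cohort are hypothetical.

```python
import numpy as np


def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means (an assumed choice); returns one label per row."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n, k).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Assign labels against the final centers.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)


def coverage_sample(X, n_sample, k, seed=0):
    """Cluster, then draw an equal number of rows from each cluster.

    Equal allocation is an assumption; clusters smaller than their
    share contribute all members, and the shortfall is filled at
    random from the rows not yet chosen.
    """
    rng = np.random.default_rng(seed)
    labels = kmeans(X, k, rng)
    per_cluster = n_sample // k
    chosen = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        take = min(per_cluster, len(members))
        chosen.extend(rng.choice(members, size=take, replace=False))
    short = n_sample - len(chosen)
    if short > 0:
        remaining = np.setdiff1d(np.arange(len(X)), chosen)
        chosen.extend(rng.choice(remaining, size=short, replace=False))
    return np.array(chosen)


def mean_nn_distance(X, sample_idx):
    """Assumed coverage metric: mean distance from each row of X
    to its nearest sampled row (lower means better coverage)."""
    d = np.linalg.norm(X[:, None, :] - X[sample_idx][None, :, :], axis=2)
    return d.min(axis=1).mean()


# Synthetic cohort: one large group and one small, distant subgroup.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (950, 2)), rng.normal(6.0, 0.5, (50, 2))])

random_idx = rng.choice(len(X), size=40, replace=False)
coverage_idx = coverage_sample(X, n_sample=40, k=4, seed=0)

print("random sampling  :", mean_nn_distance(X, random_idx))
print("coverage sampling:", mean_nn_distance(X, coverage_idx))
```

In practice the rows of `X` would be patient feature vectors derived from the EHR, and the number of clusters `k` would need to be chosen for the cohort at hand. Both choices are outside what the abstract specifies.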
Related Subject Headings
- SARS-CoV-2
- Phenotype
- Humans
- Electronic Health Records
- Computer Simulation
- Cluster Analysis
- COVID-19
- 4203 Health services and systems