Scholars@Duke publication: Automated Data Harmonization in Clinical Research: Natural Language Processing Approach.

Publication, Journal Article

Mallya, P; Henao, R; Hong, C; Wojdyla, D; Schibler, T; Manchanda, V; Pencina, M; Hall, J; Zhao, J

Published in: JMIR Form Res

August 27, 2025

BACKGROUND: Integrating data is essential for advancing clinical and epidemiological research. However, because datasets often describe variables (eg, demographics and health conditions) in diverse ways, the process of integrating and harmonizing variables from research studies remains a major bottleneck.

OBJECTIVE: The objective was to assess a natural language processing-based method that automates variable harmonization, as a scalable approach to integrating multiple datasets.

METHODS: We developed a fully connected neural network (FCN) method, enhanced with contrastive learning, that uses domain-specific embeddings from the Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) language representation model. The method was developed using 3 cardiovascular datasets: the Atherosclerosis Risk in Communities study, the Framingham Heart Study, and the Multi-Ethnic Study of Atherosclerosis. We used metadata variable descriptions and curated harmonized concepts as ground truth. We framed the problem as a paired sentence classification task. The accuracy of this method was compared with that of a logistic regression baseline method. To assess the generalizability of the trained models, we also evaluated their performance by separating the 3 datasets when preparing the training and validation sets.

RESULTS: The newly developed FCN achieved a top-5 accuracy of 98.95% (95% CI 98.31%-99.47%) and an area under the receiver operating characteristic curve (AUC) of 0.99 (95% CI 0.98-0.99), outperforming the standard logistic regression model, which exhibited a top-5 accuracy of 22.23% (95% CI 19.91%-24.87%) and an AUC of 0.82 (95% CI 0.81-0.83). The contrastive learning enhancement also outperformed the logistic regression model, although it fell slightly below the base FCN model, exhibiting a top-5 accuracy of 89.88% (95% CI 87.88%-91.68%) and an AUC of 0.98 (95% CI 0.97-0.98).

CONCLUSIONS: This novel approach provides a scalable solution for harmonizing metadata across large-scale cohort studies. The proposed method significantly improves performance over the baseline method by using learned representations to categorize harmonized concepts more accurately for cohorts in cardiovascular disease and stroke.
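The abstract reports top-5 accuracy but does not spell out how it is computed. The following is a minimal sketch of one common reading, assuming that each variable description is scored against every candidate harmonized concept (the paired classification framing) and that a prediction counts as correct when the curated concept is among the k highest-scoring candidates. The function name `top_k_accuracy`, the toy concept labels, and the scores are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence


def top_k_accuracy(
    pair_scores: Sequence[Dict[str, float]],
    true_concepts: Sequence[str],
    k: int = 5,
) -> float:
    """Fraction of variables whose true concept is among the k best-scored candidates.

    pair_scores[i] maps each candidate concept to the classifier score for the
    pair (variable description i, concept). Higher means a more likely match.
    """
    if len(pair_scores) != len(true_concepts):
        raise ValueError("pair_scores and true_concepts must have equal length")
    if not true_concepts:
        raise ValueError("at least one variable is required")

    hits = 0
    for concept_scores, truth in zip(pair_scores, true_concepts):
        # Rank candidate concepts from highest to lowest score.
        ranked = sorted(concept_scores, key=concept_scores.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_concepts)


# Hypothetical scores for three variable descriptions and three concepts.
toy_scores = [
    {"age": 0.91, "sex": 0.05, "systolic_bp": 0.20},
    {"age": 0.10, "sex": 0.40, "systolic_bp": 0.70},
    {"age": 0.30, "sex": 0.20, "systolic_bp": 0.85},
]
toy_truths = ["age", "sex", "systolic_bp"]

print(top_k_accuracy(toy_scores, toy_truths, k=1))
print(top_k_accuracy(toy_scores, toy_truths, k=2))
```

In this toy data the second variable's true concept is ranked second, so it is missed at k=1 and recovered at k=2. A model's top-5 accuracy can therefore be much higher than its top-1 accuracy.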

Duke Scholars

Author Ricardo Henao Biostatistics & Bioinformatics, Division of Translational Bi ...

Author Daniel Wojdyla

Author Michael J Pencina Biostatistics & Bioinformatics, Division of Biostatistics

Author Chuan Hong Biostatistics & Bioinformatics, Division of Translational Bi ...

Published In

JMIR Form Res

DOI

10.2196/75608

EISSN

2561-326X

Publication Date

August 27, 2025

Volume

9

Start / End Page

e75608

Location

Canada

Related Subject Headings

Neural Networks, Computer
Natural Language Processing
Humans
Data Mining
Biomedical Research
42 Health sciences
32 Biomedical and clinical sciences

Citation

APA

Mallya, P., Henao, R., Hong, C., Wojdyla, D., Schibler, T., Manchanda, V., … Zhao, J. (2025). Automated Data Harmonization in Clinical Research: Natural Language Processing Approach. JMIR Form Res, 9, e75608. https://doi.org/10.2196/75608

Chicago

Mallya, Pratheek, Ricardo Henao, Chuan Hong, Daniel Wojdyla, Tony Schibler, Vihaan Manchanda, Michael Pencina, Jennifer Hall, and Juan Zhao. “Automated Data Harmonization in Clinical Research: Natural Language Processing Approach.” JMIR Form Res 9 (August 27, 2025): e75608. https://doi.org/10.2196/75608.

ICMJE

Mallya P, Henao R, Hong C, Wojdyla D, Schibler T, Manchanda V, et al. Automated Data Harmonization in Clinical Research: Natural Language Processing Approach. JMIR Form Res. 2025 Aug 27;9:e75608.

MLA

Mallya, Pratheek, et al. “Automated Data Harmonization in Clinical Research: Natural Language Processing Approach.” JMIR Form Res, vol. 9, Aug. 2025, p. e75608. Pubmed, doi:10.2196/75608.

NLM

Mallya P, Henao R, Hong C, Wojdyla D, Schibler T, Manchanda V, Pencina M, Hall J, Zhao J. Automated Data Harmonization in Clinical Research: Natural Language Processing Approach. JMIR Form Res. 2025 Aug 27;9:e75608.
