Scholars@Duke publication: Automated analyses of risk of bias and critical appraisal of systematic reviews (ROBIS and AMSTAR 2): a comparison of the performance of 4 large language models.

Automated analyses of risk of bias and critical appraisal of systematic reviews (ROBIS and AMSTAR 2): a comparison of the performance of 4 large language models.

Publication, Journal Article

Forero, DA; Abreu, SE; Tovar, BE; Oermann, MH

Published in: Journal of the American Medical Informatics Association : JAMIA

September 2025

To explore the performance of 4 large language model (LLM) chatbots for the analysis of 2 of the most commonly used tools for the advanced analysis of systematic reviews (SRs) and meta-analyses.

We explored the performance of 4 LLM chatbots (ChatGPT, Gemini, DeepSeek, and QWEN) for the analysis of ROBIS and AMSTAR 2 tools (sample sizes: 20 SRs), in comparison with assessments by human experts.

Gemini showed the best agreement with human experts for both ROBIS and AMSTAR 2 (accuracy: 58% and 70%). The second best LLM chatbots were ChatGPT and QWEN, for ROBIS and AMSTAR 2, respectively.

Some LLM chatbots underestimated the risk of bias or overestimated the confidence of the results in published SRs, which is compatible with recent articles for other tools.

This is one of the first studies comparing the performance of several LLM chatbots for the automated analyses of ROBIS and AMSTAR 2.

Duke Scholars

Marilyn Haag Oermann (Author), School of Nursing

Published In

Journal of the American Medical Informatics Association : JAMIA

DOI

10.1093/jamia/ocaf117

EISSN

1527-974X

ISSN

1067-5027

Publication Date

September 2025

Volume

32

Issue

9

Start / End Page

1471 / 1476

Related Subject Headings

Systematic Reviews as Topic
Natural Language Processing
Meta-Analysis as Topic
Medical Informatics
Large Language Models
Language
Humans
Bias
46 Information and computing sciences
42 Health sciences

Citation

Forero, D. A., Abreu, S. E., Tovar, B. E., & Oermann, M. H. (2025). Automated analyses of risk of bias and critical appraisal of systematic reviews (ROBIS and AMSTAR 2): a comparison of the performance of 4 large language models. Journal of the American Medical Informatics Association : JAMIA, 32(9), 1471–1476. https://doi.org/10.1093/jamia/ocaf117

Forero, Diego A., Sandra E. Abreu, Blanca E. Tovar, and Marilyn H. Oermann. “Automated analyses of risk of bias and critical appraisal of systematic reviews (ROBIS and AMSTAR 2): a comparison of the performance of 4 large language models.” Journal of the American Medical Informatics Association : JAMIA 32, no. 9 (September 2025): 1471–76. https://doi.org/10.1093/jamia/ocaf117.

Forero DA, Abreu SE, Tovar BE, Oermann MH. Automated analyses of risk of bias and critical appraisal of systematic reviews (ROBIS and AMSTAR 2): a comparison of the performance of 4 large language models. Journal of the American Medical Informatics Association : JAMIA. 2025 Sep;32(9):1471–6.

Forero, Diego A., et al. “Automated analyses of risk of bias and critical appraisal of systematic reviews (ROBIS and AMSTAR 2): a comparison of the performance of 4 large language models.” Journal of the American Medical Informatics Association : JAMIA, vol. 32, no. 9, Sept. 2025, pp. 1471–76. Epmc, doi:10.1093/jamia/ocaf117.
