Automated analyses of risk of bias and critical appraisal of systematic reviews (ROBIS and AMSTAR 2): a comparison of the performance of 4 large language models.
Objective: To explore the performance of 4 large language model (LLM) chatbots in applying 2 of the most commonly used tools for the advanced appraisal of systematic reviews (SRs) and meta-analyses.
Methods: We explored the performance of 4 LLM chatbots (ChatGPT, Gemini, DeepSeek, and QWEN) in completing ROBIS and AMSTAR 2 assessments (sample size: 20 SRs), in comparison with assessments by human experts.
Results: Gemini showed the best agreement with human experts for both ROBIS and AMSTAR 2 (accuracy: 58% and 70%, respectively). The second-best LLM chatbots were ChatGPT for ROBIS and QWEN for AMSTAR 2.
Conclusions: Some LLM chatbots underestimated the risk of bias or overestimated the confidence in the results of published SRs, consistent with recent reports for other appraisal tools. This is one of the first studies comparing the performance of several LLM chatbots for automated ROBIS and AMSTAR 2 assessments.
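The abstract reports agreement as accuracy against human expert assessments. A minimal sketch of how such per-review agreement (raw accuracy plus a chance-corrected measure such as Cohen's kappa) could be computed is shown below; the rating labels, example data, and function names are illustrative assumptions, not the authors' actual analysis code.

```python
# Illustrative sketch (not the authors' code): scoring agreement between an LLM's
# overall ROBIS/AMSTAR 2 judgements and human expert judgements for a set of reviews.
# The three-level rating scale and the example ratings below are hypothetical.
from collections import Counter


def accuracy(llm_ratings, expert_ratings):
    """Proportion of reviews where the LLM and the human expert assign the same rating."""
    assert len(llm_ratings) == len(expert_ratings)
    matches = sum(l == e for l, e in zip(llm_ratings, expert_ratings))
    return matches / len(expert_ratings)


def cohens_kappa(llm_ratings, expert_ratings):
    """Chance-corrected agreement (Cohen's kappa) between the two raters."""
    n = len(expert_ratings)
    po = accuracy(llm_ratings, expert_ratings)                 # observed agreement
    llm_counts, exp_counts = Counter(llm_ratings), Counter(expert_ratings)
    labels = set(llm_counts) | set(exp_counts)
    pe = sum(llm_counts[c] * exp_counts[c] for c in labels) / n**2  # agreement expected by chance
    return (po - pe) / (1 - pe) if pe != 1 else 1.0


# Hypothetical overall judgements for 20 systematic reviews.
expert = ["low", "high", "high", "unclear", "low"] * 4
llm    = ["low", "low",  "high", "unclear", "low"] * 4

print(f"accuracy: {accuracy(llm, expert):.0%}, kappa: {cohens_kappa(llm, expert):.2f}")
```

Reporting kappa alongside raw accuracy would distinguish genuine LLM-expert concordance from agreement expected by chance, which matters for three-level scales where one category dominates.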
Related Subject Headings
- Systematic Reviews as Topic
- Natural Language Processing
- Meta-Analysis as Topic
- Medical Informatics
- Large Language Models
- Language
- Humans
- Bias
- 46 Information and computing sciences
- 42 Health sciences