Performance of ChatGPT versus spine surgeons as an emergency department spine call consultant
Background: Large language models (LLMs) like ChatGPT are increasingly recognized as credible tools across diverse healthcare settings. While artificial intelligence (AI) has previously been evaluated in emergency medicine, its use in subspecialty care, particularly spine surgery, remains underexplored. This study evaluates the clinical accuracy, management appropriateness, completeness, helpfulness, and overall quality of ChatGPT responses compared with those of board-certified spine surgeons to common emergency department (ED) consultations.
Methods: A 7-part questionnaire was developed based on common ED spine consultations (eg, cauda equina syndrome, compression fracture in elderly patients, purulent drainage from a surgical wound, acute lumbar disc herniation, incomplete spinal cord injury, epidural abscess, and metastatic spine disease). Each case included 3–4 questions pertaining to examination, diagnosis, management, and counseling. Responses from ChatGPT and 7 board-certified spine surgeons were restricted to 3–4 sentences per question. Three emergency medicine physicians rated each de-identified questionnaire response using a 5-point Likert scale. Statistical analysis was conducted using a 2-sample t-test with unequal variances. Inter-rater reliability was assessed using the pairwise weighted Cohen’s kappa coefficient (κ).
Results: AI responses to the ED consultations were rated superior to spine surgeon responses across all 5 metrics: clinical accuracy, management appropriateness, completeness, helpfulness, and overall quality (p<.05). The average pairwise weighted Cohen’s kappa coefficient showed substantial agreement among raters (κ=0.76).
Conclusions: ChatGPT responses to emergency department spine consultations were rated significantly higher than those of board-certified spine surgeons by emergency medicine physicians. Though further improvement and validation are warranted, these findings suggest that ChatGPT can be a useful clinical adjunct for spine-related emergency department consultations.
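The two statistical procedures named in the Methods, a 2-sample t-test with unequal variances and an average pairwise weighted Cohen's kappa, can be illustrated with a minimal sketch. This is not the authors' analysis code. The function names are hypothetical, the kappa uses linear weights (the abstract does not state which weighting scheme was used), and the ratings at the bottom are invented purely for illustration and are not study data.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for a 2-sample t-test with unequal variances (Welch)."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def weighted_kappa(r1, r2, categories=(1, 2, 3, 4, 5)):
    """Linearly weighted Cohen's kappa for two raters scoring the same items.

    Assumption: linear disagreement weights |i - j| / (k - 1).
    """
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(r1)
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(r1, r2):
        observed[idx[x]][idx[y]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weight = [[abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    obs_dis = sum(weight[i][j] * observed[i][j] for i in range(k) for j in range(k))
    exp_dis = sum(weight[i][j] * row[i] * col[j] for i in range(k) for j in range(k))
    if exp_dis == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1 - obs_dis / exp_dis


def mean_pairwise_kappa(raters):
    """Average weighted kappa over every pair of raters."""
    return mean(weighted_kappa(a, b) for a, b in combinations(raters, 2))


# Invented 5-point Likert ratings, for illustration only (not study data).
ai_scores = [5, 5, 4, 5, 4]
surgeon_scores = [3, 4, 3, 4, 3]
t_stat, dof = welch_t(ai_scores, surgeon_scores)

rater_scores = [
    [5, 4, 3, 4, 2, 5],  # hypothetical rater 1
    [5, 4, 3, 3, 2, 5],  # hypothetical rater 2
    [4, 4, 3, 4, 2, 5],  # hypothetical rater 3
]
avg_kappa = mean_pairwise_kappa(rater_scores)
```

A p-value would then be read from the t distribution with `dof` degrees of freedom. A statistics package would normally handle that step, and one may also offer quadratic weighting for kappa.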