SBDH-Reader: an LLM-powered method for extracting social and behavioral determinants of health from clinical notes.
OBJECTIVE: Social and behavioral determinants of health (SBDH) are increasingly recognized as essential for prognostication and for informing targeted interventions. Clinical notes often contain details about SBDH in unstructured form. Conventional methods for extracting these data tend to be labor intensive, inaccurate, and/or unscalable. In this study, we aim to develop and validate a large language model (LLM)-powered method that extracts structured SBDH data from clinical notes through prompt engineering. MATERIALS AND METHODS: We developed SBDH-Reader, which prompts GPT-4o to extract granular data in six SBDH categories: employment, housing, marital status, and three types of substance use (alcohol, tobacco, and drug use). SBDH-Reader was developed using 7,225 notes from 6,382 patients in the MIMIC-III database (2001-2012) and externally validated using 971 notes from 437 patients at The University of Texas Southwestern Medical Center (UTSW; 2022-2023). We evaluated SBDH-Reader's performance against human-annotated ground truth using precision, recall, F1, and confusion matrices. RESULTS: On the UTSW validation set, SBDH-Reader achieved macro-average F1 scores ranging from 0.94 to 0.98 across the six SBDH categories. For clinically relevant adverse attributes, F1 ranged from 0.96 (employment; housing) to 0.99 (tobacco use). When extracting any adverse attribute across all SBDH categories, SBDH-Reader achieved an F1 of 0.97, recall of 0.97, and precision of 0.98 in the independent validation set. CONCLUSION: A general-purpose LLM can accurately extract structured SBDH data through effective prompt engineering. SBDH-Reader has the potential to serve as a scalable and effective method for collecting real-time, patient-level SBDH data to support clinical research and care.
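
The evaluation metrics named above (per-class precision, recall, and F1, plus the macro-average F1 within a category) can be illustrated with a minimal sketch. This is not the authors' evaluation code; the function names `per_class_prf` and `macro_f1` and the attribute labels `adverse`, `non-adverse`, and `unknown` are hypothetical placeholders, since the abstract does not list the actual label set.

```python
from collections import Counter


def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong for this note
            fn[truth] += 1  # true label was missed for this note
    scores = {}
    for lab in labels:
        predicted = tp[lab] + fp[lab]
        actual = tp[lab] + fn[lab]
        prec = tp[lab] / predicted if predicted else 0.0
        rec = tp[lab] / actual if actual else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[lab] = (prec, rec, f1)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (the macro-average)."""
    scores = per_class_prf(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Hypothetical annotations for one SBDH category (e.g., housing).
human = ["adverse", "adverse", "non-adverse", "unknown", "non-adverse", "adverse"]
model = ["adverse", "non-adverse", "non-adverse", "unknown", "non-adverse", "adverse"]

for label, (p, r, f) in per_class_prf(human, model).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"macro-F1={macro_f1(human, model):.2f}")
```

Because the macro-average weights every attribute equally regardless of how often it appears, a rare attribute (often the clinically relevant adverse one) influences the score as much as a common one.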