SBDH-Reader: a large language model-powered method for extracting social and behavioral determinants of health from clinical notes.
OBJECTIVE: Social and behavioral determinants of health (SBDH) are increasingly recognized as essential for prognostication and for informing targeted interventions. Clinical notes often contain details about SBDH in unstructured format. Conventional methods for extracting these data tend to be labor intensive, inaccurate, and/or unscalable. In this study, we aim to develop and validate a large language model (LLM)-powered method to extract structured SBDH data from clinical notes through prompt engineering.
MATERIALS AND METHODS: We developed SBDH-Reader to extract 6 categories of granular SBDH data by prompting GPT-4o: employment, housing, marital status, and 3 categories of substance use (alcohol, tobacco, and drug use). SBDH-Reader was developed using 7225 notes from 6382 patients in the MIMIC-III database (2001-2012) and externally validated using 971 notes from 437 patients at The University of Texas Southwestern Medical Center (UTSW; 2022-2023). We evaluated SBDH-Reader's performance against human-annotated ground truth using precision, recall, F1, and confusion matrices.
RESULTS: When tested on the UTSW validation set, SBDH-Reader achieved a macro-average F1 ranging from 0.94 to 0.98 across the 6 SBDH categories. For clinically relevant adverse attributes, F1 ranged from 0.96 (employment; housing) to 0.99 (tobacco use). When extracting any adverse attributes across all SBDH categories, SBDH-Reader achieved an F1 of 0.97, recall of 0.97, and precision of 0.98 in the independent validation set.
DISCUSSION: SBDH-Reader demonstrated strong performance in extracting structured SBDH data through effective prompt engineering of a general-purpose LLM, without the need for task-specific fine-tuning. Its modular design and adaptability to diverse datasets and documentation patterns support its applicability in real-world clinical settings.
CONCLUSION: SBDH-Reader has the potential to serve as a scalable and effective method for collecting real-time, patient-level SBDH data to support clinical research and care.
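The evaluation metrics named in the abstract (precision, recall, F1, and macro-average F1) can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function names `per_class_prf` and `macro_f1` and the attribute labels "adverse", "stable", and "unknown" are assumptions introduced here for illustration; the abstract does not list the actual attribute values used within each SBDH category.

```python
from collections import Counter


def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that appears.

    y_true holds human-annotated labels and y_pred holds extracted labels
    for the same notes, in the same order.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong for this note
            fn[truth] += 1  # true label was missed for this note

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = per_class_prf(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Hypothetical labels for one SBDH category (illustrative values only).
y_true = ["adverse", "stable", "stable", "unknown", "adverse", "stable"]
y_pred = ["adverse", "stable", "adverse", "unknown", "adverse", "stable"]

for label, (p, r, f1) in per_class_prf(y_true, y_pred).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro-F1={macro_f1(y_true, y_pred):.2f}")
```

Macro-averaging weights every attribute equally regardless of how often it occurs, which matters when adverse attributes are much rarer in the notes than non-adverse or undocumented ones.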
Duke Scholars
Published In
DOI
EISSN
Publication Date
Volume
Issue
Start / End Page
Location
Related Subject Headings
- Social Determinants of Health
- Natural Language Processing
- Medical Informatics
- Large Language Models
- Information Storage and Retrieval
- Humans
- Electronic Health Records
- 46 Information and computing sciences
- 42 Health sciences
- 32 Biomedical and clinical sciences