Scholars@Duke publication: Can Contemporary Large Language Models Provide the Domain Knowledge Needed for Causal Inference? Evaluating Automated Causal Graph Discovery Through an ASCVD Case Study.

Publication, Journal Article

Aziz, M; Brookhart, MA

Published in: Clinical epidemiology

January 2025

Directed acyclic graphs (DAGs) are critical in epidemiology and public health research for guiding study design and minimizing bias. Yet developing DAGs for causal inference requires substantial domain knowledge. Given the vast amounts of training data behind large language models (LLMs), this study assesses how effectively prompt engineering can lead LLMs, specifically OpenAI's GPT-4o and GPT-o1, to generate DAGs that depict causal relationships in population health.

We consider a hypothetical study of statins versus no treatment for prevention of cardiovascular disease in a general adult population. We assessed four prompt engineering strategies: zero-shot, one-shot, instruction-based, and chain-of-thought (CoT) prompts. Generated DAGs were assessed on consistency, acyclicity, accuracy of sources, completeness (based on ASCVD risk score criteria), and adherence to the prompt.

We found that all generated DAGs were acyclic, except for one run using the instruction-based prompt. Additionally, more than half of the DAGs included six of the seven ASCVD criteria, though race was absent from all. Overall, CoT produced the most complete DAGs, and one-shot provided the most consistency across runs and the best adherence to the task in the prompt. The zero-shot prompt performed notably better on GPT-o1 than on GPT-4o, consistently providing justifications and sources for variable inclusion.

While the findings suggest that LLMs have a baseline capacity to generate DAGs that adhere to basic epidemiological conventions, we also found several limitations, including lack of justification, systematic omission of race, and frequent source hallucination, highlighting the need for human oversight and expertise. We conclude that contemporary LLMs cannot replace a domain expert's judgment but may serve as a brainstorming or pre-analysis tool for DAG development when guided by well-engineered prompts.
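
Two of the assessment criteria named in the abstract, acyclicity and completeness, lend themselves to an automated check. The following is a minimal sketch, not the authors' evaluation code. It assumes an LLM-generated DAG has already been parsed into a list of (cause, effect) pairs. The function names, the node labels, and in particular the seven entries in `ASCVD_CRITERIA` are illustrative assumptions: the abstract states that seven criteria were used and that race was one of them, but it does not list the others.

```python
from collections import defaultdict, deque

# ASSUMPTION: placeholder list of seven criteria. The abstract does not
# enumerate them; only "race" is confirmed by the source text.
ASCVD_CRITERIA = {
    "age", "sex", "race", "cholesterol",
    "blood_pressure", "diabetes", "smoking",
}


def is_acyclic(edges):
    """Return True if the directed graph has no cycle (Kahn's algorithm)."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for cause, effect in edges:
        successors[cause].append(effect)
        indegree[effect] += 1
        nodes.update((cause, effect))

    # Repeatedly remove nodes with no remaining incoming edges.
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    # Any node left over sits on, or downstream of, a cycle.
    return removed == len(nodes)


def completeness(edges, criteria=ASCVD_CRITERIA):
    """Return (number of criteria present as nodes, sorted missing criteria)."""
    nodes = {n for edge in edges for n in edge}
    return len(criteria & nodes), sorted(criteria - nodes)


# Hypothetical edge list for the statin example; not output from the study.
example_edges = [
    ("age", "statin_use"), ("age", "cvd"),
    ("sex", "cvd"),
    ("smoking", "cvd"),
    ("diabetes", "statin_use"), ("diabetes", "cvd"),
    ("cholesterol", "statin_use"), ("cholesterol", "cvd"),
    ("blood_pressure", "cvd"),
    ("statin_use", "cvd"),
]

print(is_acyclic(example_edges))    # True
print(completeness(example_edges))  # (6, ['race'])
```

The hypothetical edge list is built to mirror the pattern the abstract reports, with six of seven criteria present and race missing. The other criteria (consistency across runs, accuracy of sources, adherence to the prompt) depend on human judgment or on comparison across runs and are not covered by this sketch.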