Multi-disease Classification of CT Reports using Traditional Natural Language Processing and a Lightweight Foundation Model
Natural language processing (NLP) methods can annotate free-text radiology reports to create large datasets at the scale of an entire health system or beyond. Generalizing disease classification across multiple organ systems requires a robust and accurate classification model. Concurrently, NLP methods have improved significantly and become more sophisticated. This study compares two traditional NLP methods, a rule-based algorithm (RBA) and a Bidirectional Long Short-Term Memory network (BiLSTM), with a lightweight variant of the Large Language Model Meta AI (Llama) model. Our goal is to analyze the capabilities and limitations of each model in accurately classifying diseases encountered in chest, abdominal, and pelvic computed tomography (CT) exams. RBAs were used to extract disease labels from the “findings” section of CT radiology reports, creating the training, validation, and testing datasets. Disease labels were created for three organ systems: the lungs/pleura, liver/gallbladder, and kidneys/ureters. A BiLSTM network with an attention mechanism was trained on 151,431 cases and tested on 85,987 cases. The BiLSTM and Meta's Llama3.1-8B model were evaluated on the RBA-labeled test set and on a manually annotated dataset. On the smaller, manually labeled test set, the RBA achieved the highest macro F1 score (0.94), followed by the BiLSTM (0.91) and then Llama (0.89). In contrast, on the larger RBA-labeled test set, the BiLSTM maintained high performance (average AUC > 0.98; macro F1 = 0.95), while Llama's macro F1 dropped to 0.65. Manual spot checking of reports where Llama disagreed with the RBA and BiLSTM revealed numerous instances in which Llama was actually correct, indicating flaws in the RBA labeling. This study emphasizes the limitations of rule-based approaches and the need to consider clinical context in ambiguous scenarios.
Llama3.1-8B exhibits the potential to outperform rule-based methods, indicating promise for reliable, large-scale multi-disease classification in CT text reports.
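Because the results above are reported as macro F1 scores, a minimal sketch of that metric for multi-label report classification may help clarify how the numbers are read. This is an illustration only, not the study's evaluation code: the function names, the set-per-report data layout, and the placeholder labels "a" and "b" are all assumptions. Macro F1 is taken here as the unweighted mean of the per-disease binary F1 scores, so rare and common diseases count equally.

```python
def binary_f1(y_true, y_pred, label):
    """F1 for one disease label; y_true and y_pred are lists of label sets, one per report."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0.0 when a label never appears in truth or prediction.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1 scores."""
    return sum(binary_f1(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy data with placeholder labels (hypothetical, not from the study).
y_true = [{"a", "b"}, {"a"}, set(), {"b"}]
y_pred = [{"a"}, {"a"}, {"b"}, {"b"}]
score = macro_f1(y_true, y_pred, ["a", "b"])
print(score)  # label "a": F1 = 1.0, label "b": F1 = 0.5, macro F1 = 0.75
```

Note that the score depends on which labels are treated as ground truth: the same predictions can score lower against noisy reference labels than against manually verified ones.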