How Foundational Is the Retina Foundation Model? Estimating RETFound's Label Efficiency on Binary Classification of Normal versus Abnormal OCT Images.
OBJECTIVE: While the availability of public internet-scale image and language datasets has catalyzed remarkable progress in machine learning, medical datasets are constrained by regulations protecting patient privacy and by the time and cost required for curation and labeling. Self-supervised pretraining has demonstrated great success in learning meaningful representations from large unlabeled datasets, enabling efficient learning on downstream tasks. In ophthalmology, RETFound, a large vision transformer (ViT-L) trained by masked autoencoding on 1.6 million color fundus photographs and OCT B-scans, is the first model pretrained at such scale for the field, demonstrating strong performance on downstream tasks ranging from diabetic retinopathy grading to stroke detection. Here, we measure the label efficiency of RETFound in learning to identify normal versus abnormal OCT B-scans obtained as part of a pilot study of primary care-based diabetic retinopathy screening in North Carolina.

DESIGN: The 1150 TopCon Maestro OCT central B-scans (981 normal and 169 abnormal) were randomly split 80/10/10 into training, validation, and test datasets. Model training and hyperparameter tuning were performed on the training set, guided by validation set performance. The best-performing models were then evaluated on the held-out test set.

SUBJECTS: Six hundred forty-seven patients with diabetes in the Duke Health System participating in primary care diabetic retinopathy screening contributed 1150 TopCon Maestro OCT central B-scans.

METHODS: Three models (ResNet-50, ViT-L, and RETFound) were fine-tuned on the full training dataset of 915 OCT B-scans and on smaller training subsets of 500, 250, 100, and 50 OCT B-scans, each across 3 random seeds.

MAIN OUTCOME MEASURES: Mean accuracy, area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), F1 score, precision, and recall on the final held-out test set were reported for each model.

RESULTS: Across 3 random seeds and all training dataset sizes, RETFound outperformed both ResNet-50 and ViT-L on all evaluation metrics on the final held-out test set. ViT-L and ResNet-50 performed comparably at the largest training dataset sizes of 915 and 500 OCT B-scans; however, ResNet-50 suffered more pronounced performance degradation at the smallest dataset sizes of 100 and 50 OCT B-scans.

CONCLUSIONS: Our findings validate the benefits of RETFound's additional retina-specific pretraining. Further research is needed to establish best practices for fine-tuning RETFound on downstream tasks.

FINANCIAL DISCLOSURES: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
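The following minimal Python sketch illustrates the label-efficiency protocol described in the DESIGN, METHODS, and MAIN OUTCOME MEASURES sections: an 80/10/10 split, nested training subsets of decreasing size, 3 random seeds per subset, and the reported metrics computed with scikit-learn. It is not the authors' code; the stratified split, the subset-sampling scheme, and the `fine_tune_and_predict` function are assumptions for illustration, and the placeholder returns random scores only so the sketch runs end to end in place of actual fine-tuning of ResNet-50, ViT-L, or RETFound.

```python
# Sketch of the label-efficiency evaluation protocol (assumed details, not the authors' code).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, roc_auc_score, average_precision_score,
                             f1_score, precision_score, recall_score)


def fine_tune_and_predict(X_train, y_train, X_test, seed):
    """Placeholder for fine-tuning a model (ResNet-50 / ViT-L / RETFound).

    Returns predicted probabilities of the abnormal class for X_test. Here it
    returns random scores so the sketch is runnable without model weights.
    """
    rng = np.random.default_rng(seed)
    return rng.random(len(X_test))


# Toy stand-in for the 1150 central B-scans: 981 normal (0) and 169 abnormal (1).
X = np.arange(1150).reshape(-1, 1)
y = np.array([0] * 981 + [1] * 169)

# 80/10/10 split; stratification by label is an assumption, not stated in the abstract.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)

# Full training set plus smaller subsets, each evaluated across 3 random seeds.
for n in (len(X_train), 500, 250, 100, 50):
    for seed in (0, 1, 2):
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X_train), size=n, replace=False)
        probs = fine_tune_and_predict(X_train[idx], y_train[idx], X_test, seed)
        preds = (probs >= 0.5).astype(int)
        print(f"n={n} seed={seed} "
              f"acc={accuracy_score(y_test, preds):.3f} "
              f"auroc={roc_auc_score(y_test, probs):.3f} "
              f"auprc={average_precision_score(y_test, probs):.3f} "
              f"f1={f1_score(y_test, preds):.3f} "
              f"precision={precision_score(y_test, preds, zero_division=0):.3f} "
              f"recall={recall_score(y_test, preds):.3f}")
```

In practice, metrics would be averaged over the 3 seeds for each model and training-set size before comparison, matching the mean values reported in the abstract.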