Cognitive Distraction Detection Using Gaze and Pupil with an Interpretable Approach
Cognitive distraction (CD) is a major cause of traffic accidents, yet there remains substantial room to improve its detection. Most prior research on CD detection has relied on basic statistical measures (e.g., mean, standard deviation) of driver-facing camera signals such as gaze and pupil size. However, these signals often exhibit subtle and complex patterns that such conventional approaches cannot fully capture. In this paper, we evaluate a wide range of machine learning models and feature extraction methods using data from 52 participants in a driving simulator performing two cognitive-distraction-inducing tasks (n-back and statement tasks). Our results demonstrate that combining gaze and pupil signals with derived physiological features (e.g., fixation-saccade ratio and gaze entropy), together with comprehensive time-series feature extraction, boosts detection performance. While deep neural networks (Transformers) excel at modeling intricate relationships, tree-based ensemble methods (e.g., CatBoost) achieve comparable or higher detection performance while retaining better interpretability. Cross-task experiments further show that models trained on one type of task can generalize to the other. Feature analyses (via SHAP and Sobol indices) reveal that nonlinearity in vertical gaze movements, baseline pupil size, and greater minimum gaze distance are related to CD. These findings suggest that integrating multiple modalities, applying sophisticated feature engineering, and employing models capable of capturing nonlinear interactions are effective strategies for detecting CD. To support future research in this field, we release our code and preprocessed data: https://toyotaresearchinstitute.github.io/IV25-cognitive-distraction/.
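To make the two derived features named above concrete, the following is a minimal Python sketch of how stationary gaze entropy and a fixation-saccade ratio are commonly computed. It is not the paper's released pipeline: the grid resolution, sampling rate, and velocity threshold are illustrative assumptions, and the gaze coordinates are assumed to be normalized (for entropy) or in degrees of visual angle (for the I-VT split).

```python
import numpy as np

def stationary_gaze_entropy(x, y, bins=8):
    """Shannon entropy of gaze positions binned on a spatial grid.

    x, y: 1-D arrays of gaze coordinates, assumed normalized to [0, 1].
    Higher entropy means gaze is spread more evenly across the scene.
    """
    hist, _, _ = np.histogram2d(x, y, bins=bins, range=[[0, 1], [0, 1]])
    p = hist.ravel() / hist.sum()
    p = p[p > 0]                                  # drop empty cells to avoid log(0)
    return float(-(p * np.log2(p)).sum())

def fixation_saccade_ratio(x, y, fs=60.0, vel_thresh=30.0):
    """Ratio of fixation to saccade samples via velocity thresholding (I-VT).

    x, y: gaze coordinates in degrees of visual angle; fs: sampling rate in Hz;
    vel_thresh: illustrative deg/s boundary between fixations and saccades.
    """
    vel = np.hypot(np.diff(x), np.diff(y)) * fs   # point-to-point speed, deg/s
    n_fix = int((vel < vel_thresh).sum())
    n_sac = int((vel >= vel_thresh).sum())
    return n_fix / max(n_sac, 1)                  # guard against zero saccades
```

Both functions operate on one analysis window at a time, so they slot naturally into a sliding-window time-series feature extractor.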
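The interpretability claim rests on SHAP attributions for the tree ensemble. Below is a hedged sketch of how SHAP values can be obtained from a CatBoost classifier using CatBoost's built-in exact computation for its own trees; the feature names, synthetic data, and hyperparameters are placeholders, not the paper's configuration.

```python
import numpy as np
from catboost import CatBoostClassifier, Pool

# Hypothetical window-level features and labels (illustrative only).
rng = np.random.default_rng(0)
X = rng.random((500, 4))
y = (X[:, 1] + 0.3 * rng.random(500) > 0.8).astype(int)
feature_names = ["gaze_entropy", "fix_sac_ratio", "pupil_baseline", "gaze_min_dist"]

model = CatBoostClassifier(iterations=200, depth=4, verbose=False)
model.fit(X, y)

# CatBoost returns an array of shape (n_samples, n_features + 1);
# the last column is the expected value (the SHAP base value).
shap_vals = model.get_feature_importance(Pool(X, y), type="ShapValues")
contributions = shap_vals[:, :-1]

# Rank features globally by mean absolute SHAP value.
importance = np.abs(contributions).mean(axis=0)
for name, score in sorted(zip(feature_names, importance), key=lambda t: -t[1]):
    print(f"{name:>15s}  {score:.4f}")
```

The per-sample sign and magnitude of each contribution support the kind of directional statements made above (e.g., that larger baseline pupil size pushes predictions toward the distracted class).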