Assessing the Expressive Language Levels of Autistic Children in Home Intervention
The World Health Organization (WHO) has established the caregiver skill training (CST) program, designed to equip families of children diagnosed with autism spectrum disorder with essential caregiving skills. The joint engagement rating inventory (JERI) protocol evaluates participants' engagement levels within the CST initiative. Traditionally, rating the expressive language level and use (EXLA) item in JERI relies on retrospective video analysis conducted by qualified professionals, incurring substantial labor costs. This study introduces a multimodal behavioral signal-processing framework that automatically analyzes both child and caregiver behaviors to rate the EXLA item. First, raw audio and video signals are segmented into short intervals via voice activity detection, speaker diarization and speaker age classification, serving the dual purpose of eliminating nonspeech content and tagging each segment with its respective speaker. Next, we extract an array of audio-visual features, encompassing our proposed interpretable, hand-crafted textual features, end-to-end audio embeddings and end-to-end video embeddings. Finally, these features are fused at the feature level to train a linear regression model that predicts the EXLA scores. Our framework has been evaluated on the largest in-the-wild database currently available under the CST program. Experimental results indicate that the proposed system achieves a Pearson correlation coefficient of 0.768 against the expert ratings, evidencing promising performance comparable to that of human experts.
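The final stage of the pipeline above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the feature dimensions, the synthetic data, and the plain least-squares fit are all assumptions standing in for the real extracted features and trained model. It shows feature-level fusion (concatenating the three feature streams per session) followed by a linear regression, evaluated with the Pearson correlation used in the abstract.

```python
import numpy as np

# Synthetic stand-ins for the three feature streams; all dimensions
# and values here are illustrative assumptions, not the paper's data.
rng = np.random.default_rng(0)
n_sessions = 40
text_feats  = rng.normal(size=(n_sessions, 8))   # hand-crafted textual features
audio_embed = rng.normal(size=(n_sessions, 16))  # end-to-end audio embeddings
video_embed = rng.normal(size=(n_sessions, 16))  # end-to-end video embeddings

# Feature-level fusion: concatenate the streams into one vector per session.
X = np.concatenate([text_feats, audio_embed, video_embed], axis=1)
y = rng.uniform(1.0, 7.0, size=n_sessions)  # synthetic EXLA-style ratings

# Linear regression via least squares (bias term appended as a ones column).
X_aug = np.concatenate([X, np.ones((n_sessions, 1))], axis=1)
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
pred = X_aug @ w

# Pearson correlation between predictions and ratings -- the
# evaluation metric reported in the abstract.
r = float(np.corrcoef(pred, y)[0, 1])
print(X.shape, round(r, 3))
```

In the actual system, `X` would hold the features extracted from the speaker-tagged segments, and evaluation would of course use held-out sessions rather than the training data shown here.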