Role-aware Speaker Diarization in Autism Interview Scenarios
Speaker diarization technology plays a pivotal role in intelligent speech transcription: its core task is to segment and cluster multi-speaker audio by speaker identity, thereby enabling better organization of audio content and transcribed text. In medical interview scenarios, speaker diarization serves as a prerequisite for subsequent automated assessment. Role information is naturally present in medical interactive dialogue. Taking autism as an example, a typical session involves three well-defined roles: doctor, parent, and the child undergoing diagnosis. In practice, however, the correspondence between roles and speakers is not always one-to-one; during autism diagnosis, each conversation involves only one child, while the number of doctors or parents may vary. We believe that the role information and the speaker information embedded in each speech segment can effectively complement each other, thereby reducing the diarization error rate. In this study, we propose a method that integrates role information into the sequence-to-sequence target-speaker voice activity detection (Seq2Seq-TSVAD) framework, achieving a diarization error rate (DER) of 20.61% on the CPEP-3 dataset. This error rate is 9.8% lower than that of the Seq2Seq-TSVAD baseline and 19.3% lower than that of the conventional modular speaker diarization method, underscoring the significant effect of role information in enhancing speaker diarization performance in autism interview scenarios.
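If the reported improvements are read as relative DER reductions, the baseline error rates can be back-computed from the proposed system's DER. A minimal sketch (the baseline values below are inferred from the reported percentages, not stated in the source):

```python
# Relative DER reduction: reduction = (baseline - proposed) / baseline,
# so baseline = proposed / (1 - reduction).
proposed_der = 20.61  # reported DER of the role-aware Seq2Seq-TSVAD system (%)

def implied_baseline(proposed: float, relative_reduction: float) -> float:
    """Back-compute a baseline DER from a relative reduction."""
    return proposed / (1.0 - relative_reduction)

seq2seq_baseline = implied_baseline(proposed_der, 0.098)  # Seq2Seq-TSVAD baseline
modular_baseline = implied_baseline(proposed_der, 0.193)  # modular pipeline

print(round(seq2seq_baseline, 2))  # ≈ 22.85
print(round(modular_baseline, 2))  # ≈ 25.54
```

Under this reading, role information recovers roughly 2.2 and 4.9 absolute DER points over the two baselines, respectively.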