JOINT INFERENCE OF SPEAKER DIARIZATION AND ASR WITH MULTI-STAGE INFORMATION SHARING
In this paper, we introduce a novel approach that unifies Automatic Speech Recognition (ASR) and speaker diarization in a single framework. Exploiting the synergy between the two tasks, our method extracts speaker-specific information from the lower layers of a pretrained Conformer-based ASR model while leveraging its higher layers to improve diarization. In particular, integrating ASR contextual features into the diarization process proves effective. On the DIHARD III dataset, our approach achieves a Diarization Error Rate (DER) of 10.52%, which is further reduced to 10.39% when ASR features are integrated into the diarization model, competitive with other state-of-the-art systems. In addition, our framework simultaneously generates a text transcript for each speaker, a distinct advantage that can further enhance ASR and supports the transition toward an end-to-end multitask framework encompassing both ASR and speaker diarization.
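The information flow described above — speaker features from lower encoder layers, ASR context from higher layers, both fed to a diarization head — can be sketched as follows. This is a minimal illustrative NumPy mock-up, not the paper's implementation: the layer split (first 4 vs. last 4), the feature dimension, and the linear diarization head are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, L = 50, 8, 12          # frames, feature dim, Conformer encoder layers (assumed sizes)
num_speakers = 2

# Stand-in for the per-layer outputs of a pretrained Conformer ASR encoder.
layer_outputs = [rng.standard_normal((T, D)) for _ in range(L)]

# Lower layers: averaged as speaker-specific features (assumed split).
speaker_feats = np.mean(layer_outputs[:4], axis=0)         # (T, D)

# Higher layers: averaged as ASR contextual features (assumed split).
asr_context = np.mean(layer_outputs[-4:], axis=0)          # (T, D)

# Multi-stage information sharing: concatenate both views as diarization input.
diar_input = np.concatenate([speaker_feats, asr_context], axis=-1)  # (T, 2*D)

# Hypothetical linear diarization head -> per-frame speaker activity probabilities.
W = rng.standard_normal((2 * D, num_speakers))
activities = 1.0 / (1.0 + np.exp(-diar_input @ W))         # (T, num_speakers), values in (0, 1)
```

In a real system the head would be trained (e.g., an end-to-end neural diarization branch), but the sketch shows the key design choice: diarization consumes both lower-layer speaker cues and higher-layer ASR context rather than a single encoder tap.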