CROSS-CHANNEL ATTENTION-BASED TARGET SPEAKER VOICE ACTIVITY DETECTION: EXPERIMENTAL RESULTS FOR THE M2MET CHALLENGE
In this paper, we present the speaker diarization system developed by the DKU-DukeECE team for the Multi-channel Multi-party Meeting Transcription (M2MeT) challenge. Since the dataset contains a large amount of overlapped speech, we employ an x-vector-based target-speaker voice activity detection (TS-VAD) model to detect overlapping speakers. First, we train a separate single-channel model for each of the 8 channels and fuse the results. We then employ cross-channel self-attention to further improve performance, learning and fusing the non-linear spatial correlations between the different channels. Experimental results on the evaluation set show that the single-channel TS-VAD reduces the DER by over 75%, from 12.68% to 3.14%. The multi-channel TS-VAD further reduces the DER by 28%, achieving a DER of 2.26%. Our final submitted system achieves a DER of 2.98% on the AliMeeting test set, ranking 1st in the M2MeT challenge, in which our team is denoted as A41.
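To make the cross-channel self-attention idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the class and parameter names (CrossChannelAttention, feat_dim, num_heads) and all tensor shapes are assumptions for illustration. Per-frame features from the 8 microphone channels attend to one another along the channel axis, so non-linear spatial correlations between channels can be learned before the channels are fused.

    # Minimal sketch of cross-channel self-attention (assumed design,
    # not the authors' released code). At each time frame, the channel
    # axis serves as the attention "sequence", so every channel can
    # attend to all others before fusion.
    import torch
    import torch.nn as nn

    class CrossChannelAttention(nn.Module):
        def __init__(self, feat_dim: int = 256, num_heads: int = 4):
            super().__init__()
            # Multi-head self-attention applied over the channel axis.
            self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(feat_dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, time, feat_dim)
            b, c, t, d = x.shape
            # Treat each time frame independently; channels become the
            # attention sequence of length c.
            x = x.permute(0, 2, 1, 3).reshape(b * t, c, d)   # (b*t, c, d)
            attended, _ = self.attn(x, x, x)
            x = self.norm(x + attended)                      # residual + norm
            # Fuse channels by averaging after attention.
            fused = x.mean(dim=1)                            # (b*t, d)
            return fused.reshape(b, t, d)                    # (batch, time, feat_dim)

    # Example: 8-channel features for a batch of 2 segments, 100 frames each.
    features = torch.randn(2, 8, 100, 256)
    fused = CrossChannelAttention()(features)
    print(fused.shape)  # torch.Size([2, 100, 256])

In this sketch the attention weights express how strongly each microphone's evidence contributes at every frame; averaging the attended channel features is one simple fusion choice, and the actual system may fuse differently.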