INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION
In this paper, we propose an end-to-end target-speaker voice activity detection (E2E-TS-VAD) method for speaker diarization. First, a ResNet-based network extracts frame-level speaker embeddings from the acoustic features. Then, the L2-normalized frame-level speaker embeddings are fed to a transformer encoder, which produces the initial speaker diarization results. Next, the frame-level speaker embeddings are aggregated into several target-speaker embeddings based on the output of the transformer encoder. Finally, a BiLSTM-based TS-VAD model predicts the refined diarization results. Several aggregation methods are explored, including soft and hard decisions, each with and without normalization. Results show that E2E-TS-VAD achieves better performance than the original TS-VAD method with clustering-based initialization.
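As an illustration of the aggregation step, the following minimal sketch shows one way frame-level embeddings could be pooled into target-speaker embeddings from per-frame speaker posteriors. It is not the paper's implementation: the function name `aggregate_embeddings`, the 0.5 threshold for hard decisions, and the reading of "normalization" as division by the sum of the weights are all assumptions made for this example.

```python
from typing import List


def aggregate_embeddings(
    frames: List[List[float]],
    posteriors: List[List[float]],
    hard: bool = False,
    normalize: bool = True,
    threshold: float = 0.5,  # assumed cut-off for hard decisions
) -> List[List[float]]:
    """Pool frame-level embeddings into one embedding per speaker.

    frames:     T x D frame-level speaker embeddings.
    posteriors: T x S per-frame speaker activity probabilities
                (standing in for the transformer encoder output).
    hard:       if True, binarize the posteriors before pooling.
    normalize:  if True, divide by the sum of the weights (assumed
                meaning of "normalization"), giving a weighted mean.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    num_speakers = len(posteriors[0])

    targets = []
    for s in range(num_speakers):
        weights = [posteriors[t][s] for t in range(num_frames)]
        if hard:
            weights = [1.0 if w > threshold else 0.0 for w in weights]

        # Weighted sum of the frame embeddings for speaker s.
        emb = [
            sum(weights[t] * frames[t][d] for t in range(num_frames))
            for d in range(dim)
        ]

        if normalize:
            total = sum(weights)
            if total > 0:  # leave the embedding at zero if the speaker is never active
                emb = [x / total for x in emb]

        targets.append(emb)
    return targets


# Toy input: 3 frames, 2-dimensional embeddings, 2 speakers.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
posteriors = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]

hard_norm = aggregate_embeddings(frames, posteriors, hard=True, normalize=True)
soft_raw = aggregate_embeddings(frames, posteriors, hard=False, normalize=False)
```

In this sketch, the soft variant keeps the confidence of each frame, while the hard variant treats every frame above the threshold equally. The resulting embeddings would then be passed to the TS-VAD model as target-speaker inputs.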