INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION

Publication, Conference
Wang, W; Li, M
Published in: ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings
January 1, 2022

In this paper, we propose an end-to-end target-speaker voice activity detection (E2E-TS-VAD) method for speaker diarization. First, a ResNet-based network extracts frame-level speaker embeddings from the acoustic features. Then, the L2-normalized frame-level speaker embeddings are fed to a transformer encoder, which produces an initial speaker diarization result. Next, the frame-level speaker embeddings are aggregated into several target-speaker embeddings based on the output of the transformer encoder. Finally, a BiLSTM-based TS-VAD model predicts the refined diarization results. Several aggregation methods are explored, including soft and hard decisions, each with and without normalization. Results show that E2E-TS-VAD achieves better performance than the original TS-VAD method with clustering-based initialization.
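
The aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a "soft" decision weights each frame by its predicted speaker-activity probability, that a "hard" decision thresholds that probability to 0 or 1, and that "normalization" means dividing the pooled vector by the total weight; the abstract does not define any of these. The names `l2_normalize`, `aggregate`, `activity`, and `threshold` are hypothetical.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against zero vectors)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def aggregate(frame_embs, activity, hard=False, normalize=True, threshold=0.5):
    """Pool T frame-level embeddings into one embedding per speaker.

    frame_embs: T x D list of frame-level speaker embeddings.
    activity:   T x S list of per-frame speaker-activity probabilities
                (a stand-in for the transformer encoder output; assumed shape).
    hard:       if True, threshold the probabilities to 0/1 weights (assumption).
    normalize:  if True, divide each pooled vector by its total weight (assumption).
    """
    num_speakers = len(activity[0])
    dim = len(frame_embs[0])
    targets = []
    for s in range(num_speakers):
        pooled = [0.0] * dim
        total = 0.0
        for emb, probs in zip(frame_embs, activity):
            w = probs[s]
            if hard:
                w = 1.0 if w > threshold else 0.0
            total += w
            for d in range(dim):
                pooled[d] += w * emb[d]
        # Skip the division when no frame was assigned to this speaker.
        if normalize and total > 0.0:
            pooled = [p / total for p in pooled]
        targets.append(pooled)
    return targets


# Toy input: three 2-dimensional frames and two speakers.
frames = [l2_normalize(v) for v in [[3.0, 4.0], [0.0, 2.0], [1.0, 0.0]]]
activity = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7]]

soft_raw = aggregate(frames, activity, hard=False, normalize=False)
hard_mean = aggregate(frames, activity, hard=True, normalize=True)
```

With hard decisions and normalization, each target-speaker embedding in this sketch is the mean of the frames assigned to that speaker; with soft decisions and no normalization, it is a probability-weighted sum whose magnitude grows with how long the speaker is active.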

Published In

ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings

DOI

10.1109/ICASSP43922.2022.9747772

ISSN

1520-6149

Publication Date

January 1, 2022

Volume

2022-May

Start / End Page

2739 / 2743
 

Citation

APA: Wang, W., & Li, M. (2022). INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION. In ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings (Vol. 2022-May, pp. 2739–2743). https://doi.org/10.1109/ICASSP43922.2022.9747772
Chicago: Wang, W., and M. Li. “INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION.” In ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, 2022-May:2739–43, 2022. https://doi.org/10.1109/ICASSP43922.2022.9747772.
ICMJE: Wang W, Li M. INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION. In: ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings. 2022. p. 2739–43.
MLA: Wang, W., and M. Li. “INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION.” ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, vol. 2022-May, 2022, pp. 2739–43. Scopus, doi:10.1109/ICASSP43922.2022.9747772.
NLM: Wang W, Li M. INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION. ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings. 2022. p. 2739–2743.
