INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION

Publication, Conference
Wang, W; Li, M
Published in: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
January 1, 2022

In this paper, we propose an end-to-end target-speaker voice activity detection (E2E-TS-VAD) method for speaker diarization. First, a ResNet-based network extracts frame-level speaker embeddings from the acoustic features. Then, the L2-normalized frame-level speaker embeddings are fed to a transformer encoder, which produces the initial speaker diarization results. Next, the frame-level speaker embeddings are aggregated into several target-speaker embeddings based on the output of the transformer encoder. Finally, a BiLSTM-based TS-VAD model predicts the refined diarization results. Several aggregation methods are explored, including soft and hard decisions, each with and without normalization. Results show that E2E-TS-VAD outperforms the original TS-VAD method, which relies on clustering-based initialization.
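The aggregation step described in the abstract can be illustrated with a minimal sketch. The abstract does not define the aggregation formulas, so everything here is an assumption made for illustration: the function name `aggregate_target_embeddings`, the 0.5 threshold for hard decisions, and the reading of "normalization" as dividing each weighted sum by its total weight. It is not the authors' implementation.

```python
from typing import List


def aggregate_target_embeddings(
    frames: List[List[float]],
    probs: List[List[float]],
    hard: bool = False,
    normalize: bool = True,
    threshold: float = 0.5,
) -> List[List[float]]:
    """Pool frame-level embeddings into one embedding per speaker.

    frames: T x D frame-level speaker embeddings.
    probs:  T x S per-frame speaker activity probabilities
            (standing in for the transformer encoder output).
    Assumption: soft decisions weight frames by probability, hard
    decisions binarize the probabilities at `threshold`, and
    normalization divides the weighted sum by the total weight.
    """
    num_speakers = len(probs[0])
    dim = len(frames[0])
    embeddings = []
    for s in range(num_speakers):
        weights = [p[s] for p in probs]
        if hard:
            weights = [1.0 if w > threshold else 0.0 for w in weights]
        emb = [0.0] * dim
        for w, frame in zip(weights, frames):
            for d in range(dim):
                emb[d] += w * frame[d]
        total = sum(weights)
        if normalize and total > 0:
            emb = [x / total for x in emb]
        embeddings.append(emb)
    return embeddings


# Toy input: three frames, two-dimensional embeddings, two speakers.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]

hard_norm = aggregate_target_embeddings(frames, probs, hard=True, normalize=True)
hard_sum = aggregate_target_embeddings(frames, probs, hard=True, normalize=False)
soft_norm = aggregate_target_embeddings(frames, probs, hard=False, normalize=True)
```

Under these assumptions, the hard decision discards frames a speaker is unlikely to occupy, while the soft decision lets every frame contribute in proportion to its probability. Normalization keeps the embedding scale independent of how long each speaker talks.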

Published In

ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

DOI

10.1109/ICASSP43922.2022.9747772

ISSN

1520-6149

Publication Date

January 1, 2022

Volume

2022-May

Start / End Page

2739 / 2743
 

Citation

APA
Wang, W., & Li, M. (2022). INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings (Vol. 2022-May, pp. 2739–2743). https://doi.org/10.1109/ICASSP43922.2022.9747772

Chicago
Wang, W., and M. Li. “INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION.” In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2022-May:2739–43, 2022. https://doi.org/10.1109/ICASSP43922.2022.9747772.

ICMJE
Wang W, Li M. INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. 2022. p. 2739–43.

MLA
Wang, W., and M. Li. “INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION.” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2022-May, 2022, pp. 2739–43. Scopus, doi:10.1109/ICASSP43922.2022.9747772.

NLM
Wang W, Li M. INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. 2022. p. 2739–2743.
