Scholars@Duke publication: INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION

Publication, Conference

Wang, W; Li, M

Published in: ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings

January 1, 2022

In this paper, we propose an end-to-end target-speaker voice activity detection (E2E-TS-VAD) method for speaker diarization. First, a ResNet-based network extracts frame-level speaker embeddings from the acoustic features. Then, the L2-normalized frame-level speaker embeddings are fed to a transformer encoder, which produces the initial speaker diarization results. Next, the frame-level speaker embeddings are aggregated into several target-speaker embeddings based on the output of the transformer encoder. Finally, a BiLSTM-based TS-VAD model predicts the refined diarization results. Several aggregation methods are explored, including soft and hard decisions, each with and without normalization. Results show that E2E-TS-VAD achieves better performance than the original TS-VAD method with clustering-based initialization.
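The aggregation step described in the abstract can be illustrated with a small sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration: the function name `aggregate_speaker_embeddings`, the reading of "soft" as weighting each frame by the encoder's per-speaker activity score, "hard" as thresholding that score at 0.5, and "normalization" as dividing by the total weight. It is not the authors' implementation.

```python
from typing import List


def aggregate_speaker_embeddings(
    frame_embeddings: List[List[float]],  # T frames x D dimensions
    activities: List[List[float]],        # T frames x N speakers, scores in [0, 1]
    hard: bool = False,                   # assumed: threshold scores to 0/1 first
    normalize: bool = True,               # assumed: divide by the total weight
    threshold: float = 0.5,               # hypothetical cut-off for hard decisions
) -> List[List[float]]:
    """Pool frame-level embeddings into one embedding per speaker."""
    num_frames = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    num_speakers = len(activities[0])
    targets = [[0.0] * dim for _ in range(num_speakers)]

    for spk in range(num_speakers):
        total_weight = 0.0
        for t in range(num_frames):
            weight = activities[t][spk]
            if hard:
                weight = 1.0 if weight > threshold else 0.0
            total_weight += weight
            for d in range(dim):
                targets[spk][d] += weight * frame_embeddings[t][d]
        # Skip the division when a speaker received no weight at all.
        if normalize and total_weight > 0.0:
            targets[spk] = [value / total_weight for value in targets[spk]]
    return targets


# Toy input: 3 frames, 2-dimensional embeddings, 2 speakers.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
scores = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]

hard_norm = aggregate_speaker_embeddings(frames, scores, hard=True, normalize=True)
hard_raw = aggregate_speaker_embeddings(frames, scores, hard=True, normalize=False)
soft_norm = aggregate_speaker_embeddings(frames, scores, hard=False, normalize=True)
```

Under these assumptions, the hard variants use only the frames assigned to a speaker, while the soft variants let every frame contribute in proportion to its score. In both cases, normalization keeps the scale of the pooled embedding independent of how long each speaker talks.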

Duke Scholars

Author Ming Li DKU Faculty

Published In: ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings

DOI: 10.1109/ICASSP43922.2022.9747772

ISSN: 1520-6149

Publication Date: January 1, 2022

Volume: 2022-May

Start / End Page: 2739 / 2743

Citation

APA: Wang, W., & Li, M. (2022). INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION. In ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings (Vol. 2022-May, pp. 2739–2743). https://doi.org/10.1109/ICASSP43922.2022.9747772

Chicago: Wang, W., and M. Li. “INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION.” In ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, 2022-May:2739–43, 2022. https://doi.org/10.1109/ICASSP43922.2022.9747772.

ICMJE: Wang W, Li M. INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION. In: ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings. 2022. p. 2739–43.

MLA: Wang, W., and M. Li. “INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION.” ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, vol. 2022-May, 2022, pp. 2739–43. Scopus, doi:10.1109/ICASSP43922.2022.9747772.

NLM: Wang W, Li M. INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION. ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings. 2022. p. 2739–2743.
