Online Neural Speaker Diarization With Target Speaker Tracking

Publication, Journal Article

Wang, W; Li, M

Published in: IEEE ACM Transactions on Audio Speech and Language Processing

January 1, 2024

This paper proposes an online target speaker voice activity detection (TS-VAD) system for speaker diarization tasks that does not rely on prior knowledge from clustering-based diarization systems to obtain target speaker embeddings. By adapting conventional TS-VAD for real-time operation, our framework identifies speaker activities using self-generated embeddings, ensuring consistent performance and avoiding permutation inconsistencies during inference. In the inference phase, we employ a front-end model to extract frame-level speaker embeddings for each incoming signal block. Subsequently, we predict each speaker's detection state based on these frame-level embeddings and the previously estimated target speaker embeddings. The target speaker embeddings are then updated by aggregating the frame-level embeddings according to the current block's predictions. Our model predicts results block-by-block and iteratively updates target speaker embeddings until reaching the end of the signal. Experimental results demonstrate that the proposed method outperforms offline clustering-based diarization systems on the DIHARD III and AliMeeting datasets. Additionally, this approach is extended to multi-channel data, achieving comparable performance to state-of-the-art offline diarization systems.
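The block-by-block inference loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the neural front-end and the TS-VAD detector are replaced by hypothetical stand-ins (L2 normalization and a cosine-similarity threshold), aggregation is assumed to be a running sum, and the initial target embeddings are passed in as seeds because the abstract does not say how the first embeddings are obtained. All function names and the `threshold` parameter are assumptions made for illustration.

```python
import numpy as np


def extract_frame_embeddings(block):
    """Hypothetical stand-in for the front-end model.

    Here a block is already a (frames, dim) feature matrix and is only
    L2-normalized; the paper uses a neural front-end instead.
    """
    norms = np.linalg.norm(block, axis=1, keepdims=True)
    return block / np.maximum(norms, 1e-8)


def detect_speakers(frame_emb, target_emb, threshold):
    """Hypothetical stand-in for the TS-VAD detector.

    Marks speaker s active in frame t when the cosine similarity between
    the frame embedding and the target embedding exceeds the threshold.
    """
    scores = frame_emb @ target_emb.T  # shape: (frames, speakers)
    return scores > threshold


def online_diarize(blocks, init_targets, threshold=0.5):
    """Process blocks in order, updating target embeddings after each one."""
    # Running sum of frame embeddings attributed to each speaker
    # (assumed aggregation rule; seeded with init_targets).
    target_sum = np.asarray(init_targets, dtype=float).copy()
    outputs = []
    for block in blocks:
        frame_emb = extract_frame_embeddings(np.asarray(block, dtype=float))
        # Use the previously estimated targets to score the current block.
        norms = np.linalg.norm(target_sum, axis=1, keepdims=True)
        target_emb = target_sum / np.maximum(norms, 1e-8)
        activity = detect_speakers(frame_emb, target_emb, threshold)
        # Aggregate frame embeddings according to this block's predictions.
        target_sum = target_sum + activity.T.astype(float) @ frame_emb
        outputs.append(activity)
    return np.concatenate(outputs, axis=0), target_sum
```

Calling `online_diarize(blocks, init_targets)` with a list of `(frames, dim)` arrays returns a boolean `(total_frames, speakers)` activity matrix together with the accumulated per-speaker embedding sums. Because each block is scored only against embeddings estimated from earlier blocks, the speaker order stays fixed across the whole signal, which is the property the abstract describes as avoiding permutation inconsistencies.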

Duke Scholars

Author Ming Li DKU Faculty

Published In

IEEE ACM Transactions on Audio Speech and Language Processing

DOI

10.1109/TASLP.2024.3507559

EISSN

2329-9304

ISSN

2329-9290

Publication Date

January 1, 2024

Volume

32

Start / End Page

5078 / 5091

Citation

APA

Wang, W., & Li, M. (2024). Online Neural Speaker Diarization With Target Speaker Tracking. IEEE ACM Transactions on Audio Speech and Language Processing, 32, 5078–5091. https://doi.org/10.1109/TASLP.2024.3507559

Chicago

Wang, W., and M. Li. “Online Neural Speaker Diarization With Target Speaker Tracking.” IEEE ACM Transactions on Audio Speech and Language Processing 32 (January 1, 2024): 5078–91. https://doi.org/10.1109/TASLP.2024.3507559.

ICMJE

Wang W, Li M. Online Neural Speaker Diarization With Target Speaker Tracking. IEEE ACM Transactions on Audio Speech and Language Processing. 2024 Jan 1;32:5078–91.

MLA

Wang, W., and M. Li. “Online Neural Speaker Diarization With Target Speaker Tracking.” IEEE ACM Transactions on Audio Speech and Language Processing, vol. 32, Jan. 2024, pp. 5078–91. Scopus, doi:10.1109/TASLP.2024.3507559.

NLM

Wang W, Li M. Online Neural Speaker Diarization With Target Speaker Tracking. IEEE ACM Transactions on Audio Speech and Language Processing. 2024 Jan 1;32:5078–5091.
