Online Neural Speaker Diarization With Target Speaker Tracking

Publication, Journal Article
Wang, W; Li, M
Published in: IEEE ACM Transactions on Audio Speech and Language Processing
January 1, 2024

This paper proposes an online target speaker voice activity detection (TS-VAD) system for speaker diarization tasks that does not rely on prior knowledge from clustering-based diarization systems to obtain target speaker embeddings. By adapting conventional TS-VAD for real-time operation, our framework identifies speaker activities using self-generated embeddings, ensuring consistent performance and avoiding permutation inconsistencies during inference. In the inference phase, we employ a front-end model to extract frame-level speaker embeddings for each incoming signal block. Subsequently, we predict each speaker's detection state based on these frame-level embeddings and the previously estimated target speaker embeddings. The target speaker embeddings are then updated by aggregating the frame-level embeddings according to the current block's predictions. Our model predicts results block-by-block and iteratively updates target speaker embeddings until reaching the end of the signal. Experimental results demonstrate that the proposed method outperforms offline clustering-based diarization systems on the DIHARD III and AliMeeting datasets. Additionally, this approach is extended to multi-channel data, achieving comparable performance to state-of-the-art offline diarization systems.
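The block-by-block inference loop described in the abstract can be illustrated with a toy sketch. This is not the authors' model: the neural front-end is replaced by precomputed frame-level embeddings, the TS-VAD detector is replaced by a cosine-similarity threshold, and the rule that enrolls a new speaker when no target matches is an assumption made only so that the example is self-starting. The names `cosine`, `diarize_online`, and `threshold` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def diarize_online(blocks, threshold=0.7):
    """Toy online diarization over blocks of frame-level embeddings.

    blocks: list of blocks, each a list of frame embeddings (lists of floats).
    Returns, per block and per frame, the indices of the active speakers.
    """
    sums = []    # per-speaker running sum of assigned frame embeddings
    counts = []  # per-speaker number of assigned frames
    results = []
    for block in blocks:
        # Target speaker embeddings estimated from all previous blocks.
        targets = [[v / c for v in s] for s, c in zip(sums, counts)]

        # Step 1: predict each speaker's state for every frame in the block.
        # (Assumption: a similarity threshold stands in for the neural detector.)
        block_active = []
        for frame in block:
            active = [k for k, t in enumerate(targets)
                      if cosine(frame, t) >= threshold]
            if not active:
                # Assumption: an unmatched frame enrolls a new target speaker.
                targets.append(list(frame))
                sums.append([0.0] * len(frame))
                counts.append(0)
                active = [len(targets) - 1]
            block_active.append(active)

        # Step 2: update target embeddings by aggregating the frame-level
        # embeddings according to this block's predictions.
        for frame, active in zip(block, block_active):
            for k in active:
                sums[k] = [s + x for s, x in zip(sums[k], frame)]
                counts[k] += 1

        results.append(block_active)
    return results


# Two blocks of 2-dimensional toy embeddings.
example_blocks = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [1.0, 0.05]],
]
example_result = diarize_online(example_blocks)
```

Because speaker indices are assigned once and the target embeddings persist across blocks, a speaker keeps the same index throughout the recording, which is the property the abstract describes as avoiding permutation inconsistencies. A frame may also match several targets at once, so overlapping speech is representable in this sketch.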

Published In

IEEE ACM Transactions on Audio Speech and Language Processing

DOI

10.1109/TASLP.2024.3507559

EISSN

2329-9304

ISSN

2329-9290

Publication Date

January 1, 2024

Volume

32

Start / End Page

5078 / 5091
 

Citation

APA: Wang, W., & Li, M. (2024). Online Neural Speaker Diarization With Target Speaker Tracking. IEEE ACM Transactions on Audio Speech and Language Processing, 32, 5078–5091. https://doi.org/10.1109/TASLP.2024.3507559

Chicago: Wang, W., and M. Li. “Online Neural Speaker Diarization With Target Speaker Tracking.” IEEE ACM Transactions on Audio Speech and Language Processing 32 (January 1, 2024): 5078–91. https://doi.org/10.1109/TASLP.2024.3507559.

ICMJE: Wang W, Li M. Online Neural Speaker Diarization With Target Speaker Tracking. IEEE ACM Transactions on Audio Speech and Language Processing. 2024 Jan 1;32:5078–91.

MLA: Wang, W., and M. Li. “Online Neural Speaker Diarization With Target Speaker Tracking.” IEEE ACM Transactions on Audio Speech and Language Processing, vol. 32, Jan. 2024, pp. 5078–91. Scopus, doi:10.1109/TASLP.2024.3507559.

NLM: Wang W, Li M. Online Neural Speaker Diarization With Target Speaker Tracking. IEEE ACM Transactions on Audio Speech and Language Processing. 2024 Jan 1;32:5078–5091.
