
Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances

Publication: Journal Article
Zeng, C; Miao, X; Wang, X; Cooper, E; Yamagishi, J
Published in: Computer Speech and Language
June 1, 2024

Conventional automatic speaker verification systems can usually be decomposed into a front-end model, such as a time delay neural network (TDNN), for extracting speaker embeddings, and a back-end model, such as statistics-based probabilistic linear discriminant analysis (PLDA) or its neural-network counterpart, neural PLDA (NPLDA), for similarity scoring. However, optimizing the front-end and back-end models sequentially may lead to a local minimum, which in theory prevents the system as a whole from reaching its best performance. Although several methods have been proposed for jointly optimizing the two models, such as the generalized end-to-end (GE2E) model and the NPLDA E2E model, most of them have not fully investigated how to model the intra-relationships among multiple enrollment utterances. In this paper, we propose a new E2E joint method for speaker verification designed especially for the practical scenario of multiple enrollment utterances. To leverage the intra-relationships among multiple enrollment utterances, our model is equipped with frame-level and utterance-level attention mechanisms. In addition, focal loss is used to balance the importance of positive and negative samples within a mini-batch and to focus on difficult samples during training. We also apply several data augmentation techniques, including conventional noise augmentation using the MUSAN and RIRs datasets and a unique speaker embedding-level mixup strategy for better optimization.
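Two of the training ingredients named in the abstract — focal loss over verification trials and embedding-level mixup — can be illustrated with a minimal sketch. This is not the authors' implementation: the binary-trial formulation, the hyperparameters (`gamma`, `alpha`), and the function names are illustrative assumptions, following the standard focal loss of Lin et al. and a generic Beta-sampled mixup applied to speaker embeddings rather than raw audio.

```python
import numpy as np

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss for binary verification trials (target 1 = same speaker).

    Down-weights easy trials by (1 - p_t)^gamma so training focuses on
    difficult samples, and uses alpha to balance positives vs. negatives.
    """
    p = 1.0 / (1.0 + np.exp(-logits))                # sigmoid score per trial
    p_t = p * targets + (1 - p) * (1 - targets)      # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    ce = -np.log(np.clip(p_t, 1e-12, 1.0))           # per-trial cross-entropy
    return np.mean(alpha_t * (1.0 - p_t) ** gamma * ce)

def embedding_mixup(emb_a, emb_b, alpha=0.2, rng=None):
    """Mix two speaker embeddings with a Beta(alpha, alpha) coefficient.

    The same lambda would also interpolate the corresponding trial labels.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    return lam * emb_a + (1.0 - lam) * emb_b, lam
```

Because the modulating factor `(1 - p_t)^gamma` shrinks toward zero for confidently correct trials, an easy trial (large positive logit with target 1) contributes far less loss than a hard one (negative logit with target 1), which is the balancing behavior the paper relies on.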


Published In: Computer Speech and Language

DOI: 10.1016/j.csl.2024.101619

EISSN: 1095-8363

ISSN: 0885-2308

Publication Date: June 1, 2024

Volume: 86

Related Subject Headings

  • Speech-Language Pathology & Audiology
  • 46 Information and computing sciences
  • 40 Engineering
  • 2004 Linguistics
  • 1702 Cognitive Sciences
  • 0801 Artificial Intelligence and Image Processing

Citation

APA: Zeng, C., Miao, X., Wang, X., Cooper, E., & Yamagishi, J. (2024). Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances. Computer Speech and Language, 86. https://doi.org/10.1016/j.csl.2024.101619

Chicago: Zeng, C., X. Miao, X. Wang, E. Cooper, and J. Yamagishi. “Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances.” Computer Speech and Language 86 (June 1, 2024). https://doi.org/10.1016/j.csl.2024.101619.

ICMJE: Zeng C, Miao X, Wang X, Cooper E, Yamagishi J. Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances. Computer Speech and Language. 2024 Jun 1;86.

MLA: Zeng, C., et al. “Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances.” Computer Speech and Language, vol. 86, June 2024. Scopus, doi:10.1016/j.csl.2024.101619.

NLM: Zeng C, Miao X, Wang X, Cooper E, Yamagishi J. Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances. Computer Speech and Language. 2024 Jun 1;86.