Leveraging ASR Pretrained Conformers for Speaker Verification Through Transfer Learning and Knowledge Distillation
Abstract—This paper focuses on the application of Conformers in speaker verification. Conformers, initially designed for Automatic Speech Recognition (ASR), excel at modeling both local and global contexts within speech signals. Building on this synergy, this study introduces three strategies for leveraging ASR-pretrained Conformers in speaker verification: (1) Transfer learning: We use a pretrained ASR Conformer encoder to initialize the speaker embedding network, thereby enhancing model generalization and mitigating the risk of overfitting. (2) Knowledge distillation: We distill the capabilities of an ASR Conformer into a speaker verification model. This not only allows flexibility in the student model's network architecture but also incorporates a frame-level ASR distillation loss as an auxiliary task to reinforce speaker verification. (3) Parameter-efficient transfer learning with speaker adaptation: A lightweight speaker adaptation module is proposed to convert ASR-derived features into speaker-specific embeddings without altering the core architecture of the original ASR Conformer. This strategy enables ASR and speaker verification to run concurrently within a single model. Experiments were conducted on the VoxCeleb datasets. The best model using the ASR pretraining method achieved a 0.43% equal error rate (EER) on the VoxCeleb1-O test trial, while the knowledge distillation approach yielded a 0.38% EER. Furthermore, by adding only 4.92 million parameters to a 130.94 million-parameter ASR Conformer encoder, the speaker adaptation approach achieved a 0.45% EER, enabling parallel speech recognition and speaker verification within a single ASR Conformer encoder. Overall, our techniques successfully transfer rich ASR knowledge to advanced speaker modeling.
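The knowledge-distillation strategy above combines the primary speaker objective with an auxiliary frame-level ASR distillation term. A minimal sketch of one plausible form of this combined objective follows; the function names, the use of mean squared error between student and teacher frame features, and the weighting scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (not the paper's implementation) of a combined objective:
#   L_total = L_speaker + lambda * L_distill
# where L_distill is assumed here to be a frame-level MSE between the
# student's and the ASR teacher's frame features.

def frame_mse(student_frames, teacher_frames):
    """Mean squared error over aligned frame-level features (assumed distillation loss)."""
    assert len(student_frames) == len(teacher_frames)
    total = sum((s - t) ** 2 for s, t in zip(student_frames, teacher_frames))
    return total / len(student_frames)

def combined_loss(speaker_loss, student_frames, teacher_frames, distill_weight=0.1):
    """Speaker verification loss plus a weighted auxiliary ASR distillation term."""
    return speaker_loss + distill_weight * frame_mse(student_frames, teacher_frames)
```

In practice the frame features would be tensors from the student and frozen teacher encoders, and `distill_weight` would be tuned on a development set; the scalar version here only shows how the auxiliary task enters the training objective.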