Leveraging ASR Pretrained Conformers for Speaker Verification Through Transfer Learning and Knowledge Distillation
Abstract—This paper focuses on the application of Conformers in speaker verification. Conformers, initially designed for Automatic Speech Recognition (ASR), excel at modeling both local and global contexts within speech signals. Building on this synergy, this study introduces three strategies for leveraging ASR-pretrained Conformers in speaker verification: (1) Transfer learning: We use a pretrained ASR Conformer encoder to initialize the speaker embedding network, thereby enhancing model generalization and mitigating the risk of overfitting. (2) Knowledge distillation: We distill the capabilities of an ASR Conformer into a speaker verification model. This not only allows flexibility in the student model's network architecture but also incorporates a frame-level ASR distillation loss as an auxiliary task to reinforce speaker verification. (3) Parameter-efficient transfer learning with speaker adaptation: A lightweight speaker adaptation module is proposed to convert ASR-derived features into speaker-specific embeddings without altering the core architecture of the original ASR Conformer. This strategy enables ASR and speaker verification to run concurrently within a single model. Experiments were conducted on the VoxCeleb datasets. The best model using the ASR pretraining method achieved a 0.43% equal error rate (EER) on the VoxCeleb1-O test trial, while the knowledge distillation approach yielded a 0.38% EER. Furthermore, by adding only 4.92 million parameters to a 130.94 million-parameter ASR Conformer encoder, the speaker adaptation approach achieved a 0.45% EER, enabling parallel speech recognition and speaker verification within a single ASR Conformer encoder. Overall, our techniques successfully transfer rich ASR knowledge to advanced speaker modeling.
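The knowledge-distillation strategy above combines the primary speaker objective with an auxiliary frame-level ASR distillation term. A minimal sketch of one plausible form of this combined objective follows; the function names, the use of mean squared error between student and teacher frame features, and the weighting scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (not the paper's implementation) of a combined objective:
#   L_total = L_speaker + lambda * L_distill
# where L_distill is assumed here to be a frame-level MSE between the
# student's and the ASR teacher's frame features.

def frame_mse(student_frames, teacher_frames):
    """Mean squared error over aligned frame-level features (assumed distillation loss)."""
    assert len(student_frames) == len(teacher_frames)
    total = sum((s - t) ** 2 for s, t in zip(student_frames, teacher_frames))
    return total / len(student_frames)

def combined_loss(speaker_loss, student_frames, teacher_frames, distill_weight=0.1):
    """Speaker verification loss plus a weighted auxiliary ASR distillation term."""
    return speaker_loss + distill_weight * frame_mse(student_frames, teacher_frames)
```

In practice the frame features would be tensors from the student and frozen teacher encoders, and `distill_weight` would be tuned on a development set; the scalar version here only shows how the auxiliary task enters the training objective.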