
Incorporating Visual Information in Audio Based Self-Supervised Speaker Recognition

Publication, Journal Article
Cai, D; Wang, W; Li, M
Published in: IEEE/ACM Transactions on Audio Speech and Language Processing
January 1, 2022

The current success of deep learning largely benefits from the availability of large amounts of labeled data. However, collecting a large-scale dataset with human annotation can be expensive and sometimes difficult. Self-supervised learning has therefore attracted much research interest as a way to train models without labels. In this paper, we propose a self-supervised learning framework for speaker recognition. Combining clustering with deep representation learning, the proposed framework generates pseudo labels for the unlabeled dataset and learns speaker representations without human annotation. Our method starts by training a speaker representation encoder with contrastive self-supervised learning. Clustering on the learned representations generates pseudo labels, which are used as the supervisory signal for the subsequent training of the representation encoder. The clustering and representation learning steps are performed iteratively to bootstrap the discriminative power of the deep neural network. We apply this self-supervised learning framework to both single-modal audio data and multi-modal audio-visual data. For audio-visual data, audio and visual representation encoders are employed to learn representations of the corresponding modality. A cluster ensemble algorithm is then used to fuse the clustering results of the two modalities. The complementary information across modalities ensures a robust and fault-tolerant supervisory signal for audio and visual representation learning. Experimental results show that our proposed iterative self-supervised learning framework outperforms previous self-supervised approaches by a large margin. Training with single-modal audio data on the development set of VoxCeleb 2, our proposed framework achieves an equal error rate (EER) of 2.8% on the original test trials of VoxCeleb 1. When training with the additional visual modality, the EER further drops to 1.8%, which is only 20% higher than the fully supervised audio-based system's EER of 1.5%. Experimental analysis also shows that the proposed framework generates pseudo labels that are highly correlated with ground-truth labels.
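The core loop described in the abstract alternates between clustering the learned embeddings and using the resulting cluster IDs as pseudo speaker labels for the next round of encoder training. The Python sketch below is not the authors' code: the random embeddings, the cluster count, and the choice of k-means are illustrative assumptions, meant only to show the clustering-to-pseudo-label step that would feed each training round.

import numpy as np
from sklearn.cluster import KMeans

# Stand-in for encoder outputs: N utterance embeddings of dimension D.
# In the actual framework these would come from the contrastively trained encoder.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 256))

# Cluster the embeddings; each cluster ID serves as a pseudo speaker label.
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0)
pseudo_labels = kmeans.fit_predict(embeddings)

# These pseudo labels would supervise the next round of encoder training,
# after which embeddings are re-extracted and the clustering is repeated.
print(pseudo_labels[:10])

In the full framework, the encoder would be retrained as a speaker classifier on these pseudo labels, new embeddings extracted, and the clustering repeated; with audio-visual data, the per-modality cluster assignments would additionally be fused by a cluster ensemble before retraining.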


Published In

IEEE/ACM Transactions on Audio Speech and Language Processing

DOI

10.1109/TASLP.2022.3162078

EISSN

2329-9304

ISSN

2329-9290

Publication Date

January 1, 2022

Volume

30

Start / End Page

1422 / 1435
 

Citation

APA
Cai, D., Wang, W., & Li, M. (2022). Incorporating Visual Information in Audio Based Self-Supervised Speaker Recognition. IEEE/ACM Transactions on Audio Speech and Language Processing, 30, 1422–1435. https://doi.org/10.1109/TASLP.2022.3162078

Chicago
Cai, D., W. Wang, and M. Li. “Incorporating Visual Information in Audio Based Self-Supervised Speaker Recognition.” IEEE/ACM Transactions on Audio Speech and Language Processing 30 (January 1, 2022): 1422–35. https://doi.org/10.1109/TASLP.2022.3162078.

ICMJE
Cai D, Wang W, Li M. Incorporating Visual Information in Audio Based Self-Supervised Speaker Recognition. IEEE/ACM Transactions on Audio Speech and Language Processing. 2022 Jan 1;30:1422–35.

MLA
Cai, D., et al. “Incorporating Visual Information in Audio Based Self-Supervised Speaker Recognition.” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 30, Jan. 2022, pp. 1422–35. Scopus, doi:10.1109/TASLP.2022.3162078.

NLM
Cai D, Wang W, Li M. Incorporating Visual Information in Audio Based Self-Supervised Speaker Recognition. IEEE/ACM Transactions on Audio Speech and Language Processing. 2022 Jan 1;30:1422–1435.
