Scholars@Duke publication: VOXBLINK: A LARGE SCALE SPEAKER VERIFICATION DATASET ON CAMERA

Publication, Conference

Lin, Y; Qin, X; Zhao, G; Cheng, M; Jiang, N; Wu, H; Li, M

Published in: ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings

January 1, 2024

In this paper, we introduce VoxBlink, a large-scale, high-quality audio-visual speaker verification dataset. We propose a robust automatic audio-visual data mining pipeline to curate this dataset, which contains 1.45M utterances from 38K speakers. Because automated data collection inevitably introduces noisy data, we also apply a multi-modal purification step to produce a cleaner version, VoxBlink-clean, comprising 1.02M utterances from 18K identities. In contrast to VoxCeleb, VoxBlink is sourced from short videos of ordinary users, so the covered scenarios align better with real-life situations. To the best of our knowledge, VoxBlink is one of the largest publicly available speaker verification datasets. Training on VoxCeleb and VoxBlink-clean together, we evaluate speaker verification models with multiple backbone architectures on the VoxCeleb test sets. Experimental results show a substantial relative performance improvement of 12% to 30% across backbone architectures when VoxBlink-clean is included in training. The details of the dataset can be found on Site.
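As a reading aid for the figures in the abstract, the following minimal Python sketch works out what share of the raw data survives purification and what a "relative" improvement means for a lower-is-better error metric. The function names are illustrative, and the error-rate values in the last example are hypothetical placeholders, not results from the paper.

```python
def retained_fraction(kept: float, total: float) -> float:
    """Fraction of items that survive a filtering step."""
    return kept / total


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of an error metric where lower is better."""
    return (baseline - improved) / baseline


# Dataset sizes quoted in the abstract (VoxBlink-clean vs. VoxBlink).
utt_kept = retained_fraction(1.02e6, 1.45e6)
spk_kept = retained_fraction(18e3, 38e3)

# Hypothetical error rates in percent, for illustration only (not from the paper).
example = relative_reduction(baseline=1.00, improved=0.80)

print(f"utterances retained: {utt_kept:.1%}")
print(f"speakers retained:   {spk_kept:.1%}")
print(f"relative reduction:  {example:.0%}")
```

By this arithmetic, purification keeps roughly 70% of the utterances but under half of the speakers.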

Duke Scholars

Author Ming Li DKU Faculty

Published In

ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings

DOI

10.1109/ICASSP48485.2024.10446780

ISSN

1520-6149

Publication Date

January 1, 2024

Start / End Page

10271 / 10275

Citation

APA

Lin, Y., Qin, X., Zhao, G., Cheng, M., Jiang, N., Wu, H., & Li, M. (2024). VOXBLINK: A LARGE SCALE SPEAKER VERIFICATION DATASET ON CAMERA. In ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings (pp. 10271–10275). https://doi.org/10.1109/ICASSP48485.2024.10446780

Chicago

Lin, Y., X. Qin, G. Zhao, M. Cheng, N. Jiang, H. Wu, and M. Li. “VOXBLINK: A LARGE SCALE SPEAKER VERIFICATION DATASET ON CAMERA.” In ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, 10271–75, 2024. https://doi.org/10.1109/ICASSP48485.2024.10446780.

ICMJE

Lin Y, Qin X, Zhao G, Cheng M, Jiang N, Wu H, et al. VOXBLINK: A LARGE SCALE SPEAKER VERIFICATION DATASET ON CAMERA. In: ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings. 2024. p. 10271–5.

MLA

Lin, Y., et al. “VOXBLINK: A LARGE SCALE SPEAKER VERIFICATION DATASET ON CAMERA.” ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, 2024, pp. 10271–75. Scopus, doi:10.1109/ICASSP48485.2024.10446780.

NLM

Lin Y, Qin X, Zhao G, Cheng M, Jiang N, Wu H, Li M. VOXBLINK: A LARGE SCALE SPEAKER VERIFICATION DATASET ON CAMERA. ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings. 2024. p. 10271–10275.
