Far-field end-to-end text-dependent speaker verification based on mixed training data with transfer learning and enrollment data augmentation
This paper addresses far-field end-to-end text-dependent speaker verification when the training data consist of a small far-field text-dependent dataset and a large close-talking text-independent database. First, we show that simulating far-field text-independent data from the existing large-scale clean database for data augmentation reduces the mismatch between the close-talking training data and far-field conditions. Second, fine-tuning the deep speaker embedding model on the small far-field text-dependent dataset, after pre-training it on both the simulated far-field and the original clean text-independent data, significantly improves system performance. Third, in applications where close-talking clean utterances are used for enrollment and real far-field noisy utterances are used for testing, adding reverberant noise to the clean enrollment data further improves system performance. We evaluate our methods on the AISHELL ASR0009 and AISHELL 2019B-eval databases and achieve an equal error rate (EER) of 5.75% for far-field text-dependent speaker verification in noisy environments.
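The far-field simulation used for both the training data and the enrollment data can be illustrated with a minimal sketch. It assumes a common recipe that the abstract does not spell out: convolve the clean waveform with a room impulse response (RIR), then add noise at a chosen signal-to-noise ratio. The function names `synthetic_rir` and `simulate_far_field`, the toy exponentially decaying RIR, and the parameters `rt60` and `snr_db` are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def synthetic_rir(sample_rate=16000, rt60=0.3, seed=0):
    """Toy RIR (assumption): white noise under an exponential decay envelope."""
    rng = np.random.default_rng(seed)
    n = int(rt60 * sample_rate)
    t = np.arange(n) / sample_rate
    # The amplitude envelope drops by a factor of 1000 (60 dB) at t = rt60.
    envelope = np.exp(-np.log(1000.0) * t / rt60)
    rir = rng.standard_normal(n) * envelope
    rir[0] = 1.0  # direct-path component
    return rir / np.max(np.abs(rir))


def simulate_far_field(clean, rir, noise, snr_db):
    """Reverberate `clean` with `rir`, then add `noise` at `snr_db` dB SNR."""
    # Convolve and truncate so the output keeps the input length.
    reverberant = np.convolve(clean, rir)[: len(clean)]
    # np.resize repeats or truncates the noise to match the speech length.
    noise = np.resize(noise, len(clean))
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that speech_power / scaled_noise_power hits the target.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + scale * noise
```

Under these assumptions, the same function could be applied to the large clean text-independent corpus before pre-training and to the clean enrollment utterances at enrollment time. A real pipeline would more likely draw measured or simulated RIRs and recorded noise from dedicated corpora than use the toy RIR shown here.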