Robust Multi-Channel Far-Field Speaker Verification under Different In-Domain Data Availability Scenarios
The popularity of smart home devices has made far-field speaker verification an urgent need. However, speaker verification performance remains unsatisfactory in far-field environments, despite the significant improvements enabled by deep neural networks (DNNs). In this paper, we summarize our previous work and propose multiple training strategies and models for multi-channel far-field speaker verification under different in-domain data availability scenarios. The experiments are conducted on the FFSVC20 dataset, for which we propose cross-device and cross-domain trials. We address both single-channel and multi-channel speaker verification training on this dataset. For single-channel speaker verification, we consider three training scenarios that differ in the size of the training data and the availability of labels: 1) no out-of-domain data and a small amount of in-domain labeled data; 2) large-scale out-of-domain labeled data and a small amount of in-domain labeled data; 3) large-scale out-of-domain labeled data and a small amount of in-domain unlabeled data. For these three scenarios, we propose a meta-learning approach, refined transfer learning methods, and semi-supervised learning, respectively. For multi-channel speaker verification, we first introduce two types of 3-dimensional convolution (3D Conv) residual network (ResNet) models proposed in our previous work: a fully 3D ResNet and a ResNet that combines 3D Conv with 2D Conv (3D2D-ResNet). We then propose a channel-wise 3D squeeze-and-excitation ResNet (C3DSE-ResNet) and a spatial-wise 3D SE ResNet (S3DSE-ResNet) to further explore the channel dependencies and improve the performance of the 3D ConvNet. The results show that the proposed strategies and models significantly boost performance in the far-field scenario.