Scholars@Duke publication: Deep Speaker Embeddings with Convolutional Neural Network on Supervector for Text-Independent Speaker Recognition

Publication, Conference

Cai, D; Cai, Z; Li, M

Published in: 2018 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2018 Proceedings

July 2, 2018

Lexical content variability across utterances is the key challenge for text-independent speaker verification. In this paper, we investigate using a supervector, which can reduce the impact of lexical content mismatch among utterances, for supervised speaker embedding learning. A DNN acoustic model aligns a feature sequence to a set of senones and generates a centered and normalized first-order statistics supervector. Statistics vectors from similar senones are placed together and reshaped into an image to maintain local continuity and correlation. The supervector image is then fed into a residual convolutional neural network. The deep speaker embedding features are the outputs of the last hidden layer of the network, and we employ a PLDA back-end for the subsequent modeling. Experimental results show that the proposed method outperforms the conventional GMM-UBM i-vector system and is complementary to the DNN-UBM i-vector system. The score-level fusion system achieves 1.26% EER and a DCF10 cost of 0.260 on the NIST SRE 10 extended core condition 5 task.
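The supervector construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the per-senone `means` and `stds` arrays, the exact normalization (dividing by the senone occupancy and a per-dimension standard deviation), and the `order` permutation that groups similar senones are all assumptions made for this example, since the abstract does not specify them.

```python
import numpy as np


def first_order_supervector(features, posteriors, means, stds, eps=1e-8):
    """Centered, normalized first-order statistics per senone.

    features:   (T, D) frame-level acoustic features
    posteriors: (T, C) senone posteriors from the DNN acoustic model
    means:      (C, D) per-senone feature means (assumed to be given)
    stds:       (C, D) per-senone standard deviations (assumed to be given)
    Returns an array of shape (C, D).
    """
    # Zeroth-order statistics: soft frame count for each senone.
    occupancy = posteriors.sum(axis=0)                      # (C,)
    # First-order statistics: posterior-weighted sum of features.
    first_order = posteriors.T @ features                   # (C, D)
    # Centering: subtract the occupancy-weighted senone mean.
    centered = first_order - occupancy[:, None] * means
    # Normalization (an assumption here): divide by occupancy and by std.
    return centered / (occupancy[:, None] + eps) / stds


def supervector_image(supervector, order):
    """Reorder senone rows so similar senones are adjacent.

    `order` is a hypothetical permutation of senone indices (for example
    from clustering senones); the result is a single-channel image of
    shape (1, C, D) suitable as CNN input.
    """
    reordered = supervector[np.asarray(order)]
    return reordered[None, :, :]
```

Given `features` of shape `(T, D)` and `posteriors` of shape `(T, C)`, calling `supervector_image(first_order_supervector(features, posteriors, means, stds), order)` yields a `(1, C, D)` array that a residual CNN could consume; the actual image layout used in the paper may differ.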

Duke Scholars

Author: Ming Li, DKU Faculty

Published In

2018 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2018 Proceedings

DOI

10.23919/APSIPA.2018.8659595

Publication Date

July 2, 2018

Start / End Page

1478 / 1482

Citation

APA

Cai, D., Cai, Z., & Li, M. (2018). Deep Speaker Embeddings with Convolutional Neural Network on Supervector for Text-Independent Speaker Recognition. In 2018 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2018 Proceedings (pp. 1478–1482). https://doi.org/10.23919/APSIPA.2018.8659595

Chicago

Cai, D., Z. Cai, and M. Li. “Deep Speaker Embeddings with Convolutional Neural Network on Supervector for Text-Independent Speaker Recognition.” In 2018 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2018 Proceedings, 1478–82, 2018. https://doi.org/10.23919/APSIPA.2018.8659595.

ICMJE

Cai D, Cai Z, Li M. Deep Speaker Embeddings with Convolutional Neural Network on Supervector for Text-Independent Speaker Recognition. In: 2018 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2018 Proceedings. 2018. p. 1478–82.

MLA

Cai, D., et al. “Deep Speaker Embeddings with Convolutional Neural Network on Supervector for Text-Independent Speaker Recognition.” 2018 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2018 Proceedings, 2018, pp. 1478–82. Scopus, doi:10.23919/APSIPA.2018.8659595.

NLM

Cai D, Cai Z, Li M. Deep Speaker Embeddings with Convolutional Neural Network on Supervector for Text-Independent Speaker Recognition. 2018 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2018 Proceedings. 2018. p. 1478–1482.
