Scholars@Duke publication: On-the-Fly Data Loader and Utterance-Level Aggregation for Speaker and Language Recognition

On-the-Fly Data Loader and Utterance-Level Aggregation for Speaker and Language Recognition

Publication, Journal Article

Cai, W; Chen, J; Zhang, J; Li, M

Published in: IEEE ACM Transactions on Audio Speech and Language Processing

January 1, 2020

In this article, our recent efforts on directly modeling utterance-level aggregation for speaker and language recognition are summarized. First, an on-the-fly data loader for efficient network training is proposed. The data loader acts as a bridge between the full-length utterances and the network. It generates mini-batch samples on the fly, which allows batch-wise variable-length training and online data augmentation. Second, the traditional dictionary learning and Baum-Welch statistical accumulation mechanisms are applied to the network structure, and a learnable dictionary encoding (LDE) layer is introduced. The LDE layer accumulates discriminative statistics from the variable-length input sequence and outputs a single fixed-dimensional utterance-level representation. Experiments were conducted on four different datasets, namely NIST LRE 2007, AP17-OLR, SITW, and NIST SRE 2016. Experimental results show the effectiveness of the proposed batch-wise variable-length training with online data augmentation and of the LDE layer, which significantly outperform the baseline methods.
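The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `sample_batch` and `lde_pool`, the crop range of 200 to 400 frames, the tiling of short utterances, and the softmax-style soft assignment are all assumptions made for illustration, and online data augmentation is left out. In the article the dictionary is learned jointly with the network; here `centers` and `scales` are fixed random arrays.

```python
import numpy as np

def sample_batch(utterances, rng, min_len=200, max_len=400):
    """Crop every utterance to one length drawn at random for this batch."""
    # One length per mini-batch: lengths vary across batches, not within one.
    length = int(rng.integers(min_len, max_len + 1))
    batch = []
    for feats in utterances:  # feats has shape (num_frames, feat_dim)
        if feats.shape[0] < length:
            # Assumption: repeat short utterances until they cover the crop.
            reps = -(-length // feats.shape[0])  # ceiling division
            feats = np.tile(feats, (reps, 1))
        start = int(rng.integers(0, feats.shape[0] - length + 1))
        batch.append(feats[start:start + length])
    return np.stack(batch)  # (batch, length, feat_dim)

def lde_pool(frames, centers, scales):
    """Aggregate (T, D) frames into a fixed (C * D,) vector with C components."""
    resid = frames[:, None, :] - centers[None, :, :]   # (T, C, D) residuals
    dist = np.sum(resid ** 2, axis=2)                  # (T, C) squared distances
    logits = -scales[None, :] * dist
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)      # soft assignment per frame
    # Weighted mean residual per component, accumulated over all frames.
    occupancy = weights.sum(axis=0)[:, None] + 1e-8    # (C, 1)
    agg = (weights[:, :, None] * resid).sum(axis=0) / occupancy
    return agg.reshape(-1)

# Hypothetical usage: three utterances of different lengths, 8-dim features.
rng = np.random.default_rng(0)
utts = [rng.standard_normal((n, 8)) for n in (150, 320, 500)]
batch = sample_batch(utts, rng)
centers = rng.standard_normal((4, 8))  # stand-in for learned dictionary centers
scales = np.ones(4)                    # stand-in for learned smoothing factors
emb = np.stack([lde_pool(x, centers, scales) for x in batch])
```

Whatever crop length a batch receives, `emb` has one row of `C * D` values per utterance, which is the fixed-dimensional utterance-level representation the abstract refers to.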

Duke Scholars

Author Ming Li DKU Faculty

Published In

IEEE ACM Transactions on Audio Speech and Language Processing

DOI

10.1109/TASLP.2020.2980991

EISSN

2329-9304

ISSN

2329-9290

Publication Date

January 1, 2020

Volume

28

Start / End Page

1038 / 1051

Citation

Cai, W., Chen, J., Zhang, J., & Li, M. (2020). On-the-Fly Data Loader and Utterance-Level Aggregation for Speaker and Language Recognition. IEEE ACM Transactions on Audio Speech and Language Processing, 28, 1038–1051. https://doi.org/10.1109/TASLP.2020.2980991

Cai, W., J. Chen, J. Zhang, and M. Li. “On-the-Fly Data Loader and Utterance-Level Aggregation for Speaker and Language Recognition.” IEEE ACM Transactions on Audio Speech and Language Processing 28 (January 1, 2020): 1038–51. https://doi.org/10.1109/TASLP.2020.2980991.

Cai W, Chen J, Zhang J, Li M. On-the-Fly Data Loader and Utterance-Level Aggregation for Speaker and Language Recognition. IEEE ACM Transactions on Audio Speech and Language Processing. 2020 Jan 1;28:1038–51.

Cai, W., et al. “On-the-Fly Data Loader and Utterance-Level Aggregation for Speaker and Language Recognition.” IEEE ACM Transactions on Audio Speech and Language Processing, vol. 28, Jan. 2020, pp. 1038–51. Scopus, doi:10.1109/TASLP.2020.2980991.
