Scholars@Duke publication: D-MONA: A dilated mixed-order non-local attention network for speaker and language recognition.

Publication, Journal Article

Miao, X; McLoughlin, I; Wang, W; Zhang, P

Published in: Neural networks : the official journal of the International Neural Network Society

July 2021

Attention-based convolutional neural network (CNN) models are increasingly being adopted for speaker and language recognition (SR/LR) tasks. These include time, frequency, spatial and channel attention, which can focus on useful time frames, frequency bands, regions or channels while extracting features. However, these traditional attention methods do not explore complex information or multi-scale long-range speech feature interactions, both of which can benefit SR/LR tasks. To address these issues, this paper first proposes mixed-order attention (MOA) for low frame-level speech features to gain the finest-grained multi-order information at higher resolution. We then combine that with a non-local attention (NLA) mechanism and a dilated residual structure to balance fine-grained local detail with convolution from multi-scale long-range time/frequency regions in feature space. The proposed dilated mixed-order non-local attention network (D-MONA) exploits the detail available from first- and second-order feature attention analysis, but achieves this over a much wider context than purely local attention. Experiments are conducted on three datasets: two SR tasks, VoxCeleb and CN-Celeb, and one LR task, NIST LRE 07. For SR, D-MONA improves on ResNet-34 results by at least 29% and 15% for VoxCeleb1 and CN-Celeb respectively. For the LR task, D-MONA improves on ResNet-34 by 21% for the challenging 3 s utterance condition, 59% for the 10 s condition and 67% for the 30 s condition. It also outperforms the state-of-the-art deep bottleneck feature-DNN (DBF-DNN) x-vector system at all scales.
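
As a rough illustration of the non-local attention idea mentioned in the abstract, the sketch below lets every feature frame attend to every other frame, however far apart, and adds the result back to the original frame. It is not the D-MONA implementation: it assumes identity embeddings in place of learned projections, and it omits the mixed-order attention and the dilated residual structure entirely. The function names `softmax` and `non_local_attention` and the toy input are hypothetical, chosen only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def non_local_attention(frames):
    """Generic non-local attention over a sequence of feature frames.

    frames: list of T frames, each a list of C floats.
    Returns a list of the same shape. Identity embeddings are used in
    place of learned projections (an assumption made for brevity).
    """
    out = []
    for query in frames:
        # Dot-product similarity of this frame to every frame in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) for key in frames]
        weights = softmax(scores)
        # Weighted sum over all frames gives the long-range context.
        context = [
            sum(w * frame[c] for w, frame in zip(weights, frames))
            for c in range(len(query))
        ]
        # Residual connection keeps the original local detail.
        out.append([x + ctx for x, ctx in zip(query, context)])
    return out


# Toy input: three frames with two feature channels each.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = non_local_attention(frames)
```

The output has the same shape as the input, so a block like this can in principle be dropped between existing layers. A practical model would replace the identity embeddings with learned projections and operate on batched tensors.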

Duke Scholars

Author Xiaoxiao Miao DKU Faculty

Published In

Neural networks : the official journal of the International Neural Network Society

DOI

10.1016/j.neunet.2021.03.014

EISSN

1879-2782

ISSN

0893-6080

Publication Date

July 2021

Volume

139

Start / End Page

201 / 211

Related Subject Headings

Speech Recognition Software
Neural Networks, Computer
Natural Language Processing
Artificial Intelligence & Image Processing
4905 Statistics
4611 Machine learning
4602 Artificial intelligence

Citation

Miao, X., McLoughlin, I., Wang, W., & Zhang, P. (2021). D-MONA: A dilated mixed-order non-local attention network for speaker and language recognition. Neural Networks : The Official Journal of the International Neural Network Society, 139, 201–211. https://doi.org/10.1016/j.neunet.2021.03.014

Miao, Xiaoxiao, Ian McLoughlin, Wenchao Wang, and Pengyuan Zhang. “D-MONA: A dilated mixed-order non-local attention network for speaker and language recognition.” Neural Networks : The Official Journal of the International Neural Network Society 139 (July 2021): 201–11. https://doi.org/10.1016/j.neunet.2021.03.014.

Miao X, McLoughlin I, Wang W, Zhang P. D-MONA: A dilated mixed-order non-local attention network for speaker and language recognition. Neural networks : the official journal of the International Neural Network Society. 2021 Jul;139:201–11.

Miao, Xiaoxiao, et al. “D-MONA: A dilated mixed-order non-local attention network for speaker and language recognition.” Neural Networks : The Official Journal of the International Neural Network Society, vol. 139, July 2021, pp. 201–11. Epmc, doi:10.1016/j.neunet.2021.03.014.
