D-MONA: A dilated mixed-order non-local attention network for speaker and language recognition.

Publication, Journal Article
Miao, X; McLoughlin, I; Wang, W; Zhang, P
Published in: Neural networks : the official journal of the International Neural Network Society
July 2021

Attention-based convolutional neural network (CNN) models are increasingly being adopted for speaker and language recognition (SR/LR) tasks. These include time, frequency, spatial and channel attention, which can focus on useful time frames, frequency bands, regions or channels while extracting features. However, these traditional attention methods do not explore complex information or multi-scale, long-range interactions between speech features, both of which can benefit SR/LR tasks. To address these issues, this paper first proposes mixed-order attention (MOA) for low frame-level speech features to gain the finest-grained multi-order information at higher resolution. We then combine MOA with a non-local attention (NLA) mechanism and a dilated residual structure to balance fine-grained local detail with convolution from multi-scale long-range time/frequency regions in feature space. The proposed dilated mixed-order non-local attention network (D-MONA) exploits the detail available from first- and second-order feature attention analysis, but achieves this over a much wider context than purely local attention. Experiments are conducted on three datasets: two SR tasks, VoxCeleb and CN-Celeb, and one LR task, NIST LRE 07. For SR, D-MONA improves on ResNet-34 results by at least 29% and 15% for VoxCeleb1 and CN-Celeb respectively. For the LR task, D-MONA improves on ResNet-34 by 21% for the challenging 3s utterance condition, 59% for the 10s condition and 67% for the 30s condition. It also outperforms the state-of-the-art deep bottleneck feature-DNN (DBF-DNN) x-vector system at all scales.
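The abstract names two generic building blocks, non-local attention and dilation, without giving their equations. The sketch below illustrates only those generic ideas and is not the D-MONA architecture: it assumes identity query/key/value projections instead of learned ones, uses scaled dot-product similarity, and omits mixed-order attention entirely. The function names `non_local_attention` and `dilated_indices` are hypothetical, and plain Python is used to keep the example dependency-free.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def non_local_attention(frames):
    """Generic non-local (self-attention) block with a residual connection.

    frames: list of T feature vectors, each a list of C floats.
    Assumption: learned projections are replaced by the identity, so only
    the "every position attends to every position" pattern is shown.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    output = []
    for query in frames:
        # Similarity of this frame to every frame, however far away in time.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in frames]
        weights = softmax(scores)
        # Weighted sum over all positions supplies the long-range context.
        context = [sum(w * f[c] for w, f in zip(weights, frames))
                   for c in range(dim)]
        # Residual connection keeps the original local detail.
        output.append([x + y for x, y in zip(query, context)])
    return output


def dilated_indices(center, kernel_size, dilation, length):
    """Input positions read by one output of a dilated 1-D convolution.

    Positions falling outside [0, length) are dropped (no padding).
    """
    half = kernel_size // 2
    taps = [center + (k - half) * dilation for k in range(kernel_size)]
    return [i for i in taps if 0 <= i < length]
```

With a kernel of size 3, raising the dilation from 1 to 2 widens the span seen around position 5 from `[4, 5, 6]` to `[3, 5, 7]` without adding parameters, which is the usual reason to pair dilation with attention when a wider context is wanted.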

Published In

Neural networks : the official journal of the International Neural Network Society

DOI

10.1016/j.neunet.2021.03.014

EISSN

1879-2782

ISSN

0893-6080

Publication Date

July 2021

Volume

139

Start / End Page

201 / 211

Related Subject Headings

  • Speech Recognition Software
  • Neural Networks, Computer
  • Natural Language Processing
  • Artificial Intelligence & Image Processing
  • 4905 Statistics
  • 4611 Machine learning
  • 4602 Artificial intelligence
 

Citation

APA
Miao, X., McLoughlin, I., Wang, W., & Zhang, P. (2021). D-MONA: A dilated mixed-order non-local attention network for speaker and language recognition. Neural Networks : The Official Journal of the International Neural Network Society, 139, 201–211. https://doi.org/10.1016/j.neunet.2021.03.014

Chicago
Miao, Xiaoxiao, Ian McLoughlin, Wenchao Wang, and Pengyuan Zhang. “D-MONA: A dilated mixed-order non-local attention network for speaker and language recognition.” Neural Networks : The Official Journal of the International Neural Network Society 139 (July 2021): 201–11. https://doi.org/10.1016/j.neunet.2021.03.014.

ICMJE
Miao X, McLoughlin I, Wang W, Zhang P. D-MONA: A dilated mixed-order non-local attention network for speaker and language recognition. Neural networks : the official journal of the International Neural Network Society. 2021 Jul;139:201–11.

MLA
Miao, Xiaoxiao, et al. “D-MONA: A dilated mixed-order non-local attention network for speaker and language recognition.” Neural Networks : The Official Journal of the International Neural Network Society, vol. 139, July 2021, pp. 201–11. Epmc, doi:10.1016/j.neunet.2021.03.014.

NLM
Miao X, McLoughlin I, Wang W, Zhang P. D-MONA: A dilated mixed-order non-local attention network for speaker and language recognition. Neural networks : the official journal of the International Neural Network Society. 2021 Jul;139:201–211.