Text-independent voice conversion using deep neural network based phonetic level features

Publication, Conference
Zheng, H; Cai, W; Zhou, T; Zhang, S; Li, M
Published in: Proceedings - International Conference on Pattern Recognition
January 1, 2016

This paper presents a phonetically-aware joint density Gaussian mixture model (JD-GMM) framework for voice conversion that does not require parallel data from the source speaker at the training stage. Since phonetic-level features contain the text information that should be preserved in the conversion task, we propose a method that concatenates only phonetic discriminant features and spectral features extracted from the same target speaker's speech to train a JD-GMM. Once the mapping between these two feature streams has been trained, phonetic discriminant features from the source speaker can be used to estimate the target speaker's spectral features at the conversion stage. The phonetic discriminant features are extracted using PCA from the output layer of a deep neural network (DNN) in an automatic speech recognition (ASR) system; they can be seen as a low-dimensional representation of the senone posteriors. We compare the proposed phonetically-aware method with the conventional JD-GMM method on the Voice Conversion Challenge 2016 training database. The experimental results show that the proposed phonetically-aware feature method obtains performance similar to the conventional JD-GMM while using only the target speaker's speech as training data.
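The joint-density mapping described in the abstract can be sketched roughly as follows. This is a minimal illustration only, assuming hypothetical feature arrays and dimensions (phonetic_feats, spectral_feats, D_P, D_S) and using scikit-learn's GaussianMixture as a stand-in for the paper's JD-GMM training; it is not the authors' implementation.

# Sketch of a joint-density GMM mapping: fit a GMM on concatenated
# [phonetic; spectral] frames, then convert a phonetic vector x to an
# MMSE estimate of the spectral vector E[y | x].
import numpy as np
from sklearn.mixture import GaussianMixture

D_P, D_S, N = 40, 24, 5000                    # assumed feature dims and frame count
phonetic_feats = np.random.randn(N, D_P)      # stand-in for PCA-reduced senone posteriors
spectral_feats = np.random.randn(N, D_S)      # stand-in for target spectral features

# Training stage: concatenate the two feature streams frame-by-frame
# and fit a joint-density GMM (small component count for illustration).
joint = np.hstack([phonetic_feats, spectral_feats])
gmm = GaussianMixture(n_components=8, covariance_type='full').fit(joint)

def convert(x):
    """Map one phonetic feature vector x to an estimated spectral vector."""
    mu_x = gmm.means_[:, :D_P]
    mu_y = gmm.means_[:, D_P:]
    cov_xx = gmm.covariances_[:, :D_P, :D_P]
    cov_yx = gmm.covariances_[:, D_P:, :D_P]

    # Component responsibilities from the marginal GMM over x.
    log_resp = np.array([
        np.log(gmm.weights_[k])
        - 0.5 * np.linalg.slogdet(2 * np.pi * cov_xx[k])[1]
        - 0.5 * (x - mu_x[k]) @ np.linalg.solve(cov_xx[k], x - mu_x[k])
        for k in range(gmm.n_components)
    ])
    resp = np.exp(log_resp - log_resp.max())
    resp /= resp.sum()

    # MMSE mapping: responsibility-weighted sum of the per-component
    # conditional means E[y | x, k].
    y_hat = np.zeros(D_S)
    for k in range(gmm.n_components):
        cond_mean = mu_y[k] + cov_yx[k] @ np.linalg.solve(cov_xx[k], x - mu_x[k])
        y_hat += resp[k] * cond_mean
    return y_hat

converted = convert(phonetic_feats[0])

The minimum mean-square-error form above is the standard conditional-expectation mapping used with joint-density GMMs; the paper's contribution lies in which features are joined (target-speaker phonetic and spectral features), not in this mapping itself.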


Published In

Proceedings - International Conference on Pattern Recognition

DOI

10.1109/ICPR.2016.7900072

ISSN

1051-4651

Publication Date

January 1, 2016

Volume

0

Start / End Page

2872 / 2877
 

Citation

APA: Zheng, H., Cai, W., Zhou, T., Zhang, S., & Li, M. (2016). Text-independent voice conversion using deep neural network based phonetic level features. In Proceedings - International Conference on Pattern Recognition (Vol. 0, pp. 2872–2877). https://doi.org/10.1109/ICPR.2016.7900072
Chicago: Zheng, H., W. Cai, T. Zhou, S. Zhang, and M. Li. “Text-independent voice conversion using deep neural network based phonetic level features.” In Proceedings - International Conference on Pattern Recognition, 0:2872–77, 2016. https://doi.org/10.1109/ICPR.2016.7900072.
ICMJE: Zheng H, Cai W, Zhou T, Zhang S, Li M. Text-independent voice conversion using deep neural network based phonetic level features. In: Proceedings - International Conference on Pattern Recognition. 2016. p. 2872–7.
MLA: Zheng, H., et al. “Text-independent voice conversion using deep neural network based phonetic level features.” Proceedings - International Conference on Pattern Recognition, vol. 0, 2016, pp. 2872–77. Scopus, doi:10.1109/ICPR.2016.7900072.
NLM: Zheng H, Cai W, Zhou T, Zhang S, Li M. Text-independent voice conversion using deep neural network based phonetic level features. Proceedings - International Conference on Pattern Recognition. 2016. p. 2872–2877.
