Mandarin electrolaryngeal voice conversion with combination of Gaussian mixture model and non-negative matrix factorization
An electrolarynx (EL) is a speaking-aid device that helps laryngectomees, whose larynx has been removed, to produce voice. However, the voice generated by an EL is unnatural and unintelligible because of its flat pitch and strong vibration noise. To address these problems, previous work has shown that electrolaryngeal speech can be enhanced with Gaussian mixture model (GMM) based voice conversion (VC). Although GMM-based VC improves naturalness, it degrades the intelligibility of the converted speech. To address this issue, we propose a hybrid approach that combines non-negative matrix factorization (NMF) with the GMM. For better intelligibility, we apply NMF to estimate high-quality spectral features. For better naturalness, we use the GMM with a dynamic trajectory constraint to recover a smooth F0 contour. Additionally, to suppress the EL vibration noise, we include the 0th mel-cepstral coefficient (MCC) in the GMM-based VC. The proposed method significantly increases the F0 dynamic range, reduces vibration noise, and improves both speech naturalness and intelligibility. One hundred pairs of normal and electrolaryngeal speech recordings in everyday Mandarin serve as our evaluation data. Experimental results show that the proposed hybrid method reduces the mel-cepstral distortion by 7.1 dB and increases the F0 correlation coefficient to 0.54.
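The two objective measures reported above, mel-cepstral distortion and the F0 correlation coefficient, can be illustrated with a minimal sketch. It uses the commonly cited definitions and is not necessarily the exact evaluation code behind the reported numbers. The function names are hypothetical. The sketch assumes that the two feature sequences are already time-aligned frame by frame, that the 0th coefficient is excluded from the distortion by default (a frequent convention), and that unvoiced frames are marked by an F0 of 0.

```python
import math


def mel_cepstral_distortion(ref, conv, skip_c0=True):
    """Mean mel-cepstral distortion in dB over time-aligned frames.

    ref, conv: lists of frames, each frame a list of mel-cepstral coefficients.
    skip_c0:   assumption; drops the 0th (energy-related) coefficient.
    """
    start = 1 if skip_c0 else 0
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for r, c in zip(ref, conv):
        # Squared Euclidean distance between the two frames
        sq = sum((a - b) ** 2 for a, b in zip(r[start:], c[start:]))
        total += scale * math.sqrt(2.0 * sq)
    return total / len(ref)


def f0_correlation(f0_ref, f0_conv):
    """Pearson correlation of two F0 contours over frames voiced in both.

    Assumption: a value of 0 marks an unvoiced frame.
    """
    pairs = [(a, b) for a, b in zip(f0_ref, f0_conv) if a > 0 and b > 0]
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_a * var_b)
```

A lower distortion means the converted spectrum is closer to the normal-speech target, while a correlation closer to 1 means the recovered F0 contour follows the target contour more closely. In practice the alignment step (for example, dynamic time warping) would precede both measures.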