Cross-lingual multi-speaker speech synthesis with limited bilingual training data
Modeling voices for multiple speakers and multiple languages with one speech synthesis system has long been a challenge, especially in low-resource cases. This paper presents two approaches to cross-lingual multi-speaker text-to-speech (TTS) and code-switching synthesis under two training scenarios: (1) cross-lingual synthesis with sufficient data, and (2) cross-lingual synthesis with limited data per speaker. Accordingly, a novel TTS synthesis model and a non-autoregressive multi-speaker voice conversion model are proposed. The TTS model, designed for the sufficient-data scenario, has a Tacotron-based structure that uses shared phonemic representations associated with numeric language ID codes. For the data-limited scenario, we adopt a framework that cascades several speech modules. In particular, we propose a non-autoregressive many-to-many voice conversion module to address multi-speaker synthesis when data per speaker is insufficient. Experimental results on speaker similarity show that the proposed voice conversion module maintains voice characteristics well in data-limited cases. Both approaches use limited bilingual data and demonstrate impressive performance in cross-lingual synthesis, delivering fluent foreign speech and even code-switching speech for monolingual speakers.
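As a rough illustration of the input representation mentioned in the abstract, the sketch below pairs each phoneme from a shared inventory with a numeric language ID, so that a code-switched utterance becomes a single sequence of (phoneme ID, language ID) pairs. The phoneme inventory, the language labels `L1` and `L2`, and the `encode` helper are all hypothetical and chosen only for illustration; the abstract does not specify the actual symbol set or how the language IDs enter the Tacotron-based model.

```python
# Hypothetical shared phoneme inventory used by both languages (not from the paper).
SHARED_PHONEMES = ["<pad>", "a", "i", "n", "h", "o", "l", "e"]
PHONEME_TO_ID = {p: i for i, p in enumerate(SHARED_PHONEMES)}

# Hypothetical numeric language ID codes.
LANG_TO_ID = {"L1": 0, "L2": 1}


def encode(segments):
    """Turn [(lang, [phonemes...]), ...] into a list of (phoneme_id, lang_id) pairs."""
    pairs = []
    for lang, phonemes in segments:
        lang_id = LANG_TO_ID[lang]
        for p in phonemes:
            pairs.append((PHONEME_TO_ID[p], lang_id))
    return pairs


# A code-switched utterance: a segment in L1 followed by a segment in L2.
utterance = [("L1", ["n", "i", "h", "a", "o"]), ("L2", ["h", "e", "l", "o"])]
encoded = encode(utterance)
```

In this sketch the phoneme `h` receives the same phoneme ID in both segments and differs only in its language ID, which is one plausible reading of "shared phonemic representations associated with numeric language ID codes".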
Duke Scholars
Published In
DOI
EISSN
ISSN
Publication Date
Volume
Related Subject Headings
- Speech-Language Pathology & Audiology
- 46 Information and computing sciences
- 40 Engineering
- 2004 Linguistics
- 1702 Cognitive Sciences
- 0801 Artificial Intelligence and Image Processing