Bridging Facial Imagery and Vocal Reality: Stable Diffusion-Enhanced Voice Generation

Publication, Conference

Lin, Y; Liu, D; Xu, Y; Suo, H; Li, M

Published in: 2024 14th International Symposium on Chinese Spoken Language Processing, ISCSLP 2024

January 1, 2024

Generating novel voices in speech synthesis is a challenging task, with the potential to create the versatile voices needed in entertainment and research. One of the primary obstacles in this area is the lack of well-annotated voice descriptions for expressive speech corpora. Our research aims to address this issue by representing speaker styles from vision. We introduce Stable Diffusion-Enhanced Voice Generation (SD-EVG), which leverages Stable Diffusion to generate imaginary facial images for new voice generation. To create a reference set of facial images based on realistic voices, SD-EVG employs a transformer encoder and a Stable Diffusion decoder to visualize the speaker’s face. Subsequently, SD-EVG uses a KNN-based approach to map facial features to speech style for voice generation. The experiments demonstrate that voices generated from the imagined facial data show greater potential for capturing speech style than text-based methods given the same descriptions.
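The KNN-based mapping mentioned in the abstract can be pictured with a small sketch. The abstract does not specify the embeddings, the distance metric, the value of k, or how neighbours are combined, so everything below is an assumption made for illustration: the function name `knn_style` is hypothetical, faces and styles are toy 2-D vectors, distance is Euclidean, and the result is an unweighted mean of the neighbours' style vectors. It is not the authors' implementation.

```python
import math

def knn_style(query_face, reference_faces, reference_styles, k=3):
    """Average the style vectors of the k reference faces nearest to query_face.

    Hypothetical helper: metric, k, and averaging are illustrative assumptions.
    """
    if not 1 <= k <= len(reference_faces):
        raise ValueError("k must be between 1 and the number of references")
    # Euclidean distance from the query to every reference face embedding.
    distances = [math.dist(query_face, face) for face in reference_faces]
    # Indices of the k closest reference faces.
    nearest = sorted(range(len(distances)), key=distances.__getitem__)[:k]
    dim = len(reference_styles[0])
    # Unweighted mean of the neighbours' style vectors (an assumption).
    return [sum(reference_styles[i][d] for i in nearest) / k for d in range(dim)]

# Toy reference set: each face embedding is paired with a style embedding.
faces = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
styles = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]]

# A new (imagined) face close to the first two references.
style = knn_style([0.2, 0.1], faces, styles, k=2)
```

In this toy setup the query face lies nearest to the first two reference faces, so the returned style is the mean of their two style vectors. A real system would presumably use learned, high-dimensional embeddings in place of these hand-written lists.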

Duke Scholars

Author Yueqian Lin

Author Ming Li DKU Faculty

Published In

2024 14th International Symposium on Chinese Spoken Language Processing, ISCSLP 2024

DOI

10.1109/ISCSLP63861.2024.10800185

Publication Date

January 1, 2024

Start / End Page

229 / 233

Citation

APA

Lin, Y., Liu, D., Xu, Y., Suo, H., & Li, M. (2024). Bridging Facial Imagery and Vocal Reality: Stable Diffusion-Enhanced Voice Generation. In 2024 14th International Symposium on Chinese Spoken Language Processing, ISCSLP 2024 (pp. 229–233). https://doi.org/10.1109/ISCSLP63861.2024.10800185

Chicago

Lin, Y., D. Liu, Y. Xu, H. Suo, and M. Li. “Bridging Facial Imagery and Vocal Reality: Stable Diffusion-Enhanced Voice Generation.” In 2024 14th International Symposium on Chinese Spoken Language Processing, ISCSLP 2024, 229–33, 2024. https://doi.org/10.1109/ISCSLP63861.2024.10800185.

ICMJE

Lin Y, Liu D, Xu Y, Suo H, Li M. Bridging Facial Imagery and Vocal Reality: Stable Diffusion-Enhanced Voice Generation. In: 2024 14th International Symposium on Chinese Spoken Language Processing, ISCSLP 2024. 2024. p. 229–33.

MLA

Lin, Y., et al. “Bridging Facial Imagery and Vocal Reality: Stable Diffusion-Enhanced Voice Generation.” 2024 14th International Symposium on Chinese Spoken Language Processing, ISCSLP 2024, 2024, pp. 229–33. Scopus, doi:10.1109/ISCSLP63861.2024.10800185.

NLM

Lin Y, Liu D, Xu Y, Suo H, Li M. Bridging Facial Imagery and Vocal Reality: Stable Diffusion-Enhanced Voice Generation. 2024 14th International Symposium on Chinese Spoken Language Processing, ISCSLP 2024. 2024. p. 229–233.
