Bridging Facial Imagery and Vocal Reality: Stable Diffusion-Enhanced Voice Generation

Publication, Conference
Lin, Y; Liu, D; Xu, Y; Suo, H; Li, M
Published in: 2024 14th International Symposium on Chinese Spoken Language Processing ISCSLP 2024
January 1, 2024

Generating novel voices is a challenging task in speech synthesis, with potential for creating the versatile voices needed in entertainment and research. A primary obstacle in this area is the lack of well-annotated voice descriptions for expressive speech corpora. Our research addresses this issue by representing speaker styles through vision. We introduce Stable Diffusion-Enhanced Voice Generation (SD-EVG), which leverages Stable Diffusion to generate imaginary facial images for new voice generation. To create a reference set of facial images based on realistic voices, SD-EVG employs a transformer encoder and a Stable Diffusion decoder to visualize the speaker’s face. SD-EVG then uses a KNN-based approach to map facial features to speech style for voice generation. Experiments show that voices generated from the imagined facial data have greater potential to capture speech style than text-based methods given the same descriptions.
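
The abstract does not specify how the KNN-based mapping is built, so the following is only a minimal sketch of one plausible reading: given a reference set of (face embedding, speech-style embedding) pairs, find the k reference faces most similar to a new face and average their style vectors. The function name `knn_style_from_face`, the use of cosine similarity, the unweighted average, the embedding sizes, and the random toy data are all assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_style_from_face(query_face, ref_faces, ref_styles, k=3):
    """Hypothetical helper: average the style vectors of the k reference
    faces most similar to query_face (assumed design, not the paper's)."""
    sims = [cosine(query_face, face) for face in ref_faces]
    # Indices of reference faces, most similar first, truncated to k.
    nearest = sorted(range(len(ref_faces)), key=lambda i: sims[i], reverse=True)[:k]
    dim = len(ref_styles[0])
    # Unweighted mean of the neighbours' style vectors.
    style = [sum(ref_styles[i][d] for i in nearest) / len(nearest) for d in range(dim)]
    return style, nearest


# Toy reference set: 50 speakers with random 16-d face and 8-d style vectors.
random.seed(0)
ref_faces = [[random.gauss(0, 1) for _ in range(16)] for _ in range(50)]
ref_styles = [[random.gauss(0, 1) for _ in range(8)] for _ in range(50)]

# A query face that is a slightly perturbed copy of reference speaker 7.
query_face = [x + random.gauss(0, 0.01) for x in ref_faces[7]]

style, nearest = knn_style_from_face(query_face, ref_faces, ref_styles, k=3)
print("nearest reference speakers:", nearest)
print("predicted style vector length:", len(style))
```

In this toy setup the closest neighbour is the speaker whose face was perturbed to form the query. In a real system the face embeddings would come from a face encoder applied to the generated images and the style vectors from a speaker or style encoder; both are outside the scope of this sketch.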

Published In

2024 14th International Symposium on Chinese Spoken Language Processing ISCSLP 2024

DOI

10.1109/ISCSLP63861.2024.10800185

Publication Date

January 1, 2024

Start / End Page

229 / 233
 

Citation

APA: Lin, Y., Liu, D., Xu, Y., Suo, H., & Li, M. (2024). Bridging Facial Imagery and Vocal Reality: Stable Diffusion-Enhanced Voice Generation. In 2024 14th International Symposium on Chinese Spoken Language Processing ISCSLP 2024 (pp. 229–233). https://doi.org/10.1109/ISCSLP63861.2024.10800185

Chicago: Lin, Y., D. Liu, Y. Xu, H. Suo, and M. Li. “Bridging Facial Imagery and Vocal Reality: Stable Diffusion-Enhanced Voice Generation.” In 2024 14th International Symposium on Chinese Spoken Language Processing ISCSLP 2024, 229–33, 2024. https://doi.org/10.1109/ISCSLP63861.2024.10800185.

ICMJE: Lin Y, Liu D, Xu Y, Suo H, Li M. Bridging Facial Imagery and Vocal Reality: Stable Diffusion-Enhanced Voice Generation. In: 2024 14th International Symposium on Chinese Spoken Language Processing ISCSLP 2024. 2024. p. 229–33.

MLA: Lin, Y., et al. “Bridging Facial Imagery and Vocal Reality: Stable Diffusion-Enhanced Voice Generation.” 2024 14th International Symposium on Chinese Spoken Language Processing ISCSLP 2024, 2024, pp. 229–33. Scopus, doi:10.1109/ISCSLP63861.2024.10800185.

NLM: Lin Y, Liu D, Xu Y, Suo H, Li M. Bridging Facial Imagery and Vocal Reality: Stable Diffusion-Enhanced Voice Generation. 2024 14th International Symposium on Chinese Spoken Language Processing ISCSLP 2024. 2024. p. 229–233.
