Vivid Background Audio Generation based on Large Language Models and AudioLDM

Publication, Conference
Liang, Y; Li, M
Published in: 2024 14th International Symposium on Chinese Spoken Language Processing ISCSLP 2024
January 1, 2024

This paper describes a background audio and speech generation system for the Inspirational and Convincing Audio Generation Challenge 2024. Our system comprises three modules: a text-to-speech (TTS) speech synthesis baseline, background text description extraction based on large language models, and the corresponding background audio generation based on latent diffusion. We examine how text description extraction affects the degree of correlation between the background audio and its corresponding speech. We also compare the background audio generated after description extraction by different large language models. In addition, we propose an alternative evaluation metric, the Overall Correlation Quality Score (OCQS), to evaluate the relevance and naturalness between the speech text and its background audio. Using this metric, we test multiple models and find that background audio generated from speech text that has been extracted and summarized by large language models achieves better quality.

Published In: 2024 14th International Symposium on Chinese Spoken Language Processing ISCSLP 2024
DOI: https://doi.org/10.1109/ISCSLP63861.2024.10800334
Publication Date: January 1, 2024
Start / End Page: 621 / 625

Citation

APA: Liang, Y., & Li, M. (2024). Vivid Background Audio Generation based on Large Language Models and AudioLDM. In 2024 14th International Symposium on Chinese Spoken Language Processing ISCSLP 2024 (pp. 621–625). https://doi.org/10.1109/ISCSLP63861.2024.10800334
Chicago: Liang, Y., and M. Li. “Vivid Background Audio Generation based on Large Language Models and AudioLDM.” In 2024 14th International Symposium on Chinese Spoken Language Processing ISCSLP 2024, 621–25, 2024. https://doi.org/10.1109/ISCSLP63861.2024.10800334.
ICMJE: Liang Y, Li M. Vivid Background Audio Generation based on Large Language Models and AudioLDM. In: 2024 14th International Symposium on Chinese Spoken Language Processing ISCSLP 2024. 2024. p. 621–5.
MLA: Liang, Y., and M. Li. “Vivid Background Audio Generation based on Large Language Models and AudioLDM.” 2024 14th International Symposium on Chinese Spoken Language Processing ISCSLP 2024, 2024, pp. 621–25. Scopus, doi:10.1109/ISCSLP63861.2024.10800334.
NLM: Liang Y, Li M. Vivid Background Audio Generation based on Large Language Models and AudioLDM. 2024 14th International Symposium on Chinese Spoken Language Processing ISCSLP 2024. 2024. p. 621–625.
