Vivid Background Audio Generation Based on Large Language Models and AudioLDM
This paper describes a background audio and speech generation system for the Inspirational and Convincing Audio Generation Challenge 2024. Our system consists of three modules: a text-to-speech (TTS) speech synthesis baseline, background text description extraction based on large language models, and background audio generation based on latent diffusion. We analyze how the text description extraction step affects the degree of correlation between the background audio and its corresponding speech, and we compare the background audio produced from descriptions extracted by different large language models. We also propose an alternative evaluation metric, the Overall Correlation Quality Score (OCQS), to measure the relevance and naturalness of the background audio with respect to the speech text. Using this metric, we evaluate multiple models and find that background audio generated from speech text that has been extracted and summarized by large language models achieves better quality.
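To make the described pipeline concrete, the following is a minimal sketch of the background-audio branch: an LLM summarizes the speech text into a short scene description, which then conditions a latent diffusion text-to-audio model. The prompt wording, the model checkpoint `cvssp/audioldm-s-full-v2`, and the helper `extract_background_description` are illustrative assumptions, not the authors' exact implementation; the TTS baseline and the mixing of speech with background audio are omitted.

```python
# Sketch of the background-audio branch (assumed components, not the paper's code).
import torch
from diffusers import AudioLDMPipeline


def extract_background_description(speech_text: str) -> str:
    """Hypothetical LLM step: summarize the speech text into a short
    background-scene description (e.g. via an instruction-tuned LLM).
    Here we only show the prompt that such a call might use."""
    prompt = (
        "Summarize the acoustic scene implied by the following text "
        f"in one short phrase suitable for sound generation:\n{speech_text}"
    )
    # Replace this stub with an actual LLM call; the return value is a placeholder.
    return "gentle rain on a window with distant thunder"


def generate_background_audio(description: str, duration_s: float = 10.0):
    """Condition a latent diffusion text-to-audio model on the description."""
    pipe = AudioLDMPipeline.from_pretrained(
        "cvssp/audioldm-s-full-v2", torch_dtype=torch.float16
    ).to("cuda")
    result = pipe(
        prompt=description,
        num_inference_steps=50,
        audio_length_in_s=duration_s,
    )
    return result.audios[0]  # waveform as a NumPy array


if __name__ == "__main__":
    speech_text = "She listened to the storm outside while reading her book."
    description = extract_background_description(speech_text)
    background = generate_background_audio(description)
```

In this sketch the extracted description, rather than the raw speech text, is what conditions the diffusion model, which mirrors the comparison reported in the abstract between direct and LLM-summarized conditioning.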