Vivid Background Audio Generation Based on Large Language Models and AudioLDM
This paper describes a background audio and speech generation system for the Inspirational and Convincing Audio Generation Challenge 2024. Our system consists of three modules: a text-to-speech (TTS) speech synthesis baseline, background text description extraction based on large language models, and background audio generation based on latent diffusion. We analyze how the text description extraction step affects the degree of correlation between the background audio and its corresponding speech, and we compare the background audio produced from descriptions extracted by different large language models. We also propose an alternative evaluation metric, the Overall Correlation Quality Score (OCQS), to measure the relevance and naturalness of the background audio with respect to the speech text. Using this metric, we evaluate multiple models and find that background audio generated from speech text that has been extracted and summarized by large language models achieves better quality.
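To make the described pipeline concrete, the following is a minimal sketch of the background-audio branch: an LLM summarizes the speech text into a short scene description, which then conditions a latent diffusion text-to-audio model. The prompt wording, the model checkpoint `cvssp/audioldm-s-full-v2`, and the helper `extract_background_description` are illustrative assumptions, not the authors' exact implementation; the TTS baseline and the mixing of speech with background audio are omitted.

```python
# Sketch of the background-audio branch (assumed components, not the paper's code).
import torch
from diffusers import AudioLDMPipeline


def extract_background_description(speech_text: str) -> str:
    """Hypothetical LLM step: summarize the speech text into a short
    background-scene description (e.g. via an instruction-tuned LLM).
    Here we only show the prompt that such a call might use."""
    prompt = (
        "Summarize the acoustic scene implied by the following text "
        f"in one short phrase suitable for sound generation:\n{speech_text}"
    )
    # Replace this stub with an actual LLM call; the return value is a placeholder.
    return "gentle rain on a window with distant thunder"


def generate_background_audio(description: str, duration_s: float = 10.0):
    """Condition a latent diffusion text-to-audio model on the description."""
    pipe = AudioLDMPipeline.from_pretrained(
        "cvssp/audioldm-s-full-v2", torch_dtype=torch.float16
    ).to("cuda")
    result = pipe(
        prompt=description,
        num_inference_steps=50,
        audio_length_in_s=duration_s,
    )
    return result.audios[0]  # waveform as a NumPy array


if __name__ == "__main__":
    speech_text = "She listened to the storm outside while reading her book."
    description = extract_background_description(speech_text)
    background = generate_background_audio(description)
```

In this sketch the extracted description, rather than the raw speech text, is what conditions the diffusion model, which mirrors the comparison reported in the abstract between direct and LLM-summarized conditioning.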