Efficient Video to Audio Mapper with Visual Scene Detection

Publication, Conference
Yi, M; Wang, Y; Li, M
Published in: 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference Apsipa ASC 2025
January 1, 2025

Video-to-audio (V2A) generation aims to produce corresponding audio given silent video inputs. This task is particularly challenging due to the cross-modality and sequential nature of the audio-visual features involved. Recent works have made significant progress in bridging the domain gap between video and audio, generating audio that is semantically aligned with the video content. However, a critical limitation of these approaches is their inability to recognize and handle multiple scenes within a video, which often leads to suboptimal audio generation in such cases. In this paper, we first reimplement a state-of-the-art V2A model with a slightly modified lightweight architecture that outperforms the baseline. We then propose an improved V2A model that incorporates a scene detector to address the challenge of switching between multiple visual scenes. Results on VGGSound show that our model can recognize and handle multiple scenes within a video and outperforms the baseline in both fidelity and relevance. The demo samples and code are available at https://1mageyi.github.io/V2A-SceneDetector.demo/.
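The abstract does not specify how the scene detector works, so the following is only a minimal sketch of one common way to find scene switches: comparing intensity histograms of consecutive frames. It is not the authors' method. The frame representation (flattened grayscale pixel lists), the function names `histogram`, `detect_scene_boundaries`, and `split_into_scenes`, and the `threshold` and `bins` values are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed representation: one frame = flattened grayscale pixels in 0..255.
Frame = Sequence[int]


def histogram(frame: Frame, bins: int = 8) -> List[float]:
    """Return the normalized intensity histogram of a single frame."""
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    total = len(frame) or 1  # avoid division by zero on an empty frame
    return [c / total for c in counts]


def detect_scene_boundaries(
    frames: Sequence[Frame], threshold: float = 0.5, bins: int = 8
) -> List[int]:
    """Return every index i at which frame i appears to start a new scene.

    A boundary is flagged when the L1 distance between the histograms of
    frame i-1 and frame i exceeds `threshold` (a hypothetical value).
    """
    boundaries: List[int] = []
    previous = None
    for i, frame in enumerate(frames):
        current = histogram(frame, bins)
        if previous is not None:
            distance = sum(abs(a - b) for a, b in zip(current, previous))
            if distance > threshold:
                boundaries.append(i)
        previous = current
    return boundaries


def split_into_scenes(
    num_frames: int, boundaries: Sequence[int]
) -> List[Tuple[int, int]]:
    """Turn boundary indices into half-open (start, end) frame ranges."""
    starts = [0] + list(boundaries)
    ends = list(boundaries) + [num_frames]
    return list(zip(starts, ends))


if __name__ == "__main__":
    # Toy clip: three dark frames followed by two bright frames.
    clip = [[10] * 16] * 3 + [[240] * 16] * 2
    cuts = detect_scene_boundaries(clip)
    print(cuts, split_into_scenes(len(clip), cuts))
```

In a V2A pipeline, the resulting per-scene frame ranges could be used to condition audio generation separately for each scene; how the paper's model actually consumes scene information is not stated in the abstract.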

Published In

2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference Apsipa ASC 2025

DOI

10.1109/APSIPAASC65261.2025.11249337

Publication Date

January 1, 2025

Start / End Page

1981 / 1985
 

Citation

APA: Yi, M., Wang, Y., & Li, M. (2025). Efficient Video to Audio Mapper with Visual Scene Detection. In 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference Apsipa ASC 2025 (pp. 1981–1985). https://doi.org/10.1109/APSIPAASC65261.2025.11249337

Chicago: Yi, M., Y. Wang, and M. Li. “Efficient Video to Audio Mapper with Visual Scene Detection.” In 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference Apsipa ASC 2025, 1981–85, 2025. https://doi.org/10.1109/APSIPAASC65261.2025.11249337.

ICMJE: Yi M, Wang Y, Li M. Efficient Video to Audio Mapper with Visual Scene Detection. In: 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference Apsipa ASC 2025. 2025. p. 1981–5.

MLA: Yi, M., et al. “Efficient Video to Audio Mapper with Visual Scene Detection.” 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference Apsipa ASC 2025, 2025, pp. 1981–85. Scopus, doi:10.1109/APSIPAASC65261.2025.11249337.

NLM: Yi M, Wang Y, Li M. Efficient Video to Audio Mapper with Visual Scene Detection. 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference Apsipa ASC 2025. 2025. p. 1981–1985.
