VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset

Publication, Conference
Wang, Y; Zhang, Q; Nishizaki, H; Li, M
Published in: Proceedings of the Annual Conference of the International Speech Communication Association Interspeech
January 1, 2025

Most existing deepfake datasets focus on speech synthesis or voice cloning, with little attention given to non-speech environmental sounds. Existing audio-focused datasets also lack video content, restricting progress in multimodal detection. To bridge these gaps, we introduce VCapAV, a large-scale audio-visual dataset designed to advance deepfake detection research involving environmental sound manipulations in multimodal scenarios. VCapAV is constructed through an innovative data generation pipeline that synthesizes realistic environmental audio using Text-to-Audio and Video-to-Audio approaches, while deepfake videos are generated with a Text-to-Video model. We establish two baseline detection tasks on this dataset: (i) audio-only deepfake detection and (ii) visual-only deepfake detection. Experimental results show how existing detection models perform on the VCapAV dataset compared to standard datasets such as ASVspoof 2019 LA and AV-Deepfake1M. The dataset and baseline code are released.

Published In

Proceedings of the Annual Conference of the International Speech Communication Association Interspeech

DOI

10.21437/Interspeech.2025-1713
EISSN

2958-1796

ISSN

2308-457X

Publication Date

January 1, 2025

Start / End Page

3908 / 3912

Citation

APA: Wang, Y., Zhang, Q., Nishizaki, H., & Li, M. (2025). VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset. In Proceedings of the Annual Conference of the International Speech Communication Association Interspeech (pp. 3908–3912). https://doi.org/10.21437/Interspeech.2025-1713

Chicago: Wang, Y., Q. Zhang, H. Nishizaki, and M. Li. “VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset.” In Proceedings of the Annual Conference of the International Speech Communication Association Interspeech, 3908–12, 2025. https://doi.org/10.21437/Interspeech.2025-1713.

ICMJE: Wang Y, Zhang Q, Nishizaki H, Li M. VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset. In: Proceedings of the Annual Conference of the International Speech Communication Association Interspeech. 2025. p. 3908–12.

MLA: Wang, Y., et al. “VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset.” Proceedings of the Annual Conference of the International Speech Communication Association Interspeech, 2025, pp. 3908–12. Scopus, doi:10.21437/Interspeech.2025-1713.

NLM: Wang Y, Zhang Q, Nishizaki H, Li M. VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset. Proceedings of the Annual Conference of the International Speech Communication Association Interspeech. 2025. p. 3908–3912.
