VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset
Most existing deepfake datasets focus on speech synthesis or voice cloning, with little attention given to non-speech environmental sounds. Existing audio-focused datasets also lack video content, restricting progress in multimodal detection. To bridge these gaps, we introduce VCapAV, a large-scale audio-visual dataset designed to advance deepfake detection research involving environmental sound manipulations in multimodal scenarios. VCapAV is constructed through an innovative data generation pipeline that synthesizes realistic environmental audio using Text-to-Audio and Video-to-Audio approaches, while deepfake videos are generated with a Text-to-Video model. We establish two baseline detection tasks on this dataset: (i) audio-only deepfake detection and (ii) visual-only deepfake detection. Experimental results benchmark existing detection models on the VCapAV dataset against standard datasets such as ASVspoof 2019 LA and AV-Deepfake1M. The dataset and baseline code* are released.