SLIDESPEECH: A LARGE SCALE SLIDE-ENRICHED AUDIO-VISUAL CORPUS

Publication, Conference

Wang, H; Yu, F; Shi, X; Wang, Y; Zhang, S; Li, M

Published in: ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings

January 1, 2024

Multi-modal automatic speech recognition (ASR) techniques aim to leverage additional modalities to improve the performance of speech recognition systems. While existing approaches primarily focus on video or contextual information, the use of supplementary textual information has been overlooked. Recognizing the abundance of online conference videos with slides, which provide rich domain-specific information in the form of text and images, we release SlideSpeech, a large-scale audio-visual corpus enriched with slides. The corpus contains 1,705 videos totaling more than 1,000 hours, including 473 hours of high-quality transcribed speech. Moreover, the corpus contains a significant number of real-time synchronized slides. In this work, we present the pipeline for constructing the corpus and propose baseline methods for utilizing text information in the visual slide context. By applying keyword extraction and contextual ASR methods in the benchmark system, we demonstrate the potential of improving speech recognition performance by incorporating textual information from supplementary video slides.

Duke Scholars

Ming Li, Author, DKU Faculty

Published In

ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings

DOI

10.1109/ICASSP48485.2024.10448079

ISSN

1520-6149

Publication Date

January 1, 2024

Start / End Page

11076 / 11080

Citation

APA

Wang, H., Yu, F., Shi, X., Wang, Y., Zhang, S., & Li, M. (2024). SLIDESPEECH: A LARGE SCALE SLIDE-ENRICHED AUDIO-VISUAL CORPUS. In ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings (pp. 11076–11080). https://doi.org/10.1109/ICASSP48485.2024.10448079

Chicago

Wang, H., F. Yu, X. Shi, Y. Wang, S. Zhang, and M. Li. “SLIDESPEECH: A LARGE SCALE SLIDE-ENRICHED AUDIO-VISUAL CORPUS.” In ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, 11076–80, 2024. https://doi.org/10.1109/ICASSP48485.2024.10448079.

ICMJE

Wang H, Yu F, Shi X, Wang Y, Zhang S, Li M. SLIDESPEECH: A LARGE SCALE SLIDE-ENRICHED AUDIO-VISUAL CORPUS. In: ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings. 2024. p. 11076–80.

MLA

Wang, H., et al. “SLIDESPEECH: A LARGE SCALE SLIDE-ENRICHED AUDIO-VISUAL CORPUS.” ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, 2024, pp. 11076–80. Scopus, doi:10.1109/ICASSP48485.2024.10448079.

NLM

Wang H, Yu F, Shi X, Wang Y, Zhang S, Li M. SLIDESPEECH: A LARGE SCALE SLIDE-ENRICHED AUDIO-VISUAL CORPUS. ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings. 2024. p. 11076–11080.
