The WHU Wake Word Lipreading System for the 2024 Chat-Scenario Chinese Lipreading Challenge
This paper describes the Wake Word Lipreading system developed by the WHU team for the ChatCLR Challenge 2024. Although Lipreading and Wake Word Spotting have both advanced significantly, pretrained frontends remain underexplored for Wake Word Lipreading (WWL). Our system combines a pretrained frontend with a Transformer-like backend, followed by Attentive Pooling and a Classifier. We investigate the effectiveness of different frontends, including Auto-AVSR and AV-HuBERT, and evaluate the performance of Conformer and E-Branchformer backends. Additionally, we introduce Multi-layer Feature Aggregation, which leverages features from multiple encoder block layers, and demonstrate its effectiveness. Finally, we apply several fusion strategies; score fusion achieves a false reject rate of 8.21% and a false alarm rate of 8.50%, for a WWS score of 16.71% on the evaluation set, ranking first in Task 1 of the ChatCLR Challenge.
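For readers unfamiliar with the metric, the following is a minimal sketch of how the three reported numbers relate. It assumes that the WWS score is the sum of the false reject rate and the false alarm rate, which is consistent with the figures above (8.21% + 8.50% = 16.71%) but is not stated explicitly here. The function name `wws_score` and the 0/1 label encoding are hypothetical and chosen only for illustration.

```python
def wws_score(labels, preds):
    """Return (FRR, FAR, score) for binary wake word decisions.

    labels, preds: equal-length sequences of 0/1, where 1 means
    "wake word present". Assumption: score = FRR + FAR.
    """
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")

    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative sample")

    # False reject: a wake word sample classified as negative.
    false_rejects = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    # False alarm: a non-wake-word sample classified as positive.
    false_alarms = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)

    frr = false_rejects / n_pos
    far = false_alarms / n_neg
    return frr, far, frr + far


# Toy example: 4 positive and 4 negative samples, one error of each kind.
frr, far, score = wws_score([1, 1, 1, 1, 0, 0, 0, 0],
                            [1, 1, 1, 0, 1, 0, 0, 0])
```

Under this assumption a lower score is better, and a system that rejects everything or accepts everything scores 1.0 (100%).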