Improving the Robustness of Audio-Visual Target Speaker Extraction With AV-HuBERT Based Lip Features

Publication, Conference
Du, J; Jin, Z; Zeng, B; Yang, P; Li, M; Liu, J
Published in: Communications in Computer and Information Science
January 1, 2026

The target speaker extraction task aims to recover the clean speech of a target person from a segment of mixed speech. In recent years, audio-visual speech enhancement (AVSE) has been increasingly adopted, since visual information about the target speaker is especially valuable in noisy environments. However, existing AVSE methods often suffer from insufficiently robust visual features, particularly when parts of the video are missing or its quality is poor; this significantly degrades the extracted visual features and, in turn, the extraction performance. To address this issue, this paper first introduces a power compression strategy that enhances the effective components of the speech signal and avoids over-reliance on visual information. An end-to-end training approach is then adopted to optimize the feature extraction process, partially alleviating the lack of robustness in lip-movement features. To further improve performance, the self-supervised AV-HuBERT model is used to extract lip-movement features: its multimodal self-supervised learning strategy captures more discriminative lip-motion dynamics and achieves deep consistency between audio and video features. Experimental results show that the proposed method achieves consistent improvements on key metrics such as PESQ, STOI, and SI-SDR, confirming the importance of visual feature extraction for the AVSE task and offering ideas for target speaker extraction in complex scenarios.
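As a rough illustration of the power compression idea mentioned in the abstract, the sketch below applies a power law to STFT magnitudes while preserving phase, which relatively boosts low-energy spectral components. The exponent `p = 0.3` and the function names are assumptions for illustration; the paper's exact compression pipeline and exponent are not given here.

```python
import numpy as np

def power_compress(stft, p=0.3):
    """Compress STFT magnitudes as |X|^p, keeping the phase unchanged.

    p < 1 boosts low-energy components relative to dominant ones.
    (p = 0.3 is a hypothetical choice, not the paper's setting.)
    """
    mag = np.abs(stft)
    phase = np.angle(stft)
    return (mag ** p) * np.exp(1j * phase)

def power_uncompress(stft, p=0.3):
    """Invert the power compression after enhancement."""
    mag = np.abs(stft)
    phase = np.angle(stft)
    return (mag ** (1.0 / p)) * np.exp(1j * phase)
```

Compress/uncompress form a round trip, so the enhancement network can operate in the compressed domain and the output can be mapped back before waveform reconstruction.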


Published In

Communications in Computer and Information Science

DOI

10.1007/978-981-95-5382-2_37

EISSN

1865-0937

ISSN

1865-0929

Publication Date

January 1, 2026

Volume

2662 CCIS

Start / End Page

482 / 493
 

Citation

APA: Du, J., Jin, Z., Zeng, B., Yang, P., Li, M., & Liu, J. (2026). Improving the Robustness of Audio-Visual Target Speaker Extraction With AV-HuBERT Based Lip Features. In Communications in Computer and Information Science (Vol. 2662 CCIS, pp. 482–493). https://doi.org/10.1007/978-981-95-5382-2_37
Chicago: Du, J., Z. Jin, B. Zeng, P. Yang, M. Li, and J. Liu. “Improving the Robustness of Audio-Visual Target Speaker Extraction With AV-HuBERT Based Lip Features.” In Communications in Computer and Information Science, 2662 CCIS:482–93, 2026. https://doi.org/10.1007/978-981-95-5382-2_37.
ICMJE: Du J, Jin Z, Zeng B, Yang P, Li M, Liu J. Improving the Robustness of Audio-Visual Target Speaker Extraction With AV-HuBERT Based Lip Features. In: Communications in Computer and Information Science. 2026. p. 482–93.
MLA: Du, J., et al. “Improving the Robustness of Audio-Visual Target Speaker Extraction With AV-HuBERT Based Lip Features.” Communications in Computer and Information Science, vol. 2662 CCIS, 2026, pp. 482–93. Scopus, doi:10.1007/978-981-95-5382-2_37.
NLM: Du J, Jin Z, Zeng B, Yang P, Li M, Liu J. Improving the Robustness of Audio-Visual Target Speaker Extraction With AV-HuBERT Based Lip Features. Communications in Computer and Information Science. 2026. p. 482–493.
