
Multimodal laryngoscopic video analysis for assisted diagnosis of vocal fold paralysis

Publication, Journal Article
Zhang, Y; Zou, X; Yang, J; Chen, W; Liu, J; Liang, F; Li, M
Published in: Computer Speech and Language
February 1, 2026

This paper presents the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a novel system that leverages both audio and video data to automatically extract key video segments and metrics from raw laryngeal videostroboscopic videos for assisted clinical assessment. The system integrates video-based glottis detection with an audio keyword spotting method to analyze both video and audio data, identifying patient vocalizations and refining video highlights to ensure optimal inspection of vocal fold movements. Beyond extracting key video segments from the raw laryngeal videos, MLVAS generates effective audio and visual features for Vocal Fold Paralysis (VFP) detection. Pre-trained audio encoders encode the patient's voice to produce the audio features. Visual features are generated by measuring the angle deviation of the left and right vocal folds from the estimated glottal midline on the segmented glottis masks. To obtain better masks, we introduce a diffusion-based refinement that follows traditional U-Net segmentation to reduce false positives. We conducted several ablation studies to demonstrate the effectiveness of each module and modality in the proposed MLVAS. The experimental results on a public segmentation dataset show the effectiveness of our proposed segmentation module. In addition, unilateral VFP classification results on a real-world clinic dataset demonstrate MLVAS's ability to provide reliable and objective metrics as well as visualization for assisted clinical diagnosis.
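The visual feature described in the abstract (the angle between each vocal fold and the glottal midline, measured on a segmented glottis mask) can be illustrated with a minimal sketch. The abstract does not say how the midline or the fold edges are estimated, so everything below is an assumption made for illustration: the midline is taken as the principal axis of the mask pixels, each fold edge is approximated by a straight line fitted to the outermost mask pixels on one side of that axis, and the function name `fold_angle_deviations` is hypothetical. Sides are labeled in image coordinates, which need not match the patient's left and right.

```python
import numpy as np


def fold_angle_deviations(mask):
    """Angle (degrees) between each glottis edge and an estimated midline.

    Assumption: the midline is the principal axis of the mask pixels and each
    fold edge is a straight line. This is not the paper's stated method.
    """
    ys, xs = np.nonzero(mask)
    if xs.size < 3:
        raise ValueError("mask has too few glottis pixels")

    # Pixel coordinates as (x, y), centered on the mask centroid.
    pts = np.column_stack([xs, ys]).astype(float)
    centered = pts - pts.mean(axis=0)

    # Principal axis of the pixel cloud = assumed midline direction.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    d = vt[0]
    if d[1] < 0:  # orient the axis to point down the image
        d = -d
    n = np.array([d[1], -d[0]])  # unit normal; positive side is image right

    # t: position along the midline (binned), s: signed lateral offset.
    t = np.round(centered @ d).astype(int)
    s = centered @ n
    bins = np.unique(t)

    angles = {}
    for name, pick in (("image_right", np.max), ("image_left", np.min)):
        # Outermost pixel on this side for every position along the midline.
        edge = np.array([pick(s[t == b]) for b in bins])
        # Slope of the fitted edge line relative to the midline.
        slope = np.polyfit(bins, edge, 1)[0]
        angles[name] = float(np.degrees(np.arctan(abs(slope))))
    return angles
```

On a symmetric mask the two angles should be close to each other, so a large asymmetry between them is the kind of signal such a feature could expose. Whether MLVAS uses the angles in this way is not stated in the abstract.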

Duke Scholars

Published In

Computer Speech and Language

DOI

10.1016/j.csl.2025.101891

EISSN

1095-8363

ISSN

0885-2308

Publication Date

February 1, 2026

Volume

96

Related Subject Headings

  • Speech-Language Pathology & Audiology
  • 46 Information and computing sciences
  • 40 Engineering
  • 2004 Linguistics
  • 1702 Cognitive Sciences
  • 0801 Artificial Intelligence and Image Processing
 

Citation

APA

Zhang, Y., Zou, X., Yang, J., Chen, W., Liu, J., Liang, F., & Li, M. (2026). Multimodal laryngoscopic video analysis for assisted diagnosis of vocal fold paralysis (Accepted). Computer Speech and Language, 96. https://doi.org/10.1016/j.csl.2025.101891

Chicago

Zhang, Y., X. Zou, J. Yang, W. Chen, J. Liu, F. Liang, and M. Li. “Multimodal laryngoscopic video analysis for assisted diagnosis of vocal fold paralysis (Accepted).” Computer Speech and Language 96 (February 1, 2026). https://doi.org/10.1016/j.csl.2025.101891.

ICMJE

Zhang Y, Zou X, Yang J, Chen W, Liu J, Liang F, et al. Multimodal laryngoscopic video analysis for assisted diagnosis of vocal fold paralysis (Accepted). Computer Speech and Language. 2026 Feb 1;96.

MLA

Zhang, Y., et al. “Multimodal laryngoscopic video analysis for assisted diagnosis of vocal fold paralysis (Accepted).” Computer Speech and Language, vol. 96, Feb. 2026. Scopus, doi:10.1016/j.csl.2025.101891.

NLM

Zhang Y, Zou X, Yang J, Chen W, Liu J, Liang F, Li M. Multimodal laryngoscopic video analysis for assisted diagnosis of vocal fold paralysis (Accepted). Computer Speech and Language. 2026 Feb 1;96.