Simultaneous Speech Extraction for Multiple Target Speakers Under Meeting Scenarios
Conventional target speech separation directly estimates the target source, ignoring the interrelationship among different speakers at each frame. We propose a multiple-target speech separation (MTSS) model that simultaneously extracts every speaker's voice from the mixed speech, rather than estimating only the target source. We further propose a speaker-diarization-aware MTSS system (SD-MTSS). By exploiting target-speaker voice activity detection (TSVAD) and the estimated masks, the SD-MTSS system can extract each speaker's speech signal concurrently from a conversational recording without requiring enrollment audio in advance. Experimental results show that the MTSS model improves over the baseline on the WSJ0-2mix-extr dataset by 1.38 dB in signal-to-distortion ratio (SDR), 1.34 dB in scale-invariant signal-to-distortion ratio (SI-SDR), and 0.13 in perceptual evaluation of speech quality (PESQ). The SD-MTSS system achieves a 19.2% relative reduction in speaker-dependent character error rate on the AliMeeting dataset.
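The abstract reports gains in SI-SDR, the standard scale-invariant separation metric. As a point of reference (not the authors' implementation), SI-SDR can be computed by projecting the estimate onto the reference to find the optimal scaling, then taking the energy ratio of the scaled target to the residual. A minimal NumPy sketch, assuming 1-D time-domain signals of equal length:

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio (SI-SDR) in dB.

    estimate, reference: 1-D time-domain signals of equal length.
    """
    # Zero-mean both signals so DC offsets do not affect the projection.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference (projection of estimate onto reference).
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference          # scaled target component
    noise = estimate - target           # everything not explained by the reference
    return 10 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))
```

Because the estimate is rescaled by the projection coefficient before the ratio is taken, multiplying the estimate by any nonzero gain leaves the score unchanged, which is the property the metric's name refers to.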
Related Subject Headings
- General Science & Technology
- 4015 Maritime engineering
- 0911 Maritime Engineering
- 0906 Electrical and Electronic Engineering