Selective Channel Attention-based Target Speaker Voice Activity Detection for Speaker Diarization under Ad-Hoc Microphone Array Settings
Speaker diarization benefits from multi-channel microphone arrays, yet current systems struggle to handle diverse array configurations. We address this by simulating a dataset with various microphone topologies and proposing Selective Channel Attention-based Target Speaker Voice Activity Detection (SCA-TSVAD). Built on single-channel TSVAD, SCA-TSVAD uses cross-channel self-attention with masking mechanisms to attend selectively to specific channels, which allows it to process audio with variable multi-channel configurations. It achieves superior performance on our simulated dataset, demonstrating its robustness across diverse array configurations. To further validate its effectiveness on real data, we evaluate SCA-TSVAD on the real-world Ali-Meeting database, where it successfully handles multi-channel audio inputs even when some channels are unavailable or malfunctioning, demonstrating its practical applicability.
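The masking idea behind selective channel attention can be illustrated with a minimal NumPy sketch. This is not the SCA-TSVAD implementation: the function name `masked_channel_attention`, the tensor layout, and the use of identity query/key/value projections (no learned weights) are all assumptions made for illustration. The sketch only shows how a channel mask can exclude missing or malfunctioning channels from cross-channel self-attention, so that the same code handles any number of usable channels.

```python
import numpy as np

def masked_channel_attention(x, channel_mask):
    """Hypothetical sketch of cross-channel self-attention with a channel mask.

    x            : (C, T, D) array of per-channel frame features.
    channel_mask : (C,) boolean array, True for usable channels.
    Returns (out, weights) with out of shape (C, T, D) and
    weights of shape (T, C, C).
    Assumption: queries, keys and values are the raw features
    (a real model would apply learned projections).
    """
    channel_mask = np.asarray(channel_mask, dtype=bool)
    if not channel_mask.any():
        raise ValueError("at least one channel must be usable")

    C, T, D = x.shape
    xt = np.transpose(x, (1, 0, 2))                          # (T, C, D)

    # Scaled dot-product scores between channels, computed per frame.
    scores = xt @ np.transpose(xt, (0, 2, 1)) / np.sqrt(D)   # (T, C, C)

    # Masked channels are removed as keys: their scores become -inf,
    # so they receive exactly zero weight after the softmax.
    scores = np.where(channel_mask[None, None, :], scores, -np.inf)

    # Numerically stable softmax over the key-channel axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    out = weights @ xt                                       # (T, C, D)
    return np.transpose(out, (1, 0, 2)), weights


# Usage: four channels, the third one is treated as unavailable.
rng = np.random.default_rng(0)
features = rng.standard_normal((4, 5, 8))
mask = np.array([True, True, False, True])
fused, attn = masked_channel_attention(features, mask)
```

In this sketch the outputs computed for the usable channels do not depend on the content of the masked channel, which is the property that lets a single model cope with arrays of different sizes or with dropped channels.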