Speaker diarization system for autism children's real-life audio data
In this paper, we introduce several methods to improve the performance of a speaker diarization system for real-life audio data from children with autism. The system serves as the front-end module for further speech analysis. Our objective is to detect the children's speech in single-channel, noisy, everyday audio recordings collected by wearable devices in real environments. First, within the conventional framework of generalized likelihood ratio (GLR) distance with agglomerative hierarchical clustering (AHC), in addition to the line spectral pair (LSP) based GLR distance, we propose a weighted sum of multiple GLR distances that combines LSP, pitch, energy and phoneme duration information. Second, since we only want to extract the children's speech with high purity for further speech analysis, we use a 30-second enrollment utterance from each child to perform supervised child cluster selection based on i-vector cosine distance. We find that performing supervised cluster selection at early stages of AHC yields higher purity. We evaluate our methods on a 120-minute subset of data collected from three children during child-therapist interactions. Experimental results show that our methods significantly outperform the GLR-AHC baseline in terms of the child cluster's recall and precision.
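
As a rough illustration of the two ideas above, the following Python sketch computes a GLR distance between two segments, fuses the per-feature GLR distances by a weighted sum, and picks the cluster closest to an enrollment vector by cosine distance. It is a minimal sketch under stated assumptions, not the system described here: each segment is assumed to be modeled by a single full-covariance Gaussian, the feature stream names and weights are hypothetical, and the i-vectors are assumed to be precomputed by some external extractor. All function names (`glr_distance`, `fused_distance`, `select_child_cluster`) are illustrative.

```python
import numpy as np


def glr_distance(x, y):
    """GLR distance between two segments, each an array of shape (frames, dims).

    Assumption: every segment is modeled by one full-covariance Gaussian
    with maximum-likelihood parameters, so the negative log likelihood
    ratio reduces to a difference of weighted log-determinants.
    """
    def half_n_logdet(seg):
        cov = np.atleast_2d(np.cov(seg, rowvar=False, bias=True))
        cov = cov + 1e-6 * np.eye(cov.shape[0])  # small ridge for stability
        _, logdet = np.linalg.slogdet(cov)
        return 0.5 * len(seg) * logdet

    merged = np.vstack([x, y])
    return half_n_logdet(merged) - half_n_logdet(x) - half_n_logdet(y)


def fused_distance(seg_a, seg_b, weights):
    """Weighted sum of per-feature GLR distances.

    seg_a and seg_b map a feature stream name (e.g. "lsp", "pitch") to a
    (frames, dims) array; weights maps the same names to hypothetical
    fusion weights.
    """
    return sum(w * glr_distance(seg_a[name], seg_b[name])
               for name, w in weights.items())


def cosine_distance(u, v):
    """Cosine distance between two vectors (0 means same direction)."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def select_child_cluster(cluster_ivectors, enrollment_ivector):
    """Index of the cluster whose i-vector is closest to the enrollment one.

    Assumption: the i-vectors are already extracted; extraction itself is
    outside the scope of this sketch.
    """
    dists = [cosine_distance(c, enrollment_ivector) for c in cluster_ivectors]
    return int(np.argmin(dists))
```

In an AHC loop, `fused_distance` would stand in for the single LSP-based distance when choosing which pair of clusters to merge, and `select_child_cluster` would be applied to the clusters remaining at the chosen stopping stage.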