The DKU Speech Activity Detection and Speaker Identification Systems for Fearless Steps Challenge Phase-02
This paper describes the systems developed by the DKU team for the Fearless Steps Challenge Phase-02 competition. For the Speech Activity Detection task, we start with a Long Short-Term Memory (LSTM) system and then improve it with a ResNet-LSTM architecture. Our ResNet-LSTM system reduces the detection cost function (DCF) by about 38% relative to the LSTM baseline. We also discuss system performance when additional training corpora are included; the lowest DCF of 1.406% on the Eval Set is obtained with system pre-training. For the Speaker Identification task, we employ a Deep ResNet vector system, which takes a variable-length feature sequence as input and directly outputs speaker posteriors. We also consider pre-training on VoxCeleb, and our best-performing system achieves a Top-5 accuracy of 92.393% on the Eval Set.
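To make the reported metrics concrete, the following minimal Python sketch shows one way the quantities mentioned above could be computed. It rests on assumptions that this abstract does not state: the DCF is taken to be a weighted sum of the miss and false-alarm rates with weights 0.75 and 0.25, it is scored frame by frame without any collar around segment boundaries, and the function names `detection_cost`, `top_k_accuracy`, and `relative_reduction` are hypothetical. It is an illustration, not the challenge's official scoring tool.

```python
def detection_cost(ref, hyp, w_fn=0.75, w_fp=0.25):
    """Frame-level DCF = w_fn * P_FN + w_fp * P_FP.

    ref, hyp: equal-length sequences of 0/1 labels (1 = speech).
    The 0.75/0.25 weights are an assumption, not taken from the abstract.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same length")
    n_speech = sum(ref)
    n_nonspeech = len(ref) - n_speech
    # Missed speech frames and false-alarm frames.
    fn = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 0)
    fp = sum(1 for r, h in zip(ref, hyp) if r == 0 and h == 1)
    p_fn = fn / n_speech if n_speech else 0.0
    p_fp = fp / n_nonspeech if n_nonspeech else 0.0
    return w_fn * p_fn + w_fp * p_fp


def top_k_accuracy(posteriors, labels, k=5):
    """Fraction of utterances whose true speaker is among the k highest posteriors."""
    hits = 0
    for scores, label in zip(posteriors, labels):
        # Indices of the k largest scores for this utterance.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


def relative_reduction(baseline, improved):
    """Relative improvement of a cost, e.g. 0.38 for a 38% relative reduction."""
    return (baseline - improved) / baseline
```

Under these assumptions, a system that misses one of four speech frames and raises a false alarm on one of four non-speech frames has a DCF of 0.75 × 0.25 + 0.25 × 0.25 = 0.25, that is, 25%.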