SMIIP-NV: A Multi-Annotation Non-Verbal Expressive Speech Corpus in Mandarin for LLM-Based Speech Synthesis
In natural spoken communication, emotions are often conveyed through non-verbal sounds (NVs) such as laughter, crying, and coughing. However, most existing text-to-speech (TTS) corpora lack annotations for these non-verbal sounds, leading to a scarcity of systems capable of generating them. To address this gap, we introduce SMIIP-NV, a non-verbal speech synthesis corpus annotated with both emotions and non-verbal sounds, including laughter, crying, and coughing. To the best of our knowledge, SMIIP-NV is the largest open-source expressive speech corpus that includes non-verbal speech with rich annotations. It comprises 33 hours of speech data covering five distinct emotions and three types of non-verbal sounds, with detailed transcriptions and precise timestamps for each occurrence of a non-verbal sound. Additionally, the corpus provides annotations for speech segments that contain laughter or crying. To demonstrate the utility of this dataset, we establish a baseline for non-verbal speech synthesis using a lightweight large language model (LLM). The SMIIP-NV dataset and static audio demonstrations are publicly available at https://axunyii.github.io/SMIIP-NV, and interactive real-time demonstrations can be accessed at https://huggingface.co/spaces/xunyi/SMIIP-NV_Finetuned_CosyVoice2.