Scholars@Duke publication: InstructSing: High-Fidelity Singing Voice Generation Via Instructing Yourself

Publication, Conference

Zeng, C; Wang, C; Miao, X; Zhao, J; Jiang, Z; Chen, Y

Published in: Proceedings of 2024 IEEE Spoken Language Technology Workshop SLT 2024

January 1, 2024

Accelerating the training of a neural vocoder while preserving both high-quality generated voices and acceptable inference speed is challenging. In this paper, we propose a novel neural vocoder called InstructSing, which converges much faster than other neural vocoders while maintaining good performance by integrating differentiable digital signal processing and adversarial training. It comprises one generator and two discriminators. Specifically, the generator incorporates a harmonic-plus-noise (HN) module that produces 8 kHz audio as an instructive signal. The HN module is connected to an extended WaveNet by a UNet-based module, which transforms the HN output into a latent variable sequence containing essential periodic and aperiodic information. In addition to the latent sequence, the extended WaveNet takes the mel-spectrogram as input to generate 48 kHz high-fidelity singing voices. For the discriminators, we combine a multi-period discriminator, as originally proposed in HiFi-GAN, with a multi-resolution multiband STFT discriminator. Notably, InstructSing achieves voice quality comparable to that of other neural vocoders with only one-tenth of the training steps on a machine with four NVIDIA V100 GPUs. We plan to open-source our code and pretrained model once the paper is accepted.
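
The harmonic-plus-noise idea mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' HN module, which the abstract describes as part of a differentiable, adversarially trained generator; it is only a fixed, hand-written synthesizer showing how a periodic component (a sum of harmonics of the fundamental frequency) and an aperiodic component (noise) can be combined into 8 kHz audio. The function name, the hop size, the number of harmonics, the 1/k harmonic weights, and the noise gain are all assumptions made for illustration.

```python
import math
import random


def harmonic_plus_noise(f0_hz, sample_rate=8000, hop=80, n_harmonics=8,
                        noise_gain=0.03, seed=0):
    """Toy harmonic-plus-noise synthesizer (illustrative only).

    f0_hz: one fundamental-frequency value per frame; 0 marks an unvoiced frame.
    Returns a list of float samples at `sample_rate`.
    """
    rng = random.Random(seed)
    nyquist = sample_rate / 2
    phase = 0.0  # running phase of the fundamental, in radians
    out = []
    for f0 in f0_hz:
        for _ in range(hop):
            noise = noise_gain * rng.gauss(0.0, 1.0)
            if f0 <= 0:
                # Unvoiced frame: aperiodic (noise) component only.
                out.append(noise)
                continue
            # Accumulate phase so the waveform stays continuous across frames.
            phase = (phase + 2 * math.pi * f0 / sample_rate) % (2 * math.pi)
            # Keep only harmonics below the Nyquist frequency to avoid aliasing.
            ks = [k for k in range(1, n_harmonics + 1) if k * f0 < nyquist]
            harmonic = sum(math.sin(k * phase) / k for k in ks) / max(len(ks), 1)
            out.append(harmonic + noise)
    return out


# 50 voiced frames at 220 Hz followed by 50 unvoiced frames.
audio = harmonic_plus_noise([220.0] * 50 + [0.0] * 50)
print(len(audio))  # 100 frames * 80 samples per frame = 8000 samples (1 s at 8 kHz)
```

In a learned system the harmonic amplitudes and the noise shaping would be predicted by the network rather than fixed as they are here.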

Duke Scholars

Author Xiaoxiao Miao DKU Faculty

Published In

Proceedings of 2024 IEEE Spoken Language Technology Workshop SLT 2024

DOI

10.1109/SLT61566.2024.10832285

Publication Date

January 1, 2024

Start / End Page

675 / 681

Citation

APA

Zeng, C., Wang, C., Miao, X., Zhao, J., Jiang, Z., & Chen, Y. (2024). InstructSing: High-Fidelity Singing Voice Generation Via Instructing Yourself. In Proceedings of 2024 IEEE Spoken Language Technology Workshop SLT 2024 (pp. 675–681). https://doi.org/10.1109/SLT61566.2024.10832285

Chicago

Zeng, C., C. Wang, X. Miao, J. Zhao, Z. Jiang, and Y. Chen. “InstructSing: High-Fidelity Singing Voice Generation Via Instructing Yourself.” In Proceedings of 2024 IEEE Spoken Language Technology Workshop SLT 2024, 675–81, 2024. https://doi.org/10.1109/SLT61566.2024.10832285.

ICMJE

Zeng C, Wang C, Miao X, Zhao J, Jiang Z, Chen Y. InstructSing: High-Fidelity Singing Voice Generation Via Instructing Yourself. In: Proceedings of 2024 IEEE Spoken Language Technology Workshop SLT 2024. 2024. p. 675–81.

MLA

Zeng, C., et al. “InstructSing: High-Fidelity Singing Voice Generation Via Instructing Yourself.” Proceedings of 2024 IEEE Spoken Language Technology Workshop SLT 2024, 2024, pp. 675–81. Scopus, doi:10.1109/SLT61566.2024.10832285.

NLM

Zeng C, Wang C, Miao X, Zhao J, Jiang Z, Chen Y. InstructSing: High-Fidelity Singing Voice Generation Via Instructing Yourself. Proceedings of 2024 IEEE Spoken Language Technology Workshop SLT 2024. 2024. p. 675–681.
