Journal Article · Computer Speech and Language · November 1, 2025
A general disentanglement-based speaker anonymization system typically separates speech into content, speaker, and prosody features using individual encoders. This paper explores how to adapt such a system when a new speech attribute, for example, emotion, ...
Journal Article · IEEE Transactions on Information Forensics and Security · January 1, 2025
Privacy-preserving voice protection approaches primarily suppress privacy-related information derived from paralinguistic attributes while preserving the linguistic content. Existing solutions focus particularly on single-speaker scenarios. However, they l ...
Conference · ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings · January 1, 2025
The First VoicePrivacy Attacker Challenge is an ICASSP 2025 SP Grand Challenge which focuses on evaluating attacker systems against a set of voice anonymization systems submitted to the VoicePrivacy 2024 Challenge. Training, development, and evaluation dat ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association Interspeech · January 1, 2025
Assessment of children's speaking fluency in education is well researched for majority languages, but remains highly challenging for low-resource languages. This paper proposes a system to automatically assess fluency by combining a fine-tuned multilingual ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association Interspeech · January 1, 2025
This paper presents an ultra-low bitrate speech codec that achieves high-fidelity speech coding at 1.2 kbps while maintaining low computational complexity. Building upon the LPCNet framework, combined with a parametric encoder, we introduce several key impr ...
Journal Article · Computer Speech and Language · June 1, 2024
Conventional automatic speaker verification systems can usually be decomposed into a front-end model, such as a time delay neural network (TDNN), for extracting speaker embeddings and a back-end model such as statistics-based probabilistic linear discriminant ...
Journal Article · IEEE Open Journal of Signal Processing · January 1, 2024
Speaker anonymization is the task of modifying a speech recording such that the original speaker cannot be identified anymore. Since the first Voice Privacy Challenge in 2020, along with the release of a framework, the popularity of this research topic is ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association Interspeech · January 1, 2024
This paper presents a novel approach to target speaker extraction (TSE) using Curriculum Learning (CL) techniques, addressing the challenge of distinguishing a target speaker's voice from a mixture containing interfering speakers. For efficient training, w ...
Conference · Proceedings of 2024 IEEE Spoken Language Technology Workshop SLT 2024 · January 1, 2024
In real-world applications, it is challenging to build a speaker verification system that is simultaneously robust against common threats, including spoofing attacks, channel mismatch, and domain mismatch. Traditional automatic speaker verification (ASV) s ...
Conference · ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings · January 1, 2024
The success of deep learning in speaker recognition relies heavily on the use of large datasets. However, the data-hungry nature of deep learning methods has already been questioned on account of the ethical, privacy, and legal concerns that arise when using ...
Conference · Proceedings of 2024 IEEE Spoken Language Technology Workshop SLT 2024 · January 1, 2024
It is challenging to accelerate the training process while ensuring both high-quality generated voices and acceptable inference speed. In this paper, we propose a novel neural vocoder called InstructSing, which can converge much faster compared with other ...
Journal Article · IEEE/ACM Transactions on Audio, Speech, and Language Processing · January 1, 2024
The VoicePrivacy Challenge promotes the development of voice anonymisation solutions for speech technology. In this paper we present a systematic overview and analysis of the second edition held in 2022. We describe the voice anonymisation task and datase ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association Interspeech · January 1, 2023
The ability of countermeasure models to generalize from seen speech synthesis methods to unseen ones has been investigated in the ASVspoof challenge. However, a new mismatch scenario in which fake audio may be generated from real audio with unseen genres h ...
Conference · ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings · January 1, 2023
The use of modern vocoders in an analysis/synthesis pipeline allows us to investigate high-quality voice conversion that can be used for privacy purposes. Here, we propose to transform the speaker embedding and the pitch in order to hide the sex of the spe ...
Journal Article · IEEE/ACM Transactions on Audio, Speech, and Language Processing · January 1, 2023
Speaker anonymization aims to conceal a speaker's identity while preserving content information in speech. Current mainstream neural-network speaker anonymization systems disentangle speech into prosody-related, content, and speaker representations. The sp ...
Journal Article · Electronics Letters · January 1, 2022
Data augmentation is an essential technique for building a high-robustness speaker recognition system. This letter proposes a novel on-the-fly data augmentation strategy called GuidedMix. It significantly increases augmented data fidelity by mixing the spe ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association Interspeech · January 1, 2022
In our previous work, we proposed a language-independent speaker anonymization system based on self-supervised learning models. Although the system can anonymize speech data of any language, the anonymization was imperfect, and the speech content of the an ...
Conference · ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings · January 1, 2022
Probabilistic linear discriminant analysis (PLDA) and cosine similarity have been widely used in traditional speaker verification systems as back-end techniques to measure pairwise similarities. To make better use of multiple enrollment utterances, we propo ...
Journal Article · Circuits Systems and Signal Processing · July 1, 2021
This paper proposes novel features for automated language and dialect identification that aim to improve discriminative power by ensuring that each element of the feature vector has a normalised contribution to inter-class variance. The method firstly comp ...
Journal Article · Neural Networks: The Official Journal of the International Neural Network Society · July 2021
Attention-based convolutional neural network (CNN) models are increasingly being adopted for speaker and language recognition (SR/LR) tasks. These include time, frequency, spatial and channel attention, which can focus on useful time frames, frequency band ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association Interspeech · January 1, 2021
Deep-Neural-Network (DNN) based speaker verification systems use the angular softmax loss with margin penalties to enhance the intra-class compactness of speaker embeddings, which has achieved remarkable performance. In this paper, we propose a novel angular l ...
Journal Article · Electronics Letters · July 9, 2020
With the widespread use of automatic speaker recognition in the real world, performance degrades considerably under domain mismatch, such as mismatch in channel, language, or distance. Recent research studies have introduced the adversarial-learning mechanism into deep n ...
Journal Article · Circuits Systems and Signal Processing · May 1, 2020
In this paper, we aim to improve traditional DNN x-vector language identification performance by employing wide residual networks (WRN) as a powerful feature extractor which we combine with a novel frequency attention network. Compared with conventional ti ...
Journal Article · Jisuanji Yanjiu Yu Fazhan Computer Research and Development · May 1, 2019
Language identification (LID) accuracy is often significantly reduced when the duration of the test data and the training data are mismatched. This paper proposes a method to compensate language features using a denoising autoencoder (DAE). Use of denoisin ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association Interspeech · January 1, 2019
In this paper, we aim to improve traditional DNN x-vector language identification (LID) performance by employing Convolutional and Long Short Term Memory-Recurrent (CLSTM) Neural Networks, as they can strengthen feature extraction and capture longer tempor ...
Conference · 2018 IEEE Spoken Language Technology Workshop SLT 2018 Proceedings · July 2, 2018
Recent research on generative adversarial nets (GAN) for language identification (LID) has shown promising results. In this paper, we further exploit the latent abilities of GAN networks to firstly combine them with deep neural network (DNN)-based i-vector ...
Journal Article · Qinghua Daxue Xuebao Journal of Tsinghua University · March 1, 2018
Language recognition (LR) accuracy is often significantly reduced when the test utterance duration is as short as 10 s or less. This paper describes a method to extend the utterance length using time-scale modification (TSM), which changes the speech ra ...