
Xiaoxiao Miao

Assistant Professor of Computer Science at Duke Kunshan University
DKU Faculty

Selected Publications


Adapting general disentanglement-based speaker anonymization for enhanced emotion preservation

Journal Article Computer Speech and Language · November 1, 2025 A general disentanglement-based speaker anonymization system typically separates speech into content, speaker, and prosody features using individual encoders. This paper explores how to adapt such a system when a new speech attribute, for example, emotion, ...
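To make the disentanglement idea concrete, here is a minimal, hypothetical sketch of such a pipeline in PyTorch: separate encoders produce content, speaker, and prosody streams, the speaker embedding is swapped for a pseudo-speaker, and a decoder resynthesizes speech. All module names, sizes, and the pseudo-speaker rule are placeholders, not the system described in the paper.

```python
# Hypothetical sketch of a disentanglement-based speaker anonymization
# pipeline: content, speaker, and prosody encoders, speaker replacement,
# and a decoder. Module internals are toy placeholders.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Stand-in encoder mapping a waveform to a fixed-size representation."""
    def __init__(self, dim_out):
        super().__init__()
        self.proj = nn.Linear(16000, dim_out)  # toy: 1 s of 16 kHz audio

    def forward(self, wav):
        return self.proj(wav)

class Decoder(nn.Module):
    """Stand-in decoder mapping concatenated features back to a waveform."""
    def __init__(self, dim_in):
        super().__init__()
        self.proj = nn.Linear(dim_in, 16000)

    def forward(self, feats):
        return self.proj(feats)

def anonymize_speaker(spk_emb, pool):
    """Replace the original speaker embedding with a pseudo-speaker,
    here simply the mean of a pool of external embeddings (illustrative)."""
    return pool.mean(dim=0, keepdim=True).expand_as(spk_emb)

content_enc, speaker_enc, prosody_enc = Encoder(256), Encoder(192), Encoder(64)
decoder = Decoder(256 + 192 + 64)

wav = torch.randn(1, 16000)     # toy input utterance
pool = torch.randn(50, 192)     # toy external speaker pool

content = content_enc(wav)
prosody = prosody_enc(wav)
speaker = anonymize_speaker(speaker_enc(wav), pool)

anonymized_wav = decoder(torch.cat([content, speaker, prosody], dim=-1))
print(anonymized_wav.shape)     # torch.Size([1, 16000])
```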

A Benchmark for Multi-Speaker Anonymization

Journal Article IEEE Transactions on Information Forensics and Security · January 1, 2025 Privacy-preserving voice protection approaches primarily suppress privacy-related information derived from paralinguistic attributes while preserving the linguistic content. Existing solutions focus particularly on single-speaker scenarios. However, they ...

The First VoicePrivacy Attacker Challenge

Conference ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings · January 1, 2025 The First VoicePrivacy Attacker Challenge is an ICASSP 2025 SP Grand Challenge which focuses on evaluating attacker systems against a set of voice anonymization systems submitted to the VoicePrivacy 2024 Challenge. Training, development, and evaluation ...

Automated evaluation of children's speech fluency for low-resource languages

Conference Proceedings of the Annual Conference of the International Speech Communication Association Interspeech · January 1, 2025 Assessment of children's speaking fluency in education is well researched for majority languages, but remains highly challenging for low-resource languages. This paper proposes a system to automatically assess fluency by combining a fine-tuned multilingual ...

LSPnet: an ultra-low bitrate hybrid neural codec

Conference Proceedings of the Annual Conference of the International Speech Communication Association Interspeech · January 1, 2025 This paper presents an ultra-low bitrate speech codec that achieves high-fidelity speech coding at 1.2 kbps while maintaining low computational complexity. Building upon the LPCNet framework, combined with a parametric encoder, we introduce several key ...

Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances

Journal Article Computer Speech and Language · June 1, 2024 Conventional automatic speaker verification systems can usually be decomposed into a front-end model such as time delay neural network (TDNN) for extracting speaker embeddings and a back-end model such as statistics-based probabilistic linear discriminant ...

VoicePAT: An Efficient Open-Source Evaluation Toolkit for Voice Privacy Research

Journal Article IEEE Open Journal of Signal Processing · January 1, 2024 Speaker anonymization is the task of modifying a speech recording such that the original speaker cannot be identified anymore. Since the first Voice Privacy Challenge in 2020, along with the release of a framework, the popularity of this research topic is ...

Target Speaker Extraction with Curriculum Learning

Conference Proceedings of the Annual Conference of the International Speech Communication Association Interspeech · January 1, 2024 This paper presents a novel approach to target speaker extraction (TSE) using Curriculum Learning (CL) techniques, addressing the challenge of distinguishing a target speaker's voice from a mixture containing interfering speakers. For efficient training, ...

Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches

Conference Proceedings of 2024 IEEE Spoken Language Technology Workshop SLT 2024 · January 1, 2024 In real-world applications, it is challenging to build a speaker verification system that is simultaneously robust against common threats, including spoofing attacks, channel mismatch, and domain mismatch. Traditional automatic speaker verification (ASV) ...

SynVox2: Towards a Privacy-Friendly VoxCeleb2 Dataset

Conference ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings · January 1, 2024 The success of deep learning in speaker recognition relies heavily on the use of large datasets. However, the data-hungry nature of deep learning methods has already been questioned on account of the ethical, privacy, and legal concerns that arise when using ...

InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself

Conference Proceedings of 2024 IEEE Spoken Language Technology Workshop SLT 2024 · January 1, 2024 It is challenging to accelerate the training process while ensuring both high-quality generated voices and acceptable inference speed. In this paper, we propose a novel neural vocoder called InstructSing, which can converge much faster compared with other ...

The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation

Journal Article IEEE ACM Transactions on Audio Speech and Language Processing · January 1, 2024 The VoicePrivacy Challenge promotes the development of voice anonymisation solutions for speech technology. In this paper, we present a systematic overview and analysis of the second edition held in 2022. We describe the voice anonymisation task and ...

Improving Generalization Ability of Countermeasures for New Mismatch Scenario by Combining Multiple Advanced Regularization Terms

Conference Proceedings of the Annual Conference of the International Speech Communication Association Interspeech · January 1, 2023 The ability of countermeasure models to generalize from seen speech synthesis methods to unseen ones has been investigated in the ASVspoof challenge. However, a new mismatch scenario in which fake audio may be generated from real audio with unseen genres ...

Hiding Speaker's Sex in Speech Using Zero-Evidence Speaker Representation in an Analysis/Synthesis Pipeline

Conference ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings · January 1, 2023 The use of modern vocoders in an analysis/synthesis pipeline allows us to investigate high-quality voice conversion that can be used for privacy purposes. Here, we propose to transform the speaker embedding and the pitch in order to hide the sex of the ...

Speaker Anonymization Using Orthogonal Householder Neural Network

Journal Article IEEE ACM Transactions on Audio Speech and Language Processing · January 1, 2023 Speaker anonymization aims to conceal a speaker's identity while preserving content information in speech. Current mainstream neural-network speaker anonymization systems disentangle speech into prosody-related, content, and speaker representations. ...
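The title's orthogonal Householder neural network suggests rotating the original speaker embedding with orthogonal transformations so that identity changes while geometry, such as the embedding norm, is preserved. Below is a generic NumPy sketch of a single Householder reflection applied to a toy embedding; the reflection direction would be learned in the paper's model, and the sketch is illustrative only.

```python
# Hypothetical illustration: a Householder reflection
# H = I - 2 v v^T / (v^T v) is orthogonal, so applying it to a speaker
# embedding changes its direction while exactly preserving its norm.
import numpy as np

rng = np.random.default_rng(0)

def householder_reflect(x, v):
    """Reflect vector x across the hyperplane orthogonal to v."""
    v = v / np.linalg.norm(v)
    return x - 2.0 * np.dot(v, x) * v

spk_emb = rng.normal(size=192)   # original speaker embedding (toy)
v = rng.normal(size=192)         # reflection direction (would be learned)

anon_emb = householder_reflect(spk_emb, v)

print(np.linalg.norm(spk_emb), np.linalg.norm(anon_emb))  # norms match
print(np.dot(spk_emb, anon_emb) /
      (np.linalg.norm(spk_emb) * np.linalg.norm(anon_emb)))  # cosine changes
```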

GuidedMix: An on-the-fly data augmentation approach for robust speaker recognition system

Journal Article Electronics Letters · January 1, 2022 Data augmentation is an essential technique for building a high-robustness speaker recognition system. This letter proposes a novel on-the-fly data augmentation strategy called GuidedMix. It significantly increases augmented data fidelity by mixing the ...
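As a point of reference, a plain on-the-fly augmentation baseline mixes a training utterance with an interfering signal at a random SNR each time it is loaded; the NumPy sketch below shows only that baseline. GuidedMix's specific guidance strategy for choosing what to mix is not reproduced here.

```python
# Generic on-the-fly additive mixing for speaker-recognition augmentation.
# This is a plain baseline sketch, not GuidedMix's selection rule.
import numpy as np

rng = np.random.default_rng(0)

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio is `snr_db`, then add."""
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

speech = rng.normal(size=16000)   # toy 1 s utterance
noise = rng.normal(size=16000)    # toy interference
augmented = mix_at_snr(speech, noise, snr_db=rng.uniform(5, 20))
print(augmented.shape)
```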

Analyzing Language-Independent Speaker Anonymization Framework under Unseen Conditions

Conference Proceedings of the Annual Conference of the International Speech Communication Association Interspeech · January 1, 2022 In our previous work, we proposed a language-independent speaker anonymization system based on self-supervised learning models. Although the system can anonymize speech data of any language, the anonymization was imperfect, and the speech content of the ...

Attention Back-End for Automatic Speaker Verification with Multiple Enrollment Utterances

Conference ICASSP IEEE International Conference on Acoustics Speech and Signal Processing Proceedings · January 1, 2022 Probabilistic linear discriminant analysis (PLDA) or cosine similarity have been widely used in traditional speaker verification systems as back-end techniques to measure pairwise similarities. To make better use of multiple enrollment utterances, we ...
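For context, a conventional cosine back-end with multiple enrollment utterances simply averages the enrollment embeddings before scoring; the sketch below contrasts that with a toy similarity-weighted pooling. Both are illustrative baselines, not the attention back-end proposed in the paper.

```python
# Minimal sketch of back-end scoring with multiple enrollment utterances.
# Baseline: mean-pool enrollments, then cosine-score against the test
# embedding; the tiny softmax "attention" variant is only illustrative.
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

enroll = rng.normal(size=(3, 192))   # 3 enrollment embeddings (toy)
test = rng.normal(size=192)          # test embedding (toy)

# Baseline: mean pooling of enrollments.
score_mean = cosine(enroll.mean(axis=0), test)

# Toy attention: weight each enrollment by its similarity to the test utterance.
sims = np.array([cosine(e, test) for e in enroll])
weights = np.exp(sims) / np.exp(sims).sum()
score_attn = cosine((weights[:, None] * enroll).sum(axis=0), test)

print(score_mean, score_attn)
```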

Variance Normalised Features for Language and Dialect Discrimination

Journal Article Circuits Systems and Signal Processing · July 1, 2021 This paper proposes novel features for automated language and dialect identification that aim to improve discriminative power by ensuring that each element of the feature vector has a normalised contribution to inter-class variance. The method firstly ...
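One plausible reading of the idea, assumed here rather than taken from the paper, is to estimate the between-class variance of each feature dimension from the class means and rescale the dimensions so those variances are equal:

```python
# Illustrative guess at "normalised contribution to inter-class variance":
# per-dimension between-class variance from class means, then rescaling.
# Not the paper's exact algorithm.
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(300, 20))       # 300 utterances, 20-dim features (toy)
y = rng.integers(0, 3, size=300)     # 3 language/dialect classes (toy)

class_means = np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])
between_var = class_means.var(axis=0) + 1e-12  # per-dimension inter-class variance

X_norm = X / np.sqrt(between_var)    # equalise each dimension's contribution
new_means = np.stack([X_norm[y == c].mean(axis=0) for c in np.unique(y)])
print(np.round(new_means.var(axis=0), 3))      # now roughly equal across dims
```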

D-MONA: A dilated mixed-order non-local attention network for speaker and language recognition.

Journal Article Neural Networks: the official journal of the International Neural Network Society · July 2021 Attention-based convolutional neural network (CNN) models are increasingly being adopted for speaker and language recognition (SR/LR) tasks. These include time, frequency, spatial and channel attention, which can focus on useful time frames, frequency ...

Adaptive margin circle loss for speaker verification

Conference Proceedings of the Annual Conference of the International Speech Communication Association Interspeech · January 1, 2021 Deep-Neural-Network (DNN) based speaker verification systems use the angular softmax loss with margin penalties to enhance the intra-class compactness of speaker embeddings, which achieved remarkable performance. In this paper, we propose a novel angular ...
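For reference, the margin-penalty angular softmax that the abstract cites as the baseline can be written compactly as an additive-angular-margin (AAM-softmax-style) loss, sketched below in PyTorch; the paper's adaptive margin circle loss itself is not reproduced.

```python
# Compact AAM-softmax-style sketch of a margin-based angular loss, the kind
# of baseline the abstract refers to. Not the paper's adaptive circle loss.
import torch
import torch.nn.functional as F

def aam_softmax_loss(embeddings, weights, labels, margin=0.2, scale=30.0):
    """embeddings: (B, D), weights: (C, D) class prototypes, labels: (B,)."""
    cos = F.normalize(embeddings) @ F.normalize(weights).t()   # (B, C) cosines
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, num_classes=weights.size(0)).bool()
    logits = torch.where(target, torch.cos(theta + margin), cos) * scale
    return F.cross_entropy(logits, labels)

emb = torch.randn(8, 192)              # toy speaker embeddings
w = torch.randn(100, 192)              # toy prototypes for 100 training speakers
labels = torch.randint(0, 100, (8,))
print(aam_softmax_loss(emb, w, labels))
```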

Cross-domain speaker recognition using domain adversarial siamese network with a domain discriminator

Journal Article Electronics Letters · July 9, 2020 With the widespread use of automatic speaker recognition in the real world, performance suffers considerably when there is a domain mismatch, including channel, language, distance, etc. Recent research studies have introduced the adversarial-learning mechanism into deep ...

A New Time–Frequency Attention Tensor Network for Language Identification

Journal Article Circuits Systems and Signal Processing · May 1, 2020 In this paper, we aim to improve traditional DNN x-vector language identification performance by employing wide residual networks (WRN) as a powerful feature extractor which we combine with a novel frequency attention network. Compared with conventional ...

Denoising Autoencoder-Based Language Feature Compensation

Journal Article Jisuanji Yanjiu Yu Fazhan (Computer Research and Development) · May 1, 2019 Language identification (LID) accuracy is often significantly reduced when the duration of the test data and the training data are mismatched. This paper proposes a method to compensate language features using a denoising autoencoder (DAE). ...

A new time-frequency attention mechanism for TDNN and CNN-LSTM-TDNN, with application to language identification

Conference Proceedings of the Annual Conference of the International Speech Communication Association Interspeech · January 1, 2019 In this paper, we aim to improve traditional DNN x-vector language identification (LID) performance by employing Convolutional and Long Short Term Memory-Recurrent (CLSTM) Neural Networks, as they can strengthen feature extraction and capture longer ...

Improved Conditional Generative Adversarial Net Classification for Spoken Language Recognition

Conference 2018 IEEE Spoken Language Technology Workshop SLT 2018 Proceedings · July 2, 2018 Recent research on generative adversarial nets (GAN) for language identification (LID) has shown promising results. In this paper, we further exploit the latent abilities of GAN networks to firstly combine them with deep neural network (DNN)-based i-vector ...

Expanding the length of short utterances for short-duration language recognition

Journal Article Qinghua Daxue Xuebao (Journal of Tsinghua University) · March 1, 2018 The language recognition (LR) accuracy is often significantly reduced when the test utterance duration is as short as 10 s or less. This paper describes a method to extend the utterance length using time-scale modification (TSM) which changes the speech ...
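Time-scale modification changes the speech rate without shifting pitch, so a short test utterance can be stretched into a longer one. A minimal sketch using librosa's phase-vocoder time stretching, assumed here as a stand-in for the TSM implementation used in the paper:

```python
# Minimal time-scale modification sketch: stretch a short signal so it lasts
# longer without changing its pitch, using librosa's time_stretch as a
# stand-in for the paper's TSM method.
import numpy as np
import librosa

sr = 16000
t = np.linspace(0, 2.0, int(2.0 * sr), endpoint=False)
y = 0.1 * np.sin(2 * np.pi * 220 * t).astype(np.float32)  # 2 s placeholder signal

y_slow = librosa.effects.time_stretch(y, rate=0.7)          # rate < 1 lengthens the signal
print(len(y) / sr, len(y_slow) / sr)                        # duration grows by roughly 1/0.7
```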