Ming Li

Associate Professor of Electrical and Computer Engineering at Duke Kunshan University
DKU Faculty

Selected Publications


StarRescue: the Design and Evaluation of A Turn-Taking Collaborative Game for Facilitating Autistic Children's Social Skills

Conference Conference on Human Factors in Computing Systems - Proceedings · May 11, 2024 Autism Spectrum Disorder (ASD) presents challenges in social interaction skill development, particularly in turn-taking. Digital interventions offer potential solutions for improving autistic children's social skills but often fail to address specific coll ... Full text Cite

Integrating frame-level boundary detection and deepfake detection for locating manipulated regions in partially spoofed audio forgery attacks

Journal Article Computer Speech and Language · April 1, 2024 Partially fake audio, a variant of deep fake that involves manipulating audio utterances through the incorporation of fake or externally-sourced bona fide audio clips, constitutes a growing threat as an audio forgery attack impacting both human and artific ... Full text Open Access Cite

Assessing the Social Skills of Children with Autism Spectrum Disorder via Language-Image Pre-training Models

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2024 Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder that has gained global attention due to its prevalence. Clinical assessment measures rely heavily on manual scoring conducted by specialized physicians. However, this approach exhibits subject ... Full text Cite

Data Augmentation by Finite Element Analysis for Enhanced Machine Anomalous Sound Detection

Conference Communications in Computer and Information Science · January 1, 2024 Current data augmentation methods for machine anomalous sound detection (MASD) suffer from insufficient data generated by real-world machines. Open datasets such as AudioSet are not tailored for machine sounds, and fake sounds created by generative models ... Full text Cite

The DKU-MSXF Diarization System for the VoxCeleb Speaker Recognition Challenge 2023

Conference Communications in Computer and Information Science · January 1, 2024 This paper describes the DKU-MSXF submission to track 4 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23). Our system pipeline contains voice activity detection, clustering-based diarization, overlapped speech detection, and target-speaker voi ... Full text Cite

Real-Time Automotive Engine Sound Simulation with Deep Neural Network

Conference Communications in Computer and Information Science · January 1, 2024 This paper introduces a real-time technique for simulating automotive engine sounds based on revolutions per minute (RPM) and pedal pressure data. We present a hybrid approach combining both sample-based and procedural methods. In the sample-based techniqu ... Full text Cite

Simultaneous Speech Extraction for Multiple Target Speakers Under Meeting Scenarios

Journal Article Journal of Shanghai Jiaotong University (Science) · January 1, 2024 Common target speech separation methods directly estimate the target source, ignoring the interrelationship between different speakers at each frame. We propose a multiple-target speech separation (MTSS) model to simultaneously extract each speaker’s voice fr ... Full text Cite

Joint Training on Multiple Datasets With Inconsistent Labeling Criteria for Facial Expression Recognition

Journal Article IEEE Transactions on Affective Computing · January 1, 2024 One potential way to enhance the performance of facial expression recognition (FER) is to augment the training set by increasing the number of samples. By incorporating multiple FER datasets, deep learning models can extract more discriminative features. H ... Full text Cite

Efficient Personal Voice Activity Detection with Wake Word Reference Speech

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2024 Personal voice activity detection (PVAD) is increasingly used in speech assistants. Traditional PVAD schemes extract the target speaker's embedding from existing query reference speech through a pre-trained speaker verification model. Consequently, the perfor ... Full text Cite

Invertible Voice Conversion with Parallel Data

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2024 This paper introduces an innovative deep learning framework for parallel voice conversion to mitigate inherent risks associated with such systems. Our approach focuses on developing an invertible model capable of countering potential spoofing threats. Spec ... Full text Cite

A Dual-Path Framework with Frequency-and-Time Excited Network for Anomalous Sound Detection

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2024 In contrast to human speech, machine-generated sounds of the same type often exhibit consistent frequency characteristics and discernible temporal periodicity. However, leveraging these dual attributes in anomaly detection remains relatively under-explored ... Full text Cite

Joint Inference of Speaker Diarization and ASR with Multi-Stage Information Sharing

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2024 In this paper, we introduce a novel approach that unifies Automatic Speech Recognition (ASR) and speaker diarization in a cohesive framework. Utilizing the synergies between the two tasks, our method effectively extracts speaker-specific information from t ... Full text Cite

Investigating Long-Term and Short-Term Time-Varying Speaker Verification

Journal Article IEEE/ACM Transactions on Audio Speech and Language Processing · January 1, 2024 The performance of speaker verification systems can be adversely affected by time domain variations. However, limited research has been conducted on time-varying speaker verification due to the absence of appropriate datasets. This paper aims to investiga ... Full text Cite

Leveraging ASR Pretrained Conformers for Speaker Verification Through Transfer Learning and Knowledge Distillation

Journal Article IEEE/ACM Transactions on Audio Speech and Language Processing · January 1, 2024 This paper focuses on the application of Conformers in speaker verification. Conformers, initially designed for Automatic Speech Recognition (ASR), excel at modeling both local and global contexts within speech signals effectively. Building on this synerg ... Full text Cite

HSVRS: A Virtual Reality System of the Hide-and-Seek Game to Enhance Gaze Fixation Ability for Autistic Children

Journal Article IEEE Transactions on Learning Technologies · January 1, 2024 Numerous children diagnosed with Autism Spectrum Disorder (ASD) exhibit abnormal eye gaze patterns in communication and social interaction. In this study, we aim to investigate the effectiveness of the Hide and Seek Virtual Reality System (HSVRS) in improvi ... Full text Cite

The WHU Wake Word Lipreading System for the 2024 Chat-Scenario Chinese Lipreading Challenge

Conference 2024 IEEE International Conference on Multimedia and Expo Workshops, ICMEW 2024 · January 1, 2024 The paper describes the Wake Word Lipreading system developed by the WHU team for the ChatCLR Challenge 2024. Although Lipreading and Wake Word Spotting have seen significant development, exploration of pretrained frontends for Wake Word Lipreading (WWL) r ... Full text Cite

Two-stage and Self-supervised Voice Conversion for Zero-Shot Dysarthric Speech Reconstruction

Conference Proceedings of 2024 International Conference on Asian Language Processing, IALP 2024 · January 1, 2024 Dysarthria is a motor speech disorder commonly associated with conditions such as cerebral palsy, Parkinson's disease, amyotrophic lateral sclerosis, and stroke. Individuals with dysarthria typically exhibit significant speech difficulties, including impre ... Full text Cite

Expressive Language Profiles in a Clinical Screening Sample of Mandarin-Speaking Preschool Children With Autism Spectrum Disorder.

Journal Article Journal of Speech, Language, and Hearing Research: JSLHR · November 2023 Purpose: This cross-sectional study aimed to depict expressive language profiles and clarify lexical-grammatical interrelationships in Mandarin-speaking preschoolers with autism spectrum disorder (ASD) during the administration of the simplified Chi ... Full text Cite

Computer-Aided Autism Spectrum Disorder Diagnosis With Behavior Signal Processing

Journal Article IEEE Transactions on Affective Computing · October 1, 2023 Behavioral observation plays an essential role in the diagnosis of Autism Spectrum Disorder (ASD) by analyzing children's atypical patterns in social activities (e.g., impaired social interaction, restricted interests, and repetitive behavior). To date, th ... Full text Cite

A Complementary Dual-Branch Network for Appearance-Based Gaze Estimation From Low-Resolution Facial Image

Journal Article IEEE Transactions on Cognitive and Developmental Systems · September 1, 2023 Estimating gaze from a low-resolution (LR) facial image is a challenging task. Most current networks for gaze estimation focus on using face images of adequate resolution. Their performance degrades when the image resolution decreases due to information lo ... Full text Cite

Typical Facial Expression Network Using a Facial Feature Decoupler and Spatial-Temporal Learning

Journal Article IEEE Transactions on Affective Computing · April 1, 2023 Facial expression recognition (FER) accuracy is often affected by an individual's unique facial characteristics. Recognition performance can be improved if the influence from these physical characteristics is minimized. Using video instead of single image ... Full text Cite

Electrolaryngeal speech enhancement based on a two stage framework with bottleneck feature refinement and voice conversion

Journal Article Biomedical Signal Processing and Control · February 1, 2023 An electrolarynx (EL) is a medical device that generates speech for people who lost their biological larynx. However, EL speech signals are unnatural and unintelligible due to the monotonous pitch and the mechanical excitation of the EL device. This paper ... Full text Cite

Accurate Head Pose Estimation Using Image Rectification and a Lightweight Convolutional Neural Network

Journal Article IEEE Transactions on Multimedia · January 1, 2023 Head pose estimation is an important step for many human-computer interaction applications such as face detection, facial recognition, and facial expression classification. Accurate head pose estimation benefits these applications that require face images ... Full text Cite

Cross-lingual multi-speaker speech synthesis with limited bilingual training data

Journal Article Computer Speech and Language · January 1, 2023 Modeling voices for multiple speakers and multiple languages with one speech synthesis system has been a challenge for a long time, especially in low-resource cases. This paper presents two approaches to achieve cross-lingual multi-speaker text-to-speech ( ... Full text Cite

Robust Multi-Channel Far-Field Speaker Verification under Different In-Domain Data Availability Scenarios

Journal Article IEEE/ACM Transactions on Audio Speech and Language Processing · January 1, 2023 The popularity and application of smart home devices have made far-field speaker verification an urgent need. However, speaker verification performance is unsatisfactory under far-field environments despite its significant improvements enabled by deep neur ... Full text Cite

A multimodal machine learning system in early screening for toddlers with autism spectrum disorders based on the response to name.

Journal Article Frontiers in Psychiatry · January 2023 Background: A reduced or absent response to name (RTN) has been widely reported as an early specific indicator for autism spectrum disorder (ASD), while few studies have quantified the RTN of toddlers with ASD in an automatic way. The present ... Full text Cite

STCAM: Spatial-Temporal and Channel Attention Module for Dynamic Facial Expression Recognition

Journal Article IEEE Transactions on Affective Computing · January 1, 2023 Capturing the dynamics of facial expression progression in video is an essential and challenging task for facial expression recognition (FER). In this article, we propose an effective framework to address this challenge. We develop a C3D-based network arch ... Full text Cite

Detecting Escalation Level from Speech with Transfer Learning and Acoustic-Linguistic Information Fusion

Conference Communications in Computer and Information Science · January 1, 2023 Textual escalation detection has been widely applied to e-commerce companies’ customer service systems to pre-alert and prevent potential conflicts. Similarly, acoustic-based escalation detection systems are also helpful in enhancing passengers’ safety and ... Full text Cite

VC-AUG: Voice Conversion Based Data Augmentation for Text-Dependent Speaker Verification

Conference Communications in Computer and Information Science · January 1, 2023 In this paper, we focus on improving the performance of the text-dependent speaker verification system in the scenario of limited training data. The deep learning based text-dependent speaker verification system generally needs a large-scale text-dependent ... Full text Cite

Robust audio anti-spoofing countermeasure with joint training of front-end and back-end models

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2023 The accuracy and reliability of many speech processing systems may deteriorate under noisy conditions. This paper discusses robust audio anti-spoofing countermeasure for audio in noisy environments. Firstly, we attempt to use a pre-trained speech enhanceme ... Full text Cite

SEF-Net: Speaker Embedding Free Target Speaker Extraction Network

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2023 Most target speaker extraction methods use the target speaker embedding as reference information. However, the speaker embedding extracted by a speaker recognition module may not be optimal for the target speaker extraction tasks. In this paper, we propose ... Full text Cite

Outlier-aware Inlier Modeling and Multi-scale Scoring for Anomalous Sound Detection via Multitask Learning

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2023 This paper proposes an approach for anomalous sound detection that incorporates outlier exposure and inlier modeling within a unified framework by multitask learning. While outlier exposure-based methods can extract features efficiently, they are not robust. ... Full text Cite

The WHU-Alibaba Audio-Visual Speaker Diarization System for the MISP 2022 Challenge

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2023 This paper describes the system developed by the WHU-Alibaba team for the Multimodal Information Based Speech Processing (MISP) 2022 Challenge. We extend the Sequence-to-Sequence Target-Speaker Voice Activity Detection framework to simultaneously detect mu ... Full text Cite

Target-Speaker Voice Activity Detection Via Sequence-to-Sequence Prediction

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2023 Target-speaker voice activity detection is currently a promising approach for speaker diarization in complex acoustic environments. This paper presents a novel Sequence-to-Sequence Target-Speaker Voice Activity Detection (Seq2Seq-TSVAD) method that can eff ... Full text Cite

Identifying Source Speakers for Voice Conversion Based Spoofing Attacks on Speaker Verification Systems

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2023 An automatic speaker verification system aims to verify the speaker identity of a speech signal. However, a voice conversion system could manipulate a person's speech signal to make it sound like another speaker's voice and deceive the speaker verification ... Full text Cite

Pretraining Conformer with ASR for Speaker Verification

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2023 This paper proposes to pretrain Conformer with automatic speech recognition (ASR) task for speaker verification. Conformer combines convolution neural network (CNN) and Transformer model for modeling local and global features, respectively. Recently, multi ... Full text Cite

Exploring Universal Singing Speech Language Identification Using Self-Supervised Learning Based Front-End Features

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2023 Despite the great performance of language identification (LID), there is a lack of large-scale singing LID databases to support the research of singing language identification (SLID). This paper presents an over-3200-hour dataset for singing language ... Full text Cite

Waveform Boundary Detection for Partially Spoofed Audio

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2023 The present paper proposes a waveform boundary detection system for audio spoofing attacks containing partially manipulated segments. Partially spoofed/fake audio, where part of the utterance is replaced, either with synthetic or natural audio clips, has r ... Full text Cite

The DKU Post-Challenge Audio-Visual Wake Word Spotting System for the 2021 MISP Challenge: Deep Analysis

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2023 This paper further explores our previous wake word spotting system ranked 2nd in Track 1 of the MISP Challenge 2021. First, we investigate a robust unimodal approach based on 3D and 2D convolution and adopt the simple attention module (SimAM) for our syst ... Full text Cite

Low-complexity Multi-Channel Speaker Extraction with Pure Speech Cues

Conference 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023 · January 1, 2023 Most multi-channel speaker extraction schemes use the target speaker's location information as a reference, which must be known in advance or derived from visual cues. In addition, memory and computation costs are enormous when the model deals with the fus ... Full text Cite

From Speaker Verification to Deepfake Algorithm Recognition: Our Learned Lessons from ADD2023 Track3

Conference CEUR Workshop Proceedings · January 1, 2023 This paper presents our learned lessons from the ADD2023 track3, Deepfake Algorithm Recognition (AR). In recent years, speech synthesis has made remarkable progress, where it has become increasingly difficult for human listeners to differentiate between sy ... Cite

BiSinger: Bilingual Singing Voice Synthesis

Conference 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023 · January 1, 2023 Although Singing Voice Synthesis (SVS) has made great strides with Text-to-Speech (TTS) techniques, multilingual singing voice modeling remains relatively unexplored. This paper presents BiSinger, a bilingual pop SVS system for English and Chinese Mandarin ... Full text Cite

Addressing Sparse Annotation: a Novel Semantic Energy Loss for Tumor Cell Detection from Histopathologic Images

Conference Proceedings - 2023 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2023 · January 1, 2023 Tumor cell detection plays a vital role in immunohistochemistry (IHC) quantitative analysis. While recent remarkable developments in fully-supervised deep learning have greatly contributed to the efficiency of this task, the necessity for manually annotati ... Full text Cite

Haha-POD: An Attempt for Laughter-Based Non-Verbal Speaker Verification

Conference 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023 · January 1, 2023 It is widely acknowledged that discriminative representation for speaker verification can be extracted from verbal speech. However, how much speaker information that non-verbal vocalization carries is still a puzzle. This paper explores speaker verificatio ... Full text Cite

AlignDet: Aligning Pre-training and Fine-tuning in Object Detection

Conference Proceedings of the IEEE International Conference on Computer Vision · January 1, 2023 The paradigm of large-scale pre-training followed by downstream fine-tuning has been widely employed in various object detection algorithms. In this paper, we reveal discrepancies in data, model, and task between the pre-training and fine-tuning procedure ... Full text Cite

A Hierarchical Vision Transformer Using Overlapping Patch and Self-Supervised Learning

Conference Proceedings of the International Joint Conference on Neural Networks · January 1, 2023 Transformer-based network architectures have gradually replaced convolutional neural networks in computer vision. Compared with convolutional neural networks, Transformer is able to learn global information of images and has better feature extraction capab ... Full text Cite

Improving Spoofing Capability for End-to-end Any-to-many Voice Conversion

Conference DDAM 2022 - Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia · October 14, 2022 Audio deep synthesis techniques have been able to generate high-quality speech whose authenticity is difficult for humans to recognize. Meanwhile, many anti-spoofing systems have been developed to capture artifacts in the synthesized speech that are imperce ... Full text Cite

Deepfake Detection System for the ADD Challenge Track 3.2 Based on Score Fusion

Conference DDAM 2022 - Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia · October 14, 2022 This paper describes the deepfake audio detection system submitted to the Audio Deep Synthesis Detection (ADD) Challenge Track 3.2 and gives an analysis of score fusion. The proposed system is a score-level fusion of several light convolutional neural netw ... Full text Cite

Low-Latency Online Speaker Diarization with Graph-Based Label Generation

Conference The Speaker and Language Recognition Workshop (Odyssey 2022) · June 28, 2022

Single-Channel Target Speaker Separation Using Joint Training with Target Speaker's Pitch Information

Conference The Speaker and Language Recognition Workshop (Odyssey 2022) · June 28, 2022

Generating TTS Based Adversarial Samples for Training Wake-Up Word Detection Systems Against Confusing Words

Conference The Speaker and Language Recognition Workshop (Odyssey 2022) · June 28, 2022

Paralinguistic singing attribute recognition using supervised machine learning for describing the classical tenor solo singing voice in vocal pedagogy.

Journal Article EURASIP journal on audio, speech, and music processing · January 2022 Humans can recognize someone's identity through their voice and describe the timbral phenomena of voices. Likewise, the singing voice also has timbral phenomena. In vocal pedagogy, vocal teachers listen and then describe the timbral phenomena of their stud ...

Incorporating Visual Information in Audio Based Self-Supervised Speaker Recognition

Journal Article IEEE/ACM Transactions on Audio Speech and Language Processing · January 1, 2022 The current success of deep learning largely benefits from the availability of a large amount of labeled data. However, collecting a large-scale dataset with human annotation can be expensive and sometimes difficult. Self-supervised learning thus attracts man ...

THE DKU AUDIO-VISUAL WAKE WORD SPOTTING SYSTEM FOR THE 2021 MISP CHALLENGE

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2022 This paper describes the system developed by the DKU team for the MISP Challenge 2021. We present a two-stage approach consisting of end-to-end neural networks for the audio-visual wake word spotting task. We first process audio and video data to give them ...

TOWARDS LIGHTWEIGHT APPLICATIONS: ASYMMETRIC ENROLL-VERIFY STRUCTURE FOR SPEAKER VERIFICATION

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2022 With the development of deep learning, automatic speaker verification has made considerable progress over the past few years. However, designing a lightweight and robust system with limited computational resources is still a challenging problem. Traditiona ...

SIG-VC: A SPEAKER INFORMATION GUIDED ZERO-SHOT VOICE CONVERSION SYSTEM FOR BOTH HUMAN BEINGS AND MACHINES

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2022 Nowadays, as more and more systems achieve good performance in traditional voice conversion (VC) tasks, people's attention gradually turns to VC tasks under extreme conditions. In this paper, we propose a novel method for zero-shot voice conversion. We aim ...

CROSS-CHANNEL ATTENTION-BASED TARGET SPEAKER VOICE ACTIVITY DETECTION: EXPERIMENTAL RESULTS FOR THE M2MET CHALLENGE

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2022 In this paper, we present the speaker diarization system for the Multichannel Multi-party Meeting Transcription Challenge (M2MeT) from team DKU-DukeECE. As highly overlapped speech exists in the dataset, we employ an x-vector-based target-speaker voice ...

SIMPLE ATTENTION MODULE BASED SPEAKER VERIFICATION WITH ITERATIVE NOISY LABEL DETECTION

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2022 Recently, attention mechanisms such as the squeeze-and-excitation (SE) module and the convolutional block attention module (CBAM) have achieved great success in deep learning-based speaker verification systems. This paper introduces an alternative effective yet s ...

INCORPORATING END-TO-END FRAMEWORK INTO TARGET-SPEAKER VOICE ACTIVITY DETECTION

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2022 In this paper, we propose an end-to-end target-speaker voice activity detection (E2E-TS-VAD) method for speaker diarization. First, a ResNet-based network extracts the frame-level speaker embeddings from the acoustic features. Then, the L2-normalized frame ...

Similarity Measurement of Segment-Level Speaker Embeddings in Speaker Diarization

Journal Article IEEE/ACM Transactions on Audio Speech and Language Processing · January 1, 2022 In this paper, we propose a neural-network-based similarity measurement method to learn the similarity between any two speaker embeddings, where both previous and future contexts are considered. Moreover, we propose the segmental pooling strategy and joint ...

Cross-Age Speaker Verification: Learning Age-Invariant Speaker Embeddings

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2022 Automatic speaker verification has achieved remarkable progress in recent years. However, there is little research on cross-age speaker verification (CASV) due to insufficient relevant data. In this paper, we mine cross-age test sets based on the VoxCeleb ...

The DKU-OPPO System for the 2022 Spoofing-Aware Speaker Verification Challenge

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2022 This paper describes our DKU-OPPO system for the 2022 Spoofing-Aware Speaker Verification (SASV) Challenge. First, we split the joint task into speaker verification (SV) and spoofing countermeasure (CM), two tasks which are optimized separately. For ...

Online Target Speaker Voice Activity Detection for Speaker Diarization

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2022 This paper proposes an online target speaker voice activity detection system for speaker diarization tasks, which does not require a priori knowledge from the clustering-based diarization system to obtain the target speaker embeddings. First, we employ a R ...

A Multimodal Framework for Automated Teaching Quality Assessment of One-to-many Online Instruction Videos

Conference Proceedings - International Conference on Pattern Recognition · January 1, 2022 In the post-pandemic era, online courses have been adopted universally. Manually assessing online course teaching quality requires significant time and professional pedagogy experience. To address this problem, we design an evaluation protocol and propose ...

Source Tracing: Detecting Voice Spoofing

Conference Proceedings of 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2022 · January 1, 2022 Recent anti-spoofing systems focus on spoofing detection, where the task is only to determine whether the test audio is fake. However, few studies have paid attention to identifying the methods used to generate fake speech. Common spoofing attack algo ...

Low Pass Filtering and Bandwidth Extension for Robust Anti-spoofing Countermeasure Against Codec Variabilities

Conference 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022 · January 1, 2022 A reliable voice anti-spoofing countermeasure system needs to robustly protect automatic speaker verification (ASV) systems in various kinds of spoofing scenarios. However, the performance of countermeasure systems could be degraded by channel effects and ...

Social Transformer: A Pedestrian Trajectory Prediction Method based on Social Feature Processing Using Transformer

Conference Proceedings of the International Joint Conference on Neural Networks · January 1, 2022 In pedestrian trajectory prediction, the prediction accuracy depends largely on the consideration of the impact of social relations on the prediction object. Social pooling and graph neural networks (GNN) are two traditional social feature processing metho ...

Cross-modal Assisted Training for Abnormal Event Recognition in Elevators

Conference ICMI 2021 - Proceedings of the 2021 International Conference on Multimodal Interaction · October 18, 2021 Given that very few action recognition datasets collected in elevators contain multimodal data, we collect and propose our multimodal dataset investigating passenger safety and inappropriate elevator usage. Moreover, we present a novel framework (RGBP) to ...

A Multimodal Dynamic Neural Network for Call for Help Recognition in Elevators

Conference ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction · October 18, 2021 As elevator accidents do great damage to people's lives and property, taking immediate responses to emergent calls for help is necessary. In most emergency cases, passengers must use the "SOS" button to contact the remote safety guard. However, this method ...

Call for Help Detection in Emergent Situations Using Keyword Spotting and Paralinguistic Analysis

Conference ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction · October 18, 2021 Nowadays, the safety of passengers within the enclosed public space, such as the elevator, becomes more and more important. Though the passengers can click the "SOS" button to call the remote safety guard, the chances are that some passengers might lose the ...

The DKU-CMRI System for the ASVspoof 2021 Challenge: Vocoder based Replay Channel Response Estimation

Conference 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge · September 16, 2021

Facial Expression Recognition with Identity and Emotion Joint Learning

Journal Article IEEE Transactions on Affective Computing · April 1, 2021 Different subjects may express a specific expression in different ways due to inter-subject variabilities. In this work, besides training deep-learned facial expression feature (emotional feature), we also consider the influence of latent face identity fea ...

Acoustic Word Embedding System for Code-Switching Query-by-example Spoken Term Detection

Conference 2021 12th International Symposium on Chinese Spoken Language Processing, ISCSLP 2021 · January 24, 2021 In this paper, we propose a deep convolutional neural network-based acoustic word embedding system for code-switching query by example spoken term detection. Different from previous configurations, we combine audio data in two languages for training instea ...

Sams-Net: A Sliced Attention-based Neural Network for Music Source Separation

Conference 2021 12th International Symposium on Chinese Spoken Language Processing, ISCSLP 2021 · January 24, 2021 Convolutional Neural Network (CNN) or Long Short-term Memory (LSTM) based models with the input of spectrogram or waveforms are commonly used for deep learning based audio source separation. In this paper, we propose a Sliced Attention-based neural network ...

Embedding Aggregation for Far-Field Speaker Verification with Distributed Microphone Arrays

Conference 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings · January 19, 2021 With the successful application of deep speaker embedding networks, the performance of speaker verification systems has significantly improved under clean and close-talking settings; however, unsatisfactory performance persists under noisy and far-field en ...

Audio-Based Piano Performance Evaluation for Beginners with Convolutional Neural Network and Attention Mechanism

Journal Article IEEE/ACM Transactions on Audio Speech and Language Processing · January 1, 2021 In this paper, we propose two different audio-based piano performance evaluation systems for beginners. The first is a sequential and modularized system, including three steps: Convolutional Neural Network (CNN)-based acoustic feature extraction, matching ...

An iterative framework for self-supervised deep speaker representation learning

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2021 In this paper, we propose an iterative framework for self-supervised speaker representation learning based on a deep neural network (DNN). The framework starts with training a self-supervision speaker embedding network by maximizing agreement between diffe ...

Our learned lessons from cross-lingual speaker verification: The CRMI-DKU System description for the short-duration speaker verification challenge 2021

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2021 In this paper, we present our CRMI-DKU system description for the Short-duration Speaker Verification Challenge (SdSVC) 2021. We introduce the whole pipeline of our cross-lingual speaker verification system, including data preprocessing, training strategy, ...

AISHELL-3: A multi-speaker Mandarin TTS corpus

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2021 In this paper, we present AISHELL-3, a large-scale multi-speaker Mandarin speech corpus which could be used to train multi-speaker Text-To-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spanning across 218 native ...

The DKU-Duke-Lenovo system description for the fearless steps challenge phase III

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2021 This paper describes the systems developed by the DKU-Duke-Lenovo team for the Fearless Steps Challenge Phase III. For the speech activity detection (SAD) task, we employ the U-Net-based model which has not been used for SAD before, observing a DCF of 1.91 ...

Discriminative Dictionary Learning for Autism Spectrum Disorder Identification.

Journal Article Frontiers in computational neuroscience · January 2021 Autism Spectrum Disorder (ASD) is a group of lifelong neurodevelopmental disorders with complicated causes. A key symptom of ASD patients is their impaired interpersonal communication ability. Recent studies show that face scanning patterns of individuals w ...

The 2020 personalized voice trigger challenge: Open datasets, evaluation metrics, baseline system and results

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2021 The 2020 Personalized Voice Trigger Challenge (PVTC2020) addresses two different research problems in a unified setup: joint wake-up word detection with speaker verification on close-talking single microphone data and far-field multi-channel microphone arra ...

Binary Neural Network for Speaker Verification

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2021 Although deep neural networks are successful for many tasks in the speech domain, the high computational and memory costs of deep neural networks make it difficult to directly deploy high-performance neural network systems on low-resource embedded devices. ...

A Unified Deep Speaker Embedding Framework for Mixed-Bandwidth Speech Data

Conference 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021 - Proceedings · January 1, 2021 This paper proposes a unified deep speaker embedding framework for modeling speech data with different sampling rates. Considering the narrowband spectrogram as a sub-image of the wideband spectrogram, we tackle the joint modeling problem of the mixed-ban ...

End-to-End Mandarin Tone Classification with Short Term Context Information

Conference 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021 - Proceedings · January 1, 2021 In this paper, we propose an end-to-end Mandarin tone classification method from continuous speech utterances utilizing both the spectrogram and the short-term context information as the input. Both spectrograms and context segment features are used to train the ...

Object Removal for Testing Object Detection in Autonomous Vehicle Systems

Conference Proceedings - 2021 21st International Conference on Software Quality, Reliability and Security Companion, QRS-C 2021 · January 1, 2021 An object detection system is a critical part of autonomous vehicle systems. To ensure the safety and efficiency of autonomous vehicles, object detection is required to satisfy high sensitivity and accuracy. However, the state-of-the-art object detection s ...

Graph Partition Convolution Neural Network for Pedestrian Trajectory Prediction

Conference Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI · January 1, 2021 In autonomous driving, modeling interaction has always been at the core of trajectory prediction. Designing a model to better capture the associated interactive information to improve the prediction accuracy is the key to the safety of autonomous driving. In ...

DIHARD II is Still Hard: Experimental Results and Discussions from the DKU-LENOVO Team

Conference The Speaker and Language Recognition Workshop (Odyssey 2020) · November 1, 2020

Within-Sample Variability-Invariant Loss for Robust Speaker Recognition under Noisy Environments

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · May 1, 2020 Despite the significant improvements in speaker recognition enabled by deep neural networks, unsatisfactory performance persists under noisy environments. In this paper, we train the speaker embedding network to learn the "clean" embedding of the noisy utt ...

HI-MIA: A Far-Field Text-Dependent Speaker Verification Database and the Baselines

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · May 1, 2020 This paper presents a far-field text-dependent speaker verification database named HI-MIA. We aim to meet the data requirement for far-field microphone array based speaker verification since most of the publicly available databases are single channel close ...

On-the-Fly Data Loader and Utterance-Level Aggregation for Speaker and Language Recognition

Journal Article IEEE/ACM Transactions on Audio Speech and Language Processing · January 1, 2020 In this article, our recent efforts on directly modeling utterance-level aggregation for speaker and language recognition are summarized. First, an on-the-fly data loader for efficient network training is proposed. The data loader acts as a bridge between t ...

Atss-Net: Target speaker separation via attention-based neural network

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2020 Recently, Convolutional Neural Network (CNN) and Long short-term memory (LSTM) based models have been introduced to deep learning-based target speaker separation. In this paper, we propose an Attention-based neural network (Atss-Net) in the spectrogram dom ...

Domain aware training for far-field small-footprint keyword spotting

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2020 In this paper, we focus on the task of small-footprint keyword spotting under the far-field scenario. Far-field environments are commonly encountered in real-life speech applications, causing severe degradation of performance due to room reverberation and ...

The INTERSPEECH 2020 far-field speaker verification challenge

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2020 The INTERSPEECH 2020 Far-Field Speaker Verification Challenge (FFSVC 2020) addresses three different research problems under well-defined conditions: far-field text-dependent speaker verification from single microphone array, far-field text-independent spe ...

The DKU speech activity detection and speaker identification systems for fearless steps challenge phase-02

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2020 This paper describes the systems developed by the DKU team for the Fearless Steps Challenge Phase-02 competition. For the Speech Activity Detection task, we start with the Long Short-Term Memory (LSTM) system and then apply the ResNet-LSTM improvement. Our ...

Self-attentive similarity measurement strategies in speaker diarization

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2020 Speaker diarization can be described as the process of extracting sequential speaker embeddings from an audio stream and clustering them according to speaker identities. Nowadays, deep neural network based approaches like x-vector have been widely adopted ...

From speaker verification to multispeaker speech synthesis, deep transfer with feedback constraint

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2020 In recent years, end-to-end text-to-speech models have become able to synthesize high-fidelity speech. However, accessing and controlling speech attributes such as speaker identity, prosody, and emotion in a text-to-speech system remains a challenge. This paper pres ...

RWF-2000: An open large scale video database for violence detection

Conference Proceedings - International Conference on Pattern Recognition · January 1, 2020 In recent years, surveillance cameras have been widely deployed in public places, and the general crime rate has been reduced significantly due to these ubiquitous devices. Usually, these cameras provide cues and evidence after crimes are conducted, while they a ...

Responsive social smile: A machine learning based multimodal behavior assessment framework towards early stage autism screening

Conference Proceedings - International Conference on Pattern Recognition · January 1, 2020 Autism spectrum disorder (ASD) is a neurodevelopmental disorder, which causes deficits in social lives. Early screening of ASD for young children is important to reduce the impact of ASD on people's lives. Traditional screening methods mainly rely on proto ...

DKU-Tencent submission to oriental language recognition AP18-OLR challenge

Conference 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019 · November 1, 2019 In this paper, we describe our submitted DKU-Tencent system for the oriental language recognition AP18-OLR Challenge. Our system pipeline consists of three main components, including data augmentation, frame-level feature extraction, and utterance-level ...

Deep neural networks with batch speaker normalization for intoxicated speech detection

Conference 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019 · November 1, 2019 Alcohol intoxication can affect people both physically and psychologically, and one's speech will also become different. However, detecting the intoxicated state from the speech is a challenging task. In this paper, we first implement the baseline model wi ...

Facial Expression Recognition with Identity and Spatial-temporal Integrated Learning

Conference 2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos, ACIIW 2019 · September 1, 2019 Spatial-temporal structure of expression frames plays a critical role in the task of video based facial expression recognition (FER). In this paper, we propose a 3D CNN based framework to learn the spatial-temporal structure from expression frames for vide ...

An automated assessment framework for atypical prosody and stereotyped idiosyncratic phrases related to autism spectrum disorder

Journal Article Computer Speech and Language · July 1, 2019 Autism Spectrum Disorder (ASD), a neurodevelopmental disability, has become one of the high incidence diseases among children. Studies indicate that early diagnosis and intervention treatments help to achieve positive longitudinal outcomes. In this paper, ...

F0 Contour Estimation Using Phonetic Feature in Electrolaryngeal Speech Enhancement

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · May 1, 2019 Pitch plays a significant role in understanding a tone based language like Mandarin. In this paper, we present a new method that estimates F0 contour for electrolaryngeal (EL) speech enhancement in Mandarin. Our system explores the usage of phonetic featur ...

Utterance-level End-to-end Language Identification Using Attention-based CNN-BLSTM

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · May 1, 2019 In this paper, we present an end-to-end language identification framework, the attention-based Convolutional Neural Network-Bidirectional Long-short Term Memory (CNN-BLSTM). The model is performed on the utterance level, which means the utterance-level dec ...

String Stability Analysis for Vehicle Platooning Under Unreliable Communication Links With Event-Triggered Strategy

Journal Article IEEE Transactions on Vehicular Technology · March 1, 2019 Vehicle platooning systems are often equipped with vehicle-to-vehicle (V2V) communication technologies to improve both the road efficiency and road safety by exchanging vehicle information over wireless networks to maintain relatively small inter-vehicle d ...

Fixation Based Object Recognition in Autism Clinic Setting

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2019 With the increasing popularity of portable eye tracking devices, one can conveniently use them to find fixation points, i.e., the location and region one is attracted by and looking at. However, region of interest alone is not enough to fully support furth ...

Polyphone disambiguation for Mandarin Chinese using conditional neural network with multi-level embedding features

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2019 This paper describes a conditional neural network architecture for Mandarin Chinese polyphone disambiguation. The system is composed of a bidirectional recurrent neural network component acting as a sentence encoder to accumulate the context correlations, ...

The DKU-Lenovo systems for the INTERSPEECH 2019 computational paralinguistic challenge

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2019 This paper introduces our approaches for the orca activity and continuous sleepiness tasks in the Interspeech ComParE Challenge 2019. For the orca activity detection task, we extract deep embeddings using several deep convolutional neural networks, followe ...

The DKU replay detection system for the ASVspoof 2019 challenge: On data augmentation, feature representation, classification, and fusion

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2019 This paper describes our DKU replay detection system for the ASVspoof 2019 challenge. The goal is to develop a spoofing countermeasure for automatic speaker recognition in the physical access scenario. We leverage the countermeasure system pipeline from four asp ...

Multi-channel training for end-to-end speaker recognition under reverberant and noisy environment

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2019 Despite the significant improvements in speaker recognition enabled by deep neural networks, unsatisfactory performance persists under far-field scenarios due to the effects of the long range fading, room reverberation, and environmental noises. In this st ...

Far-field end-to-end text-dependent speaker verification based on mixed training data with transfer learning and enrollment data augmentation

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2019 In this paper, we focus on the far-field end-to-end text-dependent speaker verification task with a small-scale far-field text-dependent dataset and a large-scale close-talking text-independent database for training. First, we show that simulating far-fiel ...

The DKU system for the speaker recognition task of the 2019 voices from a distance challenge

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2019 In this paper, we present the DKU system for the speaker recognition task of the VOiCES from a distance challenge 2019. We investigate the whole system pipeline for the far-field speaker verification, including data pre-processing, short-term spectral feat ...

LSTM based similarity measurement with spectral clustering for speaker diarization

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2019 More and more neural network approaches have achieved considerable improvement upon submodules of the speaker diarization system, including speaker change detection and segment-wise speaker embedding extraction. Still, in the clustering stage, traditional algo ...

The DKU-SMIIP system for NIST 2018 speaker recognition evaluation

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2019 In this paper, we present the system submission for the NIST 2018 Speaker Recognition Evaluation by DKU Speech and Multi-Modal Intelligent Information Processing (SMIIP) Lab. We explore various kinds of state-of-the-art front-end extractors as well as back ...

Insights into End-to-End Learning Scheme for Language Identification

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · September 10, 2018 A novel interpretable end-to-end learning scheme for language identification is proposed. It is in line with the classical GMM i-vector methods both theoretically and practically. In the end-to-end pipeline, a general encoding layer is employed on top of t ...

A Novel Learnable Dictionary Encoding Layer for End-to-End Language Identification

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · September 10, 2018 A novel learnable dictionary encoding layer is proposed in this paper for end-to-end language identification. It is in line with the conventional GMM i-vector approach both theoretically and practically. We imitate the mechanism of traditional GMM training ...

Deep Speaker Embeddings with Convolutional Neural Network on Supervector for Text-Independent Speaker Recognition

Conference 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2018 - Proceedings · July 2, 2018 Lexical content variability in different utterances is the key challenge for text-independent speaker verification. In this paper, we investigate using the supervector, which has the ability to reduce the impact of lexical content mismatch among different utterances ... Full text Cite

Unsupervised query by example spoken term detection using features concatenated with Self-Organizing Map distances

Conference 2018 11th International Symposium on Chinese Spoken Language Processing, ISCSLP 2018 - Proceedings · July 2, 2018 In the task of the unsupervised query by example spoken term detection (QbE-STD), we concatenate the features extracted by a Self-Organizing Map (SOM) and features learned by an unsupervised GMM based model at the feature level to enhance the performance. ... Full text Cite

End-to-end language identification using NetFV and NetVLAD

Conference 2018 11th International Symposium on Chinese Spoken Language Processing, ISCSLP 2018 - Proceedings · July 2, 2018 In this paper, we apply the NetFV and NetVLAD layers for the end-to-end language identification task. NetFV and NetVLAD layers are the differentiable implementations of the standard Fisher Vector and Vector of Locally Aggregated Descriptors (VLAD) methods, ... Full text Cite
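The NetVLAD layer referenced here aggregates variable-length frame features into a fixed-length descriptor by soft-assigning frames to learnable cluster centers and accumulating residuals. A NumPy sketch of that computation (parameter names are illustrative, not from the paper):

```python
import numpy as np

def netvlad(features, centers, W, b):
    """Minimal NetVLAD aggregation (illustrative, not the paper's code).

    features: (T, D) frame-level features
    centers:  (K, D) learnable cluster centers
    W, b:     (D, K) and (K,) soft-assignment parameters
    Returns a fixed-length (K*D,) utterance-level descriptor.
    """
    logits = features @ W + b                          # (T, K)
    logits -= logits.max(axis=1, keepdims=True)        # numerically stable softmax
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)                  # soft assignments, rows sum to 1
    # accumulate soft-weighted residuals of each frame to each center
    V = np.einsum('tk,tkd->kd', a,
                  features[:, None, :] - centers[None, :, :])
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12   # intra-normalization
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)             # final L2 normalization

rng = np.random.default_rng(0)
T, D, K = 50, 8, 4
v = netvlad(rng.normal(size=(T, D)), rng.normal(size=(K, D)),
            rng.normal(size=(D, K)), rng.normal(size=K))
print(v.shape)  # (32,) — fixed length regardless of T
```

In the end-to-end system these parameters are trained jointly with the frame-level network; NetFV replaces the residual accumulation with Fisher-vector statistics.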

The DKU-JNU-EMA electromagnetic articulography database on Mandarin and Chinese dialects with tandem feature based acoustic-to-articulatory inversion

Conference 2018 11th International Symposium on Chinese Spoken Language Processing, ISCSLP 2018 - Proceedings · July 2, 2018 This paper presents the acquisition of the Duke Kunshan University Jinan University Electromagnetic Articulography (DKU-JNU-EMA) database in terms of aligned acoustics and articulatory data on Mandarin and Chinese dialects. This database currently includes ... Full text Cite

An efficient audio based performance evaluation system for computer assisted piano learning

Conference ICNC-FSKD 2017 - 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery · June 21, 2018 In this paper, we propose an audio based piano performance evaluation system for piano learning, aiming at giving objective feedbacks to the piano beginners so that their self-practicing could be more efficient. We target to build a system which could eval ... Full text Cite

Cancellable speech template via random binary orthogonal matrices projection hashing

Journal Article Pattern Recognition · April 1, 2018 The rapid advancement of mobile technology has made mobile devices (e.g. iPhone, iPad) explosively popular. A large number of mobile devices provide great convenience and cost effectiveness for speaker recognition based applications. However, the c ... Full text Cite

Finite-time stability and stabilization of semi-Markovian jump systems with time delay

Journal Article International Journal of Robust and Nonlinear Control · April 1, 2018 Semi-Markovian jump systems are more general than Markovian jump systems in modeling practical systems. On the other hand, the finite-time stochastic stability is also more effective than stochastic stability in practical systems. This paper focuses on the ... Full text Cite

Construction and improvements of bird songs' classification system

Conference CEUR Workshop Proceedings · January 1, 2018 Detection of bird species with bird songs is a challenging and meaningful task. Two scenarios are presented in the BirdCLEF challenge this year, monophone and soundscape. We trained a convolutional neural network with both spectrograms extracted from r ... Cite

Analysis of length normalization in end-to-end speaker verification system

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2018 The classical i-vectors and the latest end-to-end deep speaker embeddings are the two representative categories of utterance-level representations in automatic speaker verification systems. Traditionally, once i-vectors or deep speaker embeddings are extra ... Full text Cite
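The length normalization analyzed in this paper is simply a projection of the utterance-level representation onto the unit sphere, after which cosine scoring reduces to an inner product. A one-function sketch:

```python
import numpy as np

def length_normalize(x):
    """Project an i-vector or deep speaker embedding onto the unit sphere.

    Standard post-processing before PLDA or cosine scoring; after it,
    cosine similarity between embeddings is just their inner product.
    """
    return x / (np.linalg.norm(x) + 1e-12)

a = length_normalize(np.array([3.0, 4.0]))
b = length_normalize(np.array([6.0, 8.0]))
print(np.dot(a, b))  # ~1.0: same direction, different original lengths
```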

An end-to-end deep learning framework with speech emotion recognition of atypical individuals

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2018 The goal of the ongoing ComParE 2018 Atypical Affect sub-challenge is to recognize the emotional states of atypical individuals. In this work, we present three modeling methods under the end-to-end learning framework, namely CNN combined with extended feat ... Full text Cite

Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System

Conference Speaker and Language Recognition Workshop, ODYSSEY 2018 · January 1, 2018 In this paper, we explore the encoding/pooling layer and loss function in the end-to-end speaker and language recognition system. First, a unified and interpretable end-to-end system for both speaker and language recognition is developed. It accepts variab ... Full text Cite

SphereFace: Deep hypersphere embedding for face recognition

Conference Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 · November 6, 2017 This paper addresses deep face recognition (FR) problem under open-set protocol, where ideal face features are expected to have smaller maximal intra-class distance than minimal inter-class distance under a suitably chosen metric space. However, few existi ... Full text Cite

Mandarin electrolaryngeal voice conversion with combination of Gaussian mixture model and non-negative matrix factorization

Conference Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017 · July 2, 2017 Electrolarynx (EL) is a speaking-aid device that helps laryngectomees who have their larynx removed to generate voice. However, the voice generated by EL is unnatural and unintelligible due to its flat pitch and strong vibration noise. Targeting these chal ... Full text Cite

Automatic emotional spoken language text corpus construction from written dialogs in fictions

Conference 2017 7th International Conference on Affective Computing and Intelligent Interaction, ACII 2017 · July 2, 2017 In this paper, we propose a novel method to automatically construct emotional spoken language text corpus from written dialogs, and release a large scale Chinese emotional text dataset with short conversations extracted from thousands of fictions using the ... Full text Cite

Response to name: A dataset and a multimodal machine learning framework towards autism study

Conference 2017 7th International Conference on Affective Computing and Intelligent Interaction, ACII 2017 · July 2, 2017 In this paper, we propose a 'Response to Name Dataset' for autism spectrum disorder (ASD) study as well as a multimodal ASD auxiliary screening system based on machine learning. ASD children are characterized by their impaired interpersonal communication a ... Full text Cite

Robust Real-Time Distributed Optimal Control Based Energy Management in a Smart Grid

Journal Article IEEE Transactions on Smart Grid · July 1, 2017 With the integration of distributed generations and controllable loads, the power grid becomes geographically distributed with a time-varying topology. The operation conditions may change rapidly and frequently; thus, management and control of the smart gr ... Full text Cite

Speaker diarization system for autism children's real-life audio data

Conference Proceedings of 2016 10th International Symposium on Chinese Spoken Language Processing, ISCSLP 2016 · May 2, 2017 In this paper, we introduce several methods to improve the performance of speaker diarization system for autism children's real-life audio data. This system serves as the frontend module for further speech analysis. Our objective is to detect the children' ... Full text Cite

Reconstruction of Lamb wave dispersion curves by sparse representation with continuity constraints.

Journal Article The Journal of the Acoustical Society of America · February 2017 Ultrasonic Lamb waves are a widely used research tool for nondestructive structural health monitoring. They travel long distances with little attenuation, enabling the interrogation of large areas. To analyze Lamb wave propagation data, it is often importa ... Full text Cite

Locality sensitive discriminant analysis for speaker verification

Conference 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016 · January 17, 2017 In this paper, we apply Locality Sensitive Discriminant Analysis (LSDA) to speaker verification system for intersession variability compensation. As opposed to LDA which fails to discover the local geometrical structure of the data manifold, LSDA finds a p ... Full text Cite

An audio based piano performance evaluation method using deep neural network based acoustic modeling

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2017 In this paper, we propose an annotated piano performance evaluation dataset with 185 audio pieces and a method to evaluate the performance of piano beginners based on their audio recordings. The proposed framework includes three parts: piano key posterior ... Full text Cite

Countermeasures for automatic speaker verification replay spoofing attack: on data augmentation, feature representation, classification and fusion

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2017 The ongoing ASVspoof 2017 challenge aims to detect replay attacks for text dependent speaker verification. In this paper, we propose multiple replay spoofing countermeasure systems, with some of them boosting the CQCC-GMM baseline system after score level ... Full text Cite

End-to-end deep learning framework for speech paralinguistics detection based on perception aware spectrum

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2017 In this paper, we propose an end-to-end deep learning framework to detect speech paralinguistics using perception aware spectrum as input. Existing studies show that speech under cold has distinct variations of energy distribution on low frequency componen ... Full text Cite

A fast tracking algorithm for estimating ultrasonic signal time of flight in drilled shafts using Active Shape Models

Conference IEEE International Ultrasonics Symposium, IUS · November 1, 2016 The drilled shaft is an important substructure foundation in building construction. A drilled shaft needs to be placed precisely, with high accuracy, and must satisfy the diameter precision requirement. In order to measure the verticality and the diameter of a shaft, t ... Full text Cite

Notice of Removal Efficient misalignment-robust face recognition via locality-constrained representation

Conference 2016 IEEE International Conference on Image Processing (ICIP) · September 2016 Full text Cite

Identifying children with autism spectrum disorder based on their face processing abnormality: A machine learning framework.

Journal Article Autism research : official journal of the International Society for Autism Research · August 2016 The atypical face scanning patterns in individuals with Autism Spectrum Disorder (ASD) has been repeatedly discovered by previous research. The present study examined whether their face scanning patterns could be potentially useful to identify children wit ... Full text Cite

Speaker verification based on the fusion of speech acoustics and inverted articulatory signals

Journal Article Computer Speech and Language · March 1, 2016 We propose a practical, feature-level and score-level fusion approach by combining acoustic and estimated articulatory information for both text independent and text dependent speaker verification. From a practical point of view, we study how to improve sp ... Full text Cite

Automatic assessment of non-native accent degrees using phonetic level posterior and duration features from multiple languages

Conference 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015 · February 19, 2016 This paper presents an automatic non-native accent assessment approach using phonetic level posterior and duration features. In this method, instead of using conventional MFCC trained Gaussian Mixture Models (GMM), we use phonetic phoneme states as tokens ... Full text Cite

The SYSU system for the interspeech 2015 automatic speaker verification spoofing and countermeasures challenge

Conference 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015 · February 19, 2016 Many existing speaker verification systems are reported to be vulnerable against different spoofing attacks, for example speech synthesis, voice conversion, playback, etc. In order to detect these spoofed speech signals as a countermeasure, we propose a s ... Full text Cite

Generalized I-vector Representation with Phonetic Tokenizations and Tandem Features for both Text Independent and Text Dependent Speaker Verification

Journal Article Journal of Signal Processing Systems · February 1, 2016 This paper presents a generalized i-vector representation framework with phonetic tokenization and tandem features for text independent as well as text dependent speaker verification. In the conventional i-vector framework, the tokens for calculating the z ... Full text Cite

Entity disambiguation by knowledge and text jointly embedding

Conference CoNLL 2016 - 20th SIGNLL Conference on Computational Natural Language Learning, Proceedings · January 1, 2016 For most entity disambiguation systems, the secret recipes are feature representations for mentions and entities, most of which are based on Bag-of-Words (BoW) representations. Commonly, BoW has several drawbacks: (1) It ignores the intrinsic meaning of wo ... Full text Cite

Text-independent voice conversion using deep neural network based phonetic level features

Conference Proceedings - International Conference on Pattern Recognition · January 1, 2016 This paper presents a phonetically-aware joint density Gaussian mixture model (JD-GMM) framework for voice conversion that no longer requires parallel data from source speaker at the training stage. Considering that the phonetic level features contain text ... Full text Cite

On order-constrained transitive distance clustering

Conference 30th AAAI Conference on Artificial Intelligence, AAAI 2016 · January 1, 2016 We consider the problem of approximating order-constrained transitive distance (OCTD) and its clustering applications. Given any pairwise data, transitive distance (TD) is defined as the smallest possible "gap" on the set of paths connecting them. While su ... Cite

The SYSU system for CCPR 2016 multimodal emotion recognition challenge

Conference Communications in Computer and Information Science · January 1, 2016 In this paper, we propose a multimodal emotion recognition system that combines the information from the facial, text and speech data. First, we propose a residual network architecture within the convolutional neural networks (CNN) framework to improve the ... Full text Cite

Efficient autism spectrum disorder prediction with eye movement: A machine learning framework

Conference 2015 International Conference on Affective Computing and Intelligent Interaction, ACII 2015 · December 2, 2015 We propose an autism spectrum disorder (ASD) prediction system based on machine learning techniques. Our work features the novel development and application of machine learning methods over traditional ASD evaluation protocols. Specifically, we are interes ... Full text Cite

Innovations in the Use of Interactive Technology to Support Weight Management.

Journal Article Current obesity reports · December 2015 New and emerging mobile technologies are providing unprecedented possibilities for understanding and intervening on obesity-related behaviors in real time. However, the mobile health (mHealth) field has yet to catch up with the fast-paced development of te ... Full text Cite

Modified-prior plda based speaker recognition system

Journal Article Tianjin Daxue Xuebao (Ziran Kexue yu Gongcheng Jishu Ban)/Journal of Tianjin University Science and Technology · August 15, 2015 To reduce the negative impact on the performance of speaker recognition systems due to the duration mismatch between enrollment utterance and test utterance, a modified-prior PLDA method is proposed. The probability distribution function of i-vector was mo ... Full text Cite

Speaker verification with the mixture of Gaussian factor analysis based representation

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · August 4, 2015 This paper presents a generalized i-vector representation framework using the mixture of Gaussian (MoG) factor analysis for speaker verification. Conventionally, a single standard factor analysis is adopted to generate a low rank total variability subspace ... Full text Cite

Pitch estimation based on harmonic salience

Journal Article Shengxue Xuebao/Acta Acustica · March 1, 2015 A method based on harmonic salience is proposed for extracting the fundamental frequency from the speech signal. It first calculates the harmonic salience spectrum with an inhibiting factor, and sums the weighted salience of every harmonic partial. Finally ... Cite
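The comb-structure idea behind salience-based pitch estimation can be illustrated by plain harmonic summation: score each candidate fundamental by the spectral energy at its harmonic multiples and pick the maximum. A toy NumPy sketch (the paper's salience weighting and inhibiting factor are not reproduced):

```python
import numpy as np

def pitch_by_harmonic_summation(signal, sr, fmin=80.0, fmax=400.0, n_harm=5):
    """Toy pitch estimator: sum windowed-spectrum magnitudes at harmonic
    multiples of each candidate f0, weighting partial h by 1/h, and return
    the candidate with the largest accumulated salience."""
    n = len(signal)
    spec = np.abs(np.fft.rfft(signal * np.hanning(n)))
    candidates = np.arange(fmin, fmax, 1.0)
    salience = np.zeros(len(candidates))
    for i, f0 in enumerate(candidates):
        for h in range(1, n_harm + 1):
            bin_idx = int(round(f0 * h * n / sr))
            if bin_idx < len(spec):
                salience[i] += spec[bin_idx] / h   # down-weight higher partials
    return candidates[np.argmax(salience)]

sr = 8000
t = np.arange(sr) / sr
# 200 Hz tone with harmonics at 400 and 600 Hz
x = (np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
     + 0.3 * np.sin(2 * np.pi * 600 * t))
est = pitch_by_harmonic_summation(x, sr)
print(est)  # close to 200.0
```

The 1/h weighting is one common guard against picking a subharmonic (e.g. 100 Hz), whose comb also aligns with every second partial.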

Automatic intelligibility classification of sentence-level pathological speech

Journal Article Computer Speech and Language · January 1, 2015 Pathological speech usually refers to the condition of speech distortion resulting from atypicalities in voice and/or in the articulatory mechanisms owing to disease, illness or other physical or biological insult to the production system. Although automat ... Full text Cite

Locality constrained transitive distance clustering on speech data

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2015 The idea of developing unsupervised learning methods has received significant attention in recent years. An important application is whether one can train a high quality speaker verification model given large quantities of unlabeled speech data. Unsupervis ... Cite

Speech bandwidth expansion based on deep neural networks

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2015 This paper proposes a new speech bandwidth expansion method, which uses Deep Neural Networks (DNNs) to build high-order eigenspaces between the low frequency components and the high frequency components of the speech signal. A four-layer DNN is trained lay ... Cite

Modified-prior PLDA and score calibration for duration mismatch compensation in speaker recognition system

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2015 To deal with the performance degradation of speaker recognition due to duration mismatch between enrollment and test utterances, a novel strategy to modify the standard normal prior distribution of the i-vector during probabilistic linear discriminant anal ... Cite

Duration dependent covariance regularization in PLDA modeling for speaker verification

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2015 In this paper, we present a covariance regularized probabilistic linear discriminant analysis (CR-PLDA) model for text independent speaker verification. In the conventional simplified PLDA modeling, the covariance matrix used to capture the residual energi ... Cite

Melody extraction for vocal polyphonic music based on bayesian framework

Conference Proceedings - 2014 10th International Conference on Intelligent Information Hiding and Multimedia Signal Processing, IIH-MSP 2014 · December 24, 2014 In order to automatically extract the main melody contours from polyphonic music especially vocal melody songs, we present an effective approach based on a Bayesian framework. According to various information from the music signals, we use a pitch evolutio ... Full text Cite

An iterative framework for unsupervised learning in the PLDA based speaker verification

Conference Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, ISCSLP 2014 · October 24, 2014 We present an iterative and unsupervised learning approach for the speaker verification task. In conventional speaker verification, Probabilistic Linear Discriminant Analysis (PLDA) has been widely used as a supervised backend. However, PLDA requires fully ... Full text Cite

Intoxicated speech detection: A fusion framework with speaker-normalized hierarchical functionals and GMM supervectors

Journal Article Computer Speech and Language · March 1, 2014 Segmental and suprasegmental speech signal modulations offer information about paralinguistic content such as affect, age and gender, pathology, and speaker state. Speaker state encompasses medium-term, temporary physiological phenomena influenced by inter ... Full text Cite

Verification based ECG biometrics with cardiac irregular conditions using heartbeat level and segment level information fusion

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2014 We propose an ECG based robust human verification system for both healthy and cardiac irregular conditions using the heartbeat level and segment level information fusion. At the heartbeat level, we first propose a novel beat normalization and outlier remov ... Full text Cite

Simplified supervised i-vector modeling with application to robust and efficient language identification and speaker verification

Journal Article Computer Speech and Language · January 1, 2014 This paper presents a simplified and supervised i-vector modeling approach with applications to robust and efficient language identification and speaker verification. First, by concatenating the label vector and the linear regression matrix at the end of t ... Full text Cite

Speaker verification and spoken language identification using a generalized I-vector framework with phonetic tokenizations and tandem features

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2014 This paper presents a generalized i-vector framework with phonetic tokenizations and tandem features for speaker verification as well as language identification. First, the tokens for calculating the zero-order statistics are extended from the MFCC trained ... Cite

Automatic recognition of speaker physical load using posterior probability based features from acoustic and phonetic tokens

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2014 This paper presents an automatic speaker physical load recognition approach using posterior probability based features from acoustic and phonetic tokens. In this method, the tokens for calculating the posterior probability or zero-order statistics are exte ... Cite

Simplified and supervised i-vector modeling for speaker age regression

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2014 We propose a simplified and supervised i-vector modeling scheme for the speaker age regression task. The supervised i-vector is obtained by concatenating the label vector and the linear regression matrix at the end of the mean super-vector and the i-vector ... Full text Cite

Automatic vocal segments detection in popular music

Conference Proceedings - 9th International Conference on Computational Intelligence and Security, CIS 2013 · December 1, 2013 We propose a technique for the automatic vocal segments detection in an acoustical polyphonic music signal. We use a combination of several characteristics specific to singing voice as the feature and employ a Gaussian Mixture Model (GMM) classifier for vo ... Full text Cite

Pitch estimation based on harmonic salience

Journal Article Electronics Letters · November 7, 2013 The comb structure formed by the fundamental frequency and its harmonic partials in the spectrum is the important distinction between the pitch and the white noise or other coloured noises. A pitch estimation method based on harmonic salience is proposed w ... Full text Cite

Speaker verification using simplified and supervised i-vector modeling

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · October 18, 2013 This paper presents a simplified and supervised i-vector modeling framework that is applied in the task of robust and efficient speaker verification (SRE). First, by concatenating the mean supervector and the i-vector factor loading matrix with respectivel ... Full text Cite

Automatic speaker age and gender recognition using acoustic and prosodic level information fusion

Journal Article Computer Speech and Language · January 1, 2013 The paper presents a novel automatic speaker age and gender identification approach which combines seven different methods at both acoustic and prosodic levels to improve the baseline performance. The three baseline subsystems are (1) Gaussian mixture mode ... Full text Cite

Speaker verification based on fusion of acoustic and articulatory information

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2013 We propose a practical, feature-level fusion approach for combining acoustic and articulatory information in speaker verification task. We find that concatenating articulation features obtained from the measured speech production data with conventional Mel ... Cite

Multi-band long-term signal variability features for robust voice activity detection

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2013 In this paper, we propose robust features for the problem of voice activity detection (VAD). In particular, we extend the long term signal variability (LTSV) feature to accommodate multiple spectral bands. The motivation of the multi-band approach stems fr ... Cite
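The long-term signal variability (LTSV) measure extended in this paper can be sketched as follows: per frequency bin, normalize the spectrum across a window of frames, take its entropy over time, then take the variance of those entropies across frequency; speech-like non-stationary input scores higher than steady noise. A full-band NumPy sketch (the multi-band extension would apply it per sub-band), with illustrative synthetic inputs:

```python
import numpy as np

def ltsv(spectrogram, R=30):
    """One LTSV value for the last R frames of a (T, K) power spectrogram.

    For each frequency bin: normalize the spectrum over the R-frame window,
    compute its entropy across time, then return the variance of the
    per-bin entropies over frequency.
    """
    S = spectrogram[-R:]                                   # (R, K) analysis window
    p = S / (S.sum(axis=0, keepdims=True) + 1e-12)         # normalize over time
    H = -(p * np.log(p + 1e-12)).sum(axis=0)               # entropy per bin
    return H.var()

noise = np.ones((60, 64))                  # perfectly steady input
speech = np.ones((60, 64))
speech[::5] += np.arange(64)               # bursts whose size grows with bin index
print(ltsv(noise) < ltsv(speech))          # True: steady input has ~zero variability
```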

TRAP language identification system for RATS phase II evaluation

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2013 Automatic language identification or detection of audio data has become an important preprocessing step for speech/speaker recognition and audio data mining. In many surveillance applications, language detection has to be performed on highly degraded a ... Cite

Classifying language-related developmental disorders from speech cues: The promise and the potential confounds

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2013 Speech and spoken language cues offer a valuable means to measure and model human behavior. Computational models of speech behavior have the potential to support health care through assistive technologies, informed intervention, and efficient long-term mon ... Cite

Intelligibility classification of pathological speech using fusion of multiple subsystems

Conference 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012 · December 1, 2012 Pathological speech usually refers to the condition of speech distortion resulting from atypicalities in voice and/or in the articulatory mechanisms owing to disease, illness or other physical or biological insult to the production system. While automatic ... Cite

Speaker personality classification using systems based on acoustic-lexical cues and an optimal tree-structured Bayesian network

Conference 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012 · December 1, 2012 Automatic classification of human personality along the Big Five dimensions is an interesting problem with several practical applications. This paper makes some contributions in this regard. First, we propose a few automatically-derived personality-discri ... Cite

Speaker verification using Lasso based sparse total variability supervector with PLDA modeling

Conference 2012 Conference Handbook - Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2012 · December 1, 2012 In this paper, we propose a Lasso based framework to generate the sparse total variability supervectors (s-vectors). Rather than the factor analysis framework, which uses a low dimensional Eigenvoice subspace to represent the mean supervector, the proposed ... Cite

Speaker states recognition using latent factor analysis based Eigenchannel factor vector modeling

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · October 23, 2012 This paper presents an automatic speaker state recognition approach which models the factor vectors in the latent factor analysis framework improving upon the Gaussian Mixture Model (GMM) baseline performance. We investigate both intoxicated and affective ... Full text Cite

KNOWME: An energy-efficient multimodal body area network for physical activity monitoring

Journal Article Transactions on Embedded Computing Systems · August 1, 2012 The use of biometric sensors for monitoring an individual's health and related behaviors, continuously and in real time, promises to revolutionize healthcare in the near future. In an effort to better understand the complex interplay between one's medical c ... Full text Cite

KNOWME: A case study in wireless body area sensor network design

Journal Article IEEE Communications Magazine · May 16, 2012 Wireless body area sensing networks have the potential to revolutionize health care in the near term. The coupling of biosensors with a wireless infrastructure enables the real-time monitoring of an individual's health and related behaviors continuously, a ... Full text Cite

Recognition of physical activities in overweight Hispanic youth using KNOWME networks

Journal Article Journal of Physical Activity and Health · January 1, 2012 Background: KNOWME Networks is a wireless body area network with 2 triaxial accelerometers, a heart rate monitor, and mobile phone that acts as the data collection hub. One function of KNOWME Networks is to detect physical activity (PA) in overweight Hispa ... Full text Cite

Speaker verification using sparse representations on total variability I-vectors

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · December 1, 2011 In this paper, the sparse representation computed by l1-minimization with quadratic constraints is employed to model the i-vectors in the low dimensional total variability space after performing the Within-Class Covariance Normalization and Linear Discri ... Cite
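The sparse-representation idea here is to code a test i-vector as a sparse combination of a dictionary of training i-vectors. As a generic illustration of that kind of l1 problem, a plain ISTA solver for the unconstrained Lagrangian form (the paper itself poses l1-minimization with quadratic constraints, a related formulation):

```python
import numpy as np

def ista(D, y, lam=0.1, n_iter=500):
    """Plain ISTA for min_w 0.5*||y - D w||^2 + lam*||w||_1.

    Generic sparse-coding solver over a dictionary D; shown only to
    illustrate the l1 problem, not the paper's exact formulation.
    """
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of the gradient
    w = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = w - (D.T @ (D @ w - y)) / L        # gradient step on the quadratic term
        w = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft-thresholding
    return w

# toy dictionary: identity, so the solution is just soft-thresholded y
w = ista(np.eye(5), np.array([2.0, 0.0, 0.0, -1.5, 0.0]), lam=0.1)
print(w)  # sparse: nonzero only where y exceeds the threshold
```

With a real dictionary of stacked training i-vectors, the identities of the nonzero coefficients indicate which enrolled speakers best explain the test utterance.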

Intoxicated speech detection by fusion of speaker normalized hierarchical features and GMM supervectors

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · December 1, 2011 Speaker state recognition is a challenging problem due to speaker and context variability. Intoxication detection is an important area of paralinguistic speech research with potential real-world applications. In this work, we build upon a base set of vario ... Cite

Music structural segmentation by combining harmonic and timbral information

Conference Proceedings of the 12th International Society for Music Information Retrieval Conference, ISMIR 2011 · December 1, 2011 We propose a novel model for music structural segmentation aiming at combining harmonic and timbral information. We use two-level clustering with splitting initialization and random turbulence to produce segment labels using chroma and MFCC separately as f ... Cite

Optimal time-resource allocation for activity-detection via multimodal sensing

Conference BODYNETS 2009 - 4th International ICST Conference on Body Area Networks · November 29, 2011 The optimal allocation of measurements for activity-level detection in a wireless body area network (WBAN) for health-monitoring applications is considered. The WBAN with heterogeneous sensors is deployed in a simple star topology with the fusion center re ... Full text Cite

Robust talking face video verification using joint factor analysis and sparse representation on GMM mean shifted supervectors

Conference ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · August 18, 2011 It has been previously demonstrated that systems based on block-wise local features and Gaussian mixture models (GMM) are suitable for video-based talking face verification due to the best trade-off in terms of complexity, robustness and performance. In th ... Full text Cite

Optimal time-resource allocation for energy-efficient physical activity detection

Journal Article IEEE Transactions on Signal Processing · April 1, 2011 The optimal allocation of samples for physical activity detection in a wireless body area network for health-monitoring is considered. The number of biometric samples collected at the mobile device fusion center, from both device-internal and external Blue ... Full text Cite

Modeling high-level descriptions of real-life physical activities using latent topic modeling of multimodal sensor signals.

Conference Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference · January 2011 We propose a new methodology to model high-level descriptions of physical activities using multimodal sensor signals (ambulatory electrocardiogram (ECG) and accelerometer signals) obtained by a wearable wireless sensor network. We introduce a two-step stra ... Full text Cite

Robust ECG biometrics by fusing temporal and cepstral information

Conference Proceedings - International Conference on Pattern Recognition · November 18, 2010 The use of vital signs as a biometric is a potentially viable approach in a variety of application scenarios such as security and personalized health care. In this paper, a novel robust Electrocardiogram (ECG) biometric algorithm based on both temporal and ... Full text Cite

Multimodal physical activity recognition by fusing temporal and cepstral information.

Journal Article IEEE transactions on neural systems and rehabilitation engineering : a publication of the IEEE Engineering in Medicine and Biology Society · August 2010 A physical activity (PA) recognition algorithm for a wearable wireless sensor network using both ambulatory electrocardiogram (ECG) and accelerometer signals is proposed. First, in the time domain, the cardiac activity mean and the motion artifact noise of ... Full text Cite

Harmonic structure features for robust speaker recognition against channel effect

Conference 2nd International Symposium on Information Science and Engineering, ISISE 2009 · May 28, 2010 This paper proposes a novel feature set for robust speaker recognition, which is based on the harmonic structure of speech signals. Channel modulation effects are supposed to be weakened in the harmonic structure features, and furthermore the influence int ... Full text Cite

Long span prosodic features for speaker recognition

Journal Article Shengxue Xuebao/Acta Acustica · March 1, 2010 In this paper, we first give an introduction about speaker recognition techniques. Then a novel speaker verification method based on long span prosodic features is proposed. After speech is pre-processed by a voice activity detection module, and basic pros ... Cite

Music structure analysis based on timbre unit distribution

Journal Article Shengxue Xuebao/Acta Acustica · March 1, 2010 Music structure is not only an important means by which musical works express artists' ideas, but also an effective way for listeners to understand the meaning of the music. This paper proposes a timbre unit modeling method based on musical features, usin ... Cite

Combining five acoustic level modeling methods for automatic speaker age and gender recognition

Conference Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010 · January 1, 2010 This paper presents a novel automatic speaker age and gender identification approach which combines five different methods at the acoustic level to improve the baseline performance. The five subsystems are (1) Gaussian mixture model (GMM) system based on m ... Cite

A review on objective music structure analysis

Conference 2009 International Conference on Information and Multimedia Technology, ICIMT 2009 · December 1, 2009 This paper summarizes the applications and the state of the art of objective music structure analysis. Two principal types of methods, namely "state" and "sequence" approaches are reviewed after applications are presented. Two kinds of objective features, ... Full text Cite

Optimal allocation of time-resources for multihypothesis activity-level detection

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · August 20, 2009 The optimal allocation of samples for activity-level detection in a wireless body area network for health-monitoring applications is considered. A wireless body area network with heterogeneous sensors is deployed in a simple star topology with the fusion c ... Full text Cite

Energy-efficient multihypothesis activity-detection for health-monitoring applications.

Conference Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference · January 2009 Multi-hypothesis activity-detection using a wireless body area network is considered. A fusion center receives samples of biometric signals from heterogeneous sensors. Due to the different discrimination capabilities of each sensor, an optimized allocation ... Full text Cite

Automatic singing performance evaluation for untrained singers

Journal Article IEICE Transactions on Information and Systems · January 1, 2009 In this letter, we present an automatic approach of objective singing performance evaluation for untrained singers by relating acoustic measurements to perceptual ratings of singing voice quality. Several acoustic parameters and their combination features ... Full text Cite

Using SVM as back-end classifier for language identification

Journal Article Eurasip Journal on Audio, Speech, and Music Processing · December 12, 2008 Robust automatic language identification (LID) is a task of identifying the language from a short utterance spoken by an unknown speaker. One of the mainstream approaches named parallel phone recognition language modeling (PPRLM) has achieved a very good p ... Full text Cite

Cochannel speech separation using multi-pitch estimation and model based voiced sequential grouping

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · December 1, 2008 In this paper, a new cochannel speech separation algorithm using multi-pitch extraction and speaker model based sequential grouping is proposed. After auditory segmentation based on onset and offset analysis, robust multi-pitch estimation algorithm is perf ... Cite

An objective singing evaluation approach by relating acoustic measurements to perceptual ratings

Conference Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · December 1, 2008 This paper presents an objective singing quality evaluation approach based on a study of the relationship between acoustic measurements and perceptual ratings of singing voice quality. Individual perceptual criteria's contributions to the overall rating ar ... Cite

A study on singing performance evaluation criteria for untrained singers

Conference International Conference on Signal Processing Proceedings, ICSP · December 1, 2008 This paper describes a study of subjective criteria for untrained singers' singing voice quality evaluation, focusing on the perceptual aspects that have relatively strong acoustic implications. And the correlation among the individual perceptual criteria ... Full text Cite

Automatic language identification with discriminative language characterization based on SVM

Journal Article IEICE Transactions on Information and Systems · January 1, 2008 Robust automatic language identification (LID) is the task of identifying the language from a short utterance spoken by an unknown speaker. The mainstream approaches include parallel phone recognition language modeling (PPRLM), support vector machine (SVM) ... Full text Cite

Melody track selection using discriminative language model

Journal Article IEICE Transactions on Information and Systems · January 1, 2008 In this letter we focus on the task of selecting the melody track from a polyphonic MIDI file. Based on the intuition that music and language are similar in many aspects, we solve the selection problem by introducing an n-gram language model to learn the m ... Full text Cite

Spoken language identification using score vector modeling and support vector machine

Conference International Speech Communication Association - 8th Annual Conference of the International Speech Communication Association, Interspeech 2007 · December 1, 2007 The support vector machine (SVM) framework based on the generalized linear discriminant sequence (GLDS) kernel has been shown effective and is widely used in language identification tasks. In this paper, in order to compensate for the distortions due to inter-speaker ... Cite

Authentication and quality monitoring based on audio watermark for analog AM shortwave broadcasting

Conference Proceedings - 3rd International Conference on Intelligent Information Hiding and Multimedia Signal Processing, IIHMSP 2007 · December 1, 2007 A method based on audio watermarking techniques for authentication and monitoring broadcasting quality of existing analog amplitude modulation (AM) shortwave radio is presented. The content and number of extracted messages can be useful to authenticate the ... Full text Cite

Singing melody extraction in polyphonic music by harmonic tracking

Conference Proceedings of the 8th International Conference on Music Information Retrieval, ISMIR 2007 · December 1, 2007 This paper proposes an effective method for automatic melody extraction in polyphonic music, especially vocal melody songs. The method is based on subharmonic summation spectrum and harmonic structure tracking strategy. Performance of the method is evaluat ... Cite

The design of backend classifiers in PPRLM system for language identification

Conference Proceedings - Third International Conference on Natural Computation, ICNC 2007 · December 1, 2007 The design approach for classifying the backend features of the PPRLM (Parallel Phone Recognition and Language Modeling) system is demonstrated in this paper. A variety of features and their combinations extracted by language dependent recognizers were eva ... Full text Cite

A novel audio watermarking in wavelet domain

Conference Proceedings - 2006 International Conference on Intelligent Information Hiding and Multimedia Signal Processing, IIH-MSP 2006 · December 1, 2006 A novel approach is proposed for robust audio watermarking in the wavelet domain. It emphasizes enhancing security by dynamically modifying the embedding strategy. The modification is based on real-time changes of the watermark information and host audio. Witho ... Full text Cite

Iterative demodulation and decoding scheme with 16QAM

Journal Article Journal of Beijing Institute of Technology (English Edition) · September 1, 2006 An iterative demodulation and decoding scheme is analyzed, and modulation labeling is considered to be one of the crucial factors for this scheme. By analyzing the existing mapping design criterion, four aspects are found to be the key techniques for choosing a la ... Cite