Conference · Conference on Human Factors in Computing Systems - Proceedings · May 11, 2024
Autism Spectrum Disorder (ASD) presents challenges in social interaction skill development, particularly in turn-taking. Digital interventions offer potential solutions for improving autistic children's social skills but often fall short in addressing specific coll ...
Journal Article · Computer Speech and Language · April 1, 2024
Partially fake audio, a variant of deep fake that involves manipulating audio utterances through the incorporation of fake or externally-sourced bona fide audio clips, constitutes a growing threat as an audio forgery attack impacting both human and artific ...
Conference · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2024
Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder that has gained global attention due to its prevalence. Clinical assessment measures rely heavily on manual scoring conducted by specialized physicians. However, this approach exhibits subject ...
Conference · Communications in Computer and Information Science · January 1, 2024
Current data augmentation methods for machine anomalous sound detection (MASD) suffer from insufficient data generated by real-world machines. Open datasets such as AudioSet are not tailored for machine sounds, and fake sounds created by generative models ...
Conference · Communications in Computer and Information Science · January 1, 2024
This paper describes the DKU-MSXF submission to track 4 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23). Our system pipeline contains voice activity detection, clustering-based diarization, overlapped speech detection, and target-speaker voi ...
Conference · Communications in Computer and Information Science · January 1, 2024
This paper introduces a real-time technique for simulating automotive engine sounds based on revolutions per minute (RPM) and pedal pressure data. We present a hybrid approach combining both sample-based and procedural methods. In the sample-based techniqu ...
Journal Article · Journal of Shanghai Jiaotong University (Science) · January 1, 2024
Common target speech separation methods directly estimate the target source, ignoring the interrelationship between different speakers at each frame. We propose a multiple-target speech separation (MTSS) model to simultaneously extract each speaker’s voice fr ...
Journal Article · IEEE Transactions on Affective Computing · January 1, 2024
One potential way to enhance the performance of facial expression recognition (FER) is to augment the training set by increasing the number of samples. By incorporating multiple FER datasets, deep learning models can extract more discriminative features. H ...
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2024
Personal voice activity detection (PVAD) is increasingly used in speech assistants. Traditional PVAD schemes extract the target speaker's embedding from existing query reference speech through a pre-trained speaker verification model. Consequently, the perfor ...
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2024
This paper introduces an innovative deep learning framework for parallel voice conversion to mitigate inherent risks associated with such systems. Our approach focuses on developing an invertible model capable of countering potential spoofing threats. Spec ...
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2024
In contrast to human speech, machine-generated sounds of the same type often exhibit consistent frequency characteristics and discernible temporal periodicity. However, leveraging these dual attributes in anomaly detection remains relatively under-explored ...
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2024
In this paper, we introduce a novel approach that unifies Automatic Speech Recognition (ASR) and speaker diarization in a cohesive framework. Utilizing the synergies between the two tasks, our method effectively extracts speaker-specific information from t ...
Journal Article · IEEE/ACM Transactions on Audio Speech and Language Processing · January 1, 2024
The performance of speaker verification systems can be adversely affected by time domain variations. However, limited research has been conducted on time-varying speaker verification due to the absence of appropriate datasets. This paper aims to investiga ...
Journal Article · IEEE/ACM Transactions on Audio Speech and Language Processing · January 1, 2024
This paper focuses on the application of Conformers in speaker verification. Conformers, initially designed for Automatic Speech Recognition (ASR), excel at modeling both local and global contexts within speech signals effectively. Building on this synerg ...
Journal Article · IEEE Transactions on Learning Technologies · January 1, 2024
Numerous children diagnosed with autism spectrum disorder (ASD) exhibit abnormal eye gaze patterns in communication and social interaction. In this study, we aim to investigate the effectiveness of the hide-and-seek virtual reality system (HSVRS) in improvi ...
Conference · 2024 IEEE International Conference on Multimedia and Expo Workshops, ICMEW 2024 · January 1, 2024
The paper describes the Wake Word Lipreading system developed by the WHU team for the ChatCLR Challenge 2024. Although Lipreading and Wake Word Spotting have seen significant development, exploration of pretrained frontends for Wake Word Lipreading (WWL) r ...
Conference · Proceedings of 2024 International Conference on Asian Language Processing, IALP 2024 · January 1, 2024
Dysarthria is a motor speech disorder commonly associated with conditions such as cerebral palsy, Parkinson's disease, amyotrophic lateral sclerosis, and stroke. Individuals with dysarthria typically exhibit significant speech difficulties, including impre ...
Journal Article · Journal of speech, language, and hearing research : JSLHR · November 2023
Purpose: This cross-sectional study aimed to depict expressive language profiles and clarify lexical-grammatical interrelationships in Mandarin-speaking preschoolers with autism spectrum disorder (ASD) during the administration of the simplified Chi ...
Journal Article · IEEE Transactions on Affective Computing · October 1, 2023
Behavioral observation plays an essential role in the diagnosis of Autism Spectrum Disorder (ASD) by analyzing children's atypical patterns in social activities (e.g., impaired social interaction, restricted interests, and repetitive behavior). To date, th ...
Journal Article · IEEE Transactions on Cognitive and Developmental Systems · September 1, 2023
Estimating gaze from a low-resolution (LR) facial image is a challenging task. Most current networks for gaze estimation focus on using face images of adequate resolution. Their performance degrades when the image resolution decreases due to information lo ...
Journal Article · IEEE Transactions on Affective Computing · April 1, 2023
Facial expression recognition (FER) accuracy is often affected by an individual's unique facial characteristics. Recognition performance can be improved if the influence from these physical characteristics is minimized. Using video instead of single image ...
Journal Article · Biomedical Signal Processing and Control · February 1, 2023
An electrolarynx (EL) is a medical device that generates speech for people who have lost their biological larynx. However, EL speech signals are unnatural and unintelligible due to the monotonous pitch and the mechanical excitation of the EL device. This paper ...
Journal Article · IEEE Transactions on Multimedia · January 1, 2023
Head pose estimation is an important step for many human-computer interaction applications such as face detection, facial recognition, and facial expression classification. Accurate head pose estimation benefits these applications that require face images ...
Journal Article · Computer Speech and Language · January 1, 2023
Modeling voices for multiple speakers and multiple languages with one speech synthesis system has been a challenge for a long time, especially in low-resource cases. This paper presents two approaches to achieve cross-lingual multi-speaker text-to-speech ( ...
Journal Article · IEEE/ACM Transactions on Audio Speech and Language Processing · January 1, 2023
The popularity and application of smart home devices have made far-field speaker verification an urgent need. However, speaker verification performance is unsatisfactory under far-field environments despite its significant improvements enabled by deep neur ...
Journal Article · Frontiers in psychiatry · January 2023
Background: A reduced or absent response to name (RTN) has been widely reported as an early specific indicator for autism spectrum disorder (ASD), while few studies have quantified the RTN of toddlers with ASD in an automatic way. The present ...
Journal Article · IEEE Transactions on Affective Computing · January 1, 2023
Capturing the dynamics of facial expression progression in video is an essential and challenging task for facial expression recognition (FER). In this article, we propose an effective framework to address this challenge. We develop a C3D-based network arch ...
Conference · Communications in Computer and Information Science · January 1, 2023
Textual escalation detection has been widely applied to e-commerce companies’ customer service systems to pre-alert and prevent potential conflicts. Similarly, acoustic-based escalation detection systems are also helpful in enhancing passengers’ safety and ...
Conference · Communications in Computer and Information Science · January 1, 2023
In this paper, we focus on improving the performance of the text-dependent speaker verification system in the scenario of limited training data. The deep learning based text-dependent speaker verification system generally needs a large-scale text-dependent ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2023
The accuracy and reliability of many speech processing systems may deteriorate under noisy conditions. This paper discusses robust audio anti-spoofing countermeasures for audio in noisy environments. Firstly, we attempt to use a pre-trained speech enhanceme ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2023
Most target speaker extraction methods use the target speaker embedding as reference information. However, the speaker embedding extracted by a speaker recognition module may not be optimal for the target speaker extraction tasks. In this paper, we propose ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2023
This paper proposes an approach for anomalous sound detection that incorporates outlier exposure and inlier modeling within a unified framework by multitask learning. While outlier exposure-based methods can extract features efficiently, they are not robust. ...
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2023
This paper describes the system developed by the WHU-Alibaba team for the Multimodal Information Based Speech Processing (MISP) 2022 Challenge. We extend the Sequence-to-Sequence Target-Speaker Voice Activity Detection framework to simultaneously detect mu ...
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2023
Target-speaker voice activity detection is currently a promising approach for speaker diarization in complex acoustic environments. This paper presents a novel Sequence-to-Sequence Target-Speaker Voice Activity Detection (Seq2Seq-TSVAD) method that can eff ...
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2023
An automatic speaker verification system aims to verify the speaker identity of a speech signal. However, a voice conversion system could manipulate a person's speech signal to make it sound like another speaker's voice and deceive the speaker verification ...
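The spoofing threat described in the entry above targets the embedding-comparison back end common to most automatic speaker verification systems. As a hedged illustration (toy embeddings and an arbitrary threshold, not any paper's actual system), cosine scoring makes the accept/reject decision like this:

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(a, b))

def verify(enroll_emb, test_emb, threshold=0.5):
    """Accept the trial if the similarity exceeds the decision threshold."""
    return cosine_score(enroll_emb, test_emb) >= threshold

rng = np.random.default_rng(0)
enroll = rng.normal(size=192)                 # enrolled speaker embedding (toy)
same = enroll + 0.1 * rng.normal(size=192)    # same speaker, slight variation
other = rng.normal(size=192)                  # a different speaker

print(verify(enroll, same))   # True: near-identical embeddings
print(verify(enroll, other))  # False: near-orthogonal embeddings
```

A voice conversion attack succeeds exactly when it pushes the converted utterance's embedding above this threshold, which is why invertible or countermeasure-aware conversion models matter.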
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2023
This paper proposes to pretrain Conformer with the automatic speech recognition (ASR) task for speaker verification. Conformer combines convolution neural network (CNN) and Transformer model for modeling local and global features, respectively. Recently, multi ...
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2023
Despite the great performance of language identification (LID), there is a lack of large-scale singing LID databases to support the research of singing language identification (SLID). This paper proposes an over-3200-hour dataset for singing language ...
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2023
The present paper proposes a waveform boundary detection system for audio spoofing attacks containing partially manipulated segments. Partially spoofed/fake audio, where part of the utterance is replaced, either with synthetic or natural audio clips, has r ...
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2023
This paper further explores our previous wake word spotting system, ranked 2nd in Track 1 of the MISP Challenge 2021. First, we investigate a robust unimodal approach based on 3D and 2D convolution and adopt the simple attention module (SimAM) for our syst ...
Conference · 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023 · January 1, 2023
Most multi-channel speaker extraction schemes use the target speaker's location information as a reference, which must be known in advance or derived from visual cues. In addition, memory and computation costs are enormous when the model deals with the fus ...
Conference · CEUR Workshop Proceedings · January 1, 2023
This paper presents our lessons learned from ADD2023 Track 3, Deepfake Algorithm Recognition (AR). In recent years, speech synthesis has made remarkable progress, and it has become increasingly difficult for human listeners to differentiate between sy ...
Conference · 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023 · January 1, 2023
Although Singing Voice Synthesis (SVS) has made great strides with Text-to-Speech (TTS) techniques, multilingual singing voice modeling remains relatively unexplored. This paper presents BiSinger, a bilingual pop SVS system for English and Chinese Mandarin ...
Conference · Proceedings - 2023 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2023 · January 1, 2023
Tumor cell detection plays a vital role in immunohistochemistry (IHC) quantitative analysis. While recent remarkable developments in fully-supervised deep learning have greatly contributed to the efficiency of this task, the necessity for manually annotati ...
Conference · 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023 · January 1, 2023
It is widely acknowledged that discriminative representation for speaker verification can be extracted from verbal speech. However, how much speaker information non-verbal vocalization carries is still a puzzle. This paper explores speaker verificatio ...
Conference · Proceedings of the IEEE International Conference on Computer Vision · January 1, 2023
The paradigm of large-scale pre-training followed by downstream fine-tuning has been widely employed in various object detection algorithms. In this paper, we reveal discrepancies in data, model, and task between the pre-training and fine-tuning procedure ...
Conference · Proceedings of the International Joint Conference on Neural Networks · January 1, 2023
Transformer-based network architectures have gradually replaced convolutional neural networks in computer vision. Compared with convolutional neural networks, Transformer is able to learn global information of images and has better feature extraction capab ...
Conference · DDAM 2022 - Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia · October 14, 2022
Audio deep synthesis techniques have been able to generate high-quality speech whose authenticity is difficult for humans to recognize. Meanwhile, many anti-spoofing systems have been developed to capture artifacts in the synthesized speech that are imperce ...
Conference · DDAM 2022 - Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia · October 14, 2022
This paper describes the deepfake audio detection system submitted to the Audio Deep Synthesis Detection (ADD) Challenge Track 3.2 and gives an analysis of score fusion. The proposed system is a score-level fusion of several light convolutional neural netw ...
Journal Article · EURASIP journal on audio, speech, and music processing · January 2022
Humans can recognize someone's identity through their voice and describe the timbral phenomena of voices. Likewise, the singing voice also has timbral phenomena. In vocal pedagogy, vocal teachers listen and then describe the timbral phenomena of their stud ...
Journal Article · IEEE/ACM Transactions on Audio Speech and Language Processing · January 1, 2022
The current success of deep learning largely benefits from the availability of large amounts of labeled data. However, collecting a large-scale dataset with human annotation can be expensive and sometimes difficult. Self-supervised learning thus attracts man ...
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2022
This paper describes the system developed by the DKU team for the MISP Challenge 2021. We present a two-stage approach consisting of end-to-end neural networks for the audio-visual wake word spotting task. We first process audio and video data to give them ...
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2022
With the development of deep learning, automatic speaker verification has made considerable progress over the past few years. However, designing a lightweight and robust system with limited computational resources is still a challenging problem. Traditiona ...
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2022
Nowadays, as more and more systems achieve good performance in traditional voice conversion (VC) tasks, people's attention gradually turns to VC tasks under extreme conditions. In this paper, we propose a novel method for zero-shot voice conversion. We aim ...
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2022
In this paper, we present the speaker diarization system for the Multichannel Multi-party Meeting Transcription Challenge (M2MeT) from team DKU-DukeECE. As highly overlapped speech exists in the dataset, we employ an x-vector-based target-speaker voice ...
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2022
Recently, attention mechanisms such as the squeeze-and-excitation module (SE) and the convolutional block attention module (CBAM) have achieved great success in deep learning-based speaker verification systems. This paper introduces an alternative effective yet s ...
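The "simple yet effective" attention family referenced above includes SimAM, a parameter-free module used in several of the wake-word and verification systems in this list. A minimal NumPy sketch of the published SimAM energy weighting (shapes and epsilon are illustrative; this is not any submission's exact code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simam(x, eps=1e-4):
    """Parameter-free SimAM attention over a (C, H, W) feature map.

    Each unit is reweighted by an energy term measuring how much it
    stands out from the other units in its channel.
    """
    c, h, w = x.shape
    n = h * w - 1
    mu = x.mean(axis=(1, 2), keepdims=True)
    d = (x - mu) ** 2                           # squared deviation per unit
    v = d.sum(axis=(1, 2), keepdims=True) / n   # per-channel variance estimate
    e_inv = d / (4.0 * (v + eps)) + 0.5         # inverse energy
    return x * sigmoid(e_inv)                   # attention-weighted features

feat = np.random.default_rng(0).normal(size=(4, 8, 8))
out = simam(feat)
print(out.shape)  # (4, 8, 8)
```

Because the sigmoid gate lies in (0, 1), SimAM attenuates rather than amplifies, and it adds no trainable parameters, which is why it suits lightweight speaker systems.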
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2022
In this paper, we propose an end-to-end target-speaker voice activity detection (E2E-TS-VAD) method for speaker diarization. First, a ResNet-based network extracts the frame-level speaker embeddings from the acoustic features. Then, the L2-normalized frame ...
Journal Article · IEEE/ACM Transactions on Audio Speech and Language Processing · January 1, 2022
In this paper, we propose a neural-network-based similarity measurement method to learn the similarity between any two speaker embeddings, where both previous and future contexts are considered. Moreover, we propose the segmental pooling strategy and joint ...
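One way to picture the segmental pooling mentioned in the entry above (a toy sketch under the assumption that it pools frame-level embeddings within consecutive fixed-length segments, not the paper's implementation):

```python
import numpy as np

def segmental_pooling(frames, seg_len):
    """Mean-pool frame-level embeddings within consecutive segments.

    frames: (T, D) array of frame embeddings; a trailing partial
    segment is pooled as-is.
    """
    segments = []
    for start in range(0, len(frames), seg_len):
        segments.append(frames[start:start + seg_len].mean(axis=0))
    return np.stack(segments)  # (ceil(T / seg_len), D)

frames = np.arange(12, dtype=float).reshape(6, 2)  # 6 frames, 2-dim embeddings
pooled = segmental_pooling(frames, seg_len=2)
print(pooled.shape)  # (3, 2)
```

Pooling per segment rather than per utterance preserves local temporal context, which is what a learned similarity over neighboring segments can exploit.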
Conference · Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2022
Automatic speaker verification has achieved remarkable progress in recent years. However, there is little research on cross-age speaker verification (CASV) due to insufficient relevant data. In this paper, we mine cross-age test sets based on the VoxCeleb ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2022
This paper describes our DKU-OPPO system for the 2022 Spoofing-Aware Speaker Verification (SASV) Challenge. First, we split the joint task into speaker verification (SV) and spoofing countermeasure (CM), two tasks that are optimized separately. For ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2022
This paper proposes an online target speaker voice activity detection system for speaker diarization tasks, which does not require a priori knowledge from the clustering-based diarization system to obtain the target speaker embeddings. First, we employ a R ...
Conference · Proceedings - International Conference on Pattern Recognition · January 1, 2022
In the post-pandemic era, online courses have been adopted universally. Manually assessing online course teaching quality requires significant time and professional pedagogy experience. To address this problem, we design an evaluation protocol and propose ...
Conference · Proceedings of 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2022 · January 1, 2022
Recent anti-spoofing systems focus on spoofing detection, where the task is only to determine whether the test audio is fake. However, few studies pay attention to identifying the methods used to generate fake speech. Common spoofing attack algo ...
Conference · 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022 · January 1, 2022
A reliable voice anti-spoofing countermeasure system needs to robustly protect automatic speaker verification (ASV) systems in various kinds of spoofing scenarios. However, the performance of countermeasure systems could be degraded by channel effects and ...
Conference · Proceedings of the International Joint Conference on Neural Networks · January 1, 2022
In pedestrian trajectory prediction, prediction accuracy depends largely on how the influence of social relations on the predicted subject is modeled. Social pooling and graph neural networks (GNN) are two traditional social feature processing metho ...
Conference · ICMI 2021 - Proceedings of the 2021 International Conference on Multimodal Interaction · October 18, 2021
Given that very few action recognition datasets collected in elevators contain multimodal data, we collect and propose our multimodal dataset investigating passenger safety and inappropriate elevator usage. Moreover, we present a novel framework (RGBP) to ...
Conference · ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction · October 18, 2021
As elevator accidents do great damage to people's lives and property, taking immediate responses to emergent calls for help is necessary. In most emergency cases, passengers must use the "SOS" button to contact the remote safety guard. However, this method ...
Conference · ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction · October 18, 2021
Nowadays, the safety of passengers within enclosed public spaces, such as elevators, is increasingly important. Though the passengers can click the "SOS" button to call the remote safety guard, the chances are that some passengers might lose the ...
Journal Article · IEEE Transactions on Affective Computing · April 1, 2021
Different subjects may express a specific expression in different ways due to inter-subject variabilities. In this work, besides training deep-learned facial expression feature (emotional feature), we also consider the influence of latent face identity fea ...
Conference · 2021 12th International Symposium on Chinese Spoken Language Processing, ISCSLP 2021 · January 24, 2021
In this paper, we propose a deep convolutional neural network-based acoustic word embedding system for code-switching query by example spoken term detection. Different from previous configurations, we combine audio data in two languages for training instea ...
Conference · 2021 12th International Symposium on Chinese Spoken Language Processing, ISCSLP 2021 · January 24, 2021
Convolutional Neural Network (CNN) or Long Short-term Memory (LSTM) based models with the input of spectrogram or waveforms are commonly used for deep learning based audio source separation. In this paper, we propose a Sliced Attention-based neural network ...
Conference · 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings · January 19, 2021
With the successful application of deep speaker embedding networks, the performance of speaker verification systems has significantly improved under clean and close-talking settings; however, unsatisfactory performance persists under noisy and far-field en ...
Journal Article · IEEE/ACM Transactions on Audio Speech and Language Processing · January 1, 2021
In this paper, we propose two different audio-based piano performance evaluation systems for beginners. The first is a sequential and modularized system, including three steps: Convolutional Neural Network (CNN)-based acoustic feature extraction, matching ...
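The matching step in a modularized evaluation pipeline of this kind is often realized with dynamic time warping (DTW), which aligns a performance to a reference despite timing differences. A minimal sketch (illustrative only, not the paper's matcher; the pitch sequences are toy values):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW distance between two 1-D feature sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

reference = np.array([60.0, 62.0, 64.0, 65.0])           # e.g. MIDI pitches
performance = np.array([60.0, 60.0, 62.0, 64.0, 65.0])   # first note held longer
print(dtw_distance(reference, performance))  # 0.0: same melody, different timing
```

A zero distance despite the extra repeated note is exactly the tempo-invariance that makes DTW a natural matcher before scoring a beginner's playing.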
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2021
In this paper, we propose an iterative framework for self-supervised speaker representation learning based on a deep neural network (DNN). The framework starts with training a self-supervision speaker embedding network by maximizing agreement between diffe ...
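"Maximizing agreement between differently augmented views" is the core of contrastive self-supervision; as a generic illustration (an NT-Xent-style loss in NumPy, not the paper's network or exact objective), agreement between paired views lowers the loss:

```python
import numpy as np

def nt_xent(z1, z2, tau=0.1):
    """Contrastive loss pulling paired embeddings together.

    z1, z2: (N, D) embeddings of two augmented views of N utterances;
    row i of z1 and row i of z2 come from the same utterance.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau  # temperature-scaled pairwise similarities
    # cross-entropy with the matching view as the positive class
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 16))
aligned = nt_xent(base, base + 0.01 * rng.normal(size=(8, 16)))  # matched views
shuffled = nt_xent(base, rng.normal(size=(8, 16)))               # unrelated views
print(aligned < shuffled)  # True: agreement between views lowers the loss
```

An iterative framework can then use the embeddings trained this way to generate pseudo-labels and retrain, refining the representation round by round.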
Conference · Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2021
In this paper, we present our CRMI-DKU system description for the Short-duration Speaker Verification Challenge (SdSVC) 2021. We introduce the whole pipeline of our cross-lingual speaker verification system, including data preprocessing, training strategy, ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2021
In this paper, we present AISHELL-3, a large-scale multi-speaker Mandarin speech corpus which could be used to train multi-speaker Text-To-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spanning across 218 native ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2021
This paper describes the systems developed by the DKU-Duke-Lenovo team for the Fearless Steps Challenge Phase III. For the speech activity detection (SAD) task, we employ a U-Net-based model which has not been used for SAD before, observing a DCF of 1.91 ...
Journal Article · Frontiers in computational neuroscience · January 2021
Autism Spectrum Disorder (ASD) is a group of lifelong neurodevelopmental disorders with complicated causes. A key symptom of ASD patients is their impaired interpersonal communication ability. Recent studies show that the face scanning patterns of individuals w ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2021
The 2020 Personalized Voice Trigger Challenge (PVTC2020) addresses two different research problems in a unified setup: joint wake-up word detection with speaker verification on close-talking single microphone data and far-field multi-channel microphone arra ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2021
Although deep neural networks are successful for many tasks in the speech domain, the high computational and memory costs of deep neural networks make it difficult to directly deploy high-performance neural network systems on low-resource embedded devices. ...
Conference · 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021 - Proceedings · January 1, 2021
This paper proposes a unified deep speaker embedding framework for modeling speech data with different sampling rates. Considering the narrowband spectrogram as a sub-image of the wideband spectrogram, we tackle the joint modeling problem of the mixed-ban ...
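The narrowband-as-sub-image view above amounts to treating the low-frequency rows of a wideband spectrogram as the narrowband one, since 8 kHz audio occupies the lower half of a 16 kHz spectrum. A toy sketch (bin counts are illustrative, not the paper's configuration):

```python
import numpy as np

def narrowband_view(wideband_spec, nb_bins):
    """Slice the low-frequency rows of a wideband spectrogram,
    treating them as a narrowband 'sub-image'."""
    return wideband_spec[:nb_bins, :]

wb = np.random.default_rng(0).random((256, 100))  # 256 freq bins x 100 frames
nb = narrowband_view(wb, nb_bins=128)             # lower half of the spectrum
print(nb.shape)  # (128, 100)
```

Under this view, one network can consume both bandwidths: narrowband inputs simply activate a sub-region of the same input plane.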
Conference · 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021 - Proceedings · January 1, 2021
In this paper, we propose an end-to-end Mandarin tone classification method for continuous speech utterances utilizing both the spectrogram and short-term context information as the input. Both spectrograms and context segment features are used to train the ...
ConferenceProceedings - 2021 21st International Conference on Software Quality, Reliability and Security Companion, QRS-C 2021 · January 1, 2021
An object detection system is a critical part of autonomous vehicle systems. To ensure the safety and efficiency of autonomous vehicles, object detection is required to satisfy high sensitivity and accuracy. However, the state-of-the-art object detection s ...
Full textCite
ConferenceProceedings - International Conference on Tools with Artificial Intelligence, ICTAI · January 1, 2021
In autonomous driving, the interaction of trajectory prediction has always served as the core. Designing a model to better capture the associated interactive information to improve the prediction accuracy is the key to the safety of autonomous driving. In ...
Full textCite
ConferenceICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · May 1, 2020
Despite the significant improvements in speaker recognition enabled by deep neural networks, unsatisfactory performance persists under noisy environments. In this paper, we train the speaker embedding network to learn the "clean" embedding of the noisy utt ...
ConferenceICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · May 1, 2020
This paper presents a far-field text-dependent speaker verification database named HI-MIA. We aim to meet the data requirement for far-field microphone array based speaker verification since most of the publicly available databases are single channel close ...
Journal ArticleIEEE/ACM Transactions on Audio Speech and Language Processing · January 1, 2020
In this article, our recent efforts on directly modeling utterance-level aggregation for speaker and language recognition are summarized. First, an on-the-fly data loader for efficient network training is proposed. The data loader acts as a bridge between t ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2020
Recently, Convolutional Neural Network (CNN) and Long short-term memory (LSTM) based models have been introduced to deep learning-based target speaker separation. In this paper, we propose an Attention-based neural network (Atss-Net) in the spectrogram dom ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2020
In this paper, we focus on the task of small-footprint keyword spotting under the far-field scenario. Far-field environments are commonly encountered in real-life speech applications, causing severe degradation of performance due to room reverberation and ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2020
The INTERSPEECH 2020 Far-Field Speaker Verification Challenge (FFSVC 2020) addresses three different research problems under well-defined conditions: far-field text-dependent speaker verification from single microphone array, far-field text-independent spe ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2020
This paper describes the systems developed by the DKU team for the Fearless Steps Challenge Phase-02 competition. For the Speech Activity Detection task, we start with the Long Short-Term Memory (LSTM) system and then apply the ResNet-LSTM improvement. Our ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2020
Speaker diarization can be described as the process of extracting sequential speaker embeddings from an audio stream and clustering them according to speaker identities. Nowadays, deep neural network based approaches like x-vector have been widely adopted ...
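The diarization recipe this abstract describes — extract per-segment speaker embeddings, then cluster them by identity — can be illustrated with a toy greedy clustering pass. This is a minimal sketch in pure Python: the 2-D "embeddings", threshold, and centroid update are hypothetical stand-ins, not the x-vector system itself.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def greedy_cluster(embeddings, threshold=0.8):
    # Assign each segment to the first cluster whose centroid is
    # similar enough; otherwise start a new cluster (new speaker).
    centroids, labels = [], []
    for emb in embeddings:
        for idx, c in enumerate(centroids):
            if cosine(emb, c) >= threshold:
                labels.append(idx)
                # Update the centroid by simple averaging.
                centroids[idx] = [(x + y) / 2 for x, y in zip(c, emb)]
                break
        else:
            labels.append(len(centroids))
            centroids.append(list(emb))
    return labels

# Toy 2-D "embeddings": two well-separated speakers.
segs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
print(greedy_cluster(segs))  # → [0, 0, 1, 1]
```

Real systems replace the greedy pass with agglomerative or spectral clustering over a full similarity matrix, but the segment-embedding-then-cluster structure is the same.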
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2020
In recent years, high-fidelity speech has been synthesized by end-to-end text-to-speech models. However, accessing and controlling speech attributes such as speaker identity, prosody, and emotion in a text-to-speech system remains a challenge. This paper pres ...
ConferenceProceedings - International Conference on Pattern Recognition · January 1, 2020
In recent years, surveillance cameras have been widely deployed in public places, and the general crime rate has been reduced significantly due to these ubiquitous devices. Usually, these cameras provide cues and evidence after crimes are committed, while they a ...
ConferenceProceedings - International Conference on Pattern Recognition · January 1, 2020
Autism spectrum disorder (ASD) is a neurodevelopmental disorder, which causes deficits in social lives. Early screening of ASD for young children is important to reduce the impact of ASD on people's lives. Traditional screening methods mainly rely on proto ...
Conference2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019 · November 1, 2019
In this paper, we describe our submitted DKU-Tencent system for the oriental language recognition AP18-OLR Challenge. Our system pipeline consists of three main components, including data augmentation, frame-level feature extraction, and utterance-level ...
Conference2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019 · November 1, 2019
Alcohol intoxication can affect people both physically and psychologically, and one's speech will also become different. However, detecting the intoxicated state from the speech is a challenging task. In this paper, we first implement the baseline model wi ...
Conference2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos, ACIIW 2019 · September 1, 2019
Spatial-temporal structure of expression frames plays a critical role in the task of video based facial expression recognition (FER). In this paper, we propose a 3D CNN based framework to learn the spatial-temporal structure from expression frames for vide ...
Journal ArticleComputer Speech and Language · July 1, 2019
Autism Spectrum Disorder (ASD), a neurodevelopmental disability, has become one of the high incidence diseases among children. Studies indicate that early diagnosis and intervention treatments help to achieve positive longitudinal outcomes. In this paper, ...
ConferenceICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · May 1, 2019
Pitch plays a significant role in understanding a tone based language like Mandarin. In this paper, we present a new method that estimates F0 contour for electrolaryngeal (EL) speech enhancement in Mandarin. Our system explores the usage of phonetic featur ...
ConferenceICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · May 1, 2019
In this paper, we present an end-to-end language identification framework, the attention-based Convolutional Neural Network-Bidirectional Long-short Term Memory (CNN-BLSTM). The model operates at the utterance level, which means the utterance-level dec ...
Journal ArticleIEEE Transactions on Vehicular Technology · March 1, 2019
Vehicle platooning systems are often equipped with vehicle-to-vehicle (V2V) communication technologies to improve both the road efficiency and road safety by exchanging vehicle information over wireless networks to maintain relatively small inter-vehicle d ...
ConferenceLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2019
With the increasing popularity of portable eye tracking devices, one can conveniently use them to find fixation points, i.e., the location and region one is attracted by and looking at. However, region of interest alone is not enough to fully support furth ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2019
This paper describes a conditional neural network architecture for Mandarin Chinese polyphone disambiguation. The system is composed of a bidirectional recurrent neural network component acting as a sentence encoder to accumulate the context correlations, ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2019
This paper introduces our approaches for the orca activity and continuous sleepiness tasks in the Interspeech ComParE Challenge 2019. For the orca activity detection task, we extract deep embeddings using several deep convolutional neural networks, followe ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2019
This paper describes our DKU replay detection system for the ASVspoof 2019 challenge. The goal is to develop a spoofing countermeasure for automatic speaker recognition in the physical access scenario. We leverage the countermeasure system pipeline from four asp ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2019
Despite the significant improvements in speaker recognition enabled by deep neural networks, unsatisfactory performance persists under far-field scenarios due to the effects of the long range fading, room reverberation, and environmental noises. In this st ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2019
In this paper, we focus on the far-field end-to-end text-dependent speaker verification task with a small-scale far-field text-dependent dataset and a large-scale close-talking text-independent database for training. First, we show that simulating far-fiel ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2019
In this paper, we present the DKU system for the speaker recognition task of the VOiCES from a distance challenge 2019. We investigate the whole system pipeline for the far-field speaker verification, including data pre-processing, short-term spectral feat ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2019
More and more neural network approaches have achieved considerable improvement upon submodules of speaker diarization system, including speaker change detection and segment-wise speaker embedding extraction. Still, in the clustering stage, traditional algo ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2019
In this paper, we present the system submission for the NIST 2018 Speaker Recognition Evaluation by DKU Speech and Multi-Modal Intelligent Information Processing (SMIIP) Lab. We explore various kinds of state-of-the-art front-end extractors as well as back ...
ConferenceICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · September 10, 2018
A novel interpretable end-to-end learning scheme for language identification is proposed. It is in line with the classical GMM i-vector methods both theoretically and practically. In the end-to-end pipeline, a general encoding layer is employed on top of t ...
ConferenceICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · September 10, 2018
A novel learnable dictionary encoding layer is proposed in this paper for end-to-end language identification. It is in line with the conventional GMM i-vector approach both theoretically and practically. We imitate the mechanism of traditional GMM training ...
Conference2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2018 - Proceedings · July 2, 2018
Lexical content variability in different utterances is the key challenge for text-independent speaker verification. In this paper, we investigate using the supervector, which has the ability to reduce the impact of lexical content mismatch among different utterance ...
Conference2018 11th International Symposium on Chinese Spoken Language Processing, ISCSLP 2018 - Proceedings · July 2, 2018
In the task of the unsupervised query by example spoken term detection (QbE-STD), we concatenate the features extracted by a Self-Organizing Map (SOM) and features learned by an unsupervised GMM based model at the feature level to enhance the performance. ...
Conference2018 11th International Symposium on Chinese Spoken Language Processing, ISCSLP 2018 - Proceedings · July 2, 2018
In this paper, we apply the NetFV and NetVLAD layers for the end-to-end language identification task. NetFV and NetVLAD layers are the differentiable implementations of the standard Fisher Vector and Vector of Locally Aggregated Descriptors (VLAD) methods, ...
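NetVLAD and NetFV are differentiable relaxations of classical encodings; the hard-assignment VLAD they build on can be sketched in a few lines. The toy descriptors and two-codeword codebook below are illustrative assumptions, not data from the paper.

```python
def vlad_encode(descriptors, codebook):
    # Classic hard-assignment VLAD: for each descriptor, find the
    # nearest codeword and accumulate the residual (x - c_k).
    dim = len(codebook[0])
    enc = [[0.0] * dim for _ in codebook]
    for x in descriptors:
        # Nearest codeword by squared Euclidean distance.
        k = min(range(len(codebook)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(x, codebook[i])))
        for d in range(dim):
            enc[k][d] += x[d] - codebook[k][d]
    # Flatten into a single K*D vector (L2 normalization omitted).
    return [v for row in enc for v in row]

codebook = [(0.0, 0.0), (1.0, 1.0)]
descs = [(0.1, 0.0), (0.9, 1.1)]
print([round(v, 6) for v in vlad_encode(descs, codebook)])  # → [0.1, 0.0, -0.1, 0.1]
```

The NetVLAD layer replaces the hard nearest-codeword assignment with a softmax over learnable cluster centers, which makes the whole encoding trainable by backpropagation.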
Conference2018 11th International Symposium on Chinese Spoken Language Processing, ISCSLP 2018 - Proceedings · July 2, 2018
This paper presents the acquisition of the Duke Kunshan University Jinan University Electromagnetic Articulography (DKU-JNU-EMA) database in terms of aligned acoustics and articulatory data on Mandarin and Chinese dialects. This database currently includes ...
ConferenceICNC-FSKD 2017 - 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery · June 21, 2018
In this paper, we propose an audio-based piano performance evaluation system for piano learning, aiming at giving objective feedback to piano beginners so that their self-practicing could be more efficient. We aim to build a system which could eval ...
Journal ArticlePattern Recognition · April 1, 2018
The rapid advancement of mobile technology has explosively popularized mobile devices (e.g., iPhone, iPad). A large number of mobile devices provide great convenience and cost effectiveness for speaker recognition based applications. However, the c ...
Journal ArticleInternational Journal of Robust and Nonlinear Control · April 1, 2018
Semi-Markovian jump systems are more general than Markovian jump systems in modeling practical systems. On the other hand, the finite-time stochastic stability is also more effective than stochastic stability in practical systems. This paper focuses on the ...
ConferenceCEUR Workshop Proceedings · January 1, 2018
Detection of bird species from bird songs is a challenging and meaningful task. Two scenarios are presented in the BirdCLEF challenge this year: monophone and soundscape. We trained convolutional neural networks with both spectrograms extracted from r ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2018
The classical i-vectors and the latest end-to-end deep speaker embeddings are the two representative categories of utterance-level representations in automatic speaker verification systems. Traditionally, once i-vectors or deep speaker embeddings are extra ...
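Once i-vectors or deep speaker embeddings are extracted, a common baseline back-end is length normalization followed by cosine scoring against the enrollment representation. The sketch below is generic, with made-up vectors and threshold; it is not this paper's back-end.

```python
import math

def length_norm(v):
    # Project the embedding onto the unit sphere.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_score(enroll, test):
    # After length normalization, the inner product is the cosine score.
    e, t = length_norm(enroll), length_norm(test)
    return sum(a * b for a, b in zip(e, t))

def verify(enroll, test, threshold=0.5):
    # Accept the trial when the score clears a decision threshold.
    return cosine_score(enroll, test) >= threshold

print(verify([2.0, 0.0], [1.0, 0.2]))  # same-ish direction → True
print(verify([2.0, 0.0], [0.0, 1.0]))  # orthogonal → False
```

PLDA scoring, mentioned elsewhere in this list, generalizes this by modeling within- and between-speaker variability instead of treating all directions equally.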
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2018
The goal of the ongoing ComParE 2018 Atypical Affect sub-challenge is to recognize the emotional states of atypical individuals. In this work, we present three modeling methods under the end-to-end learning framework, namely CNN combined with extended feat ...
ConferenceSpeaker and Language Recognition Workshop, ODYSSEY 2018 · January 1, 2018
In this paper, we explore the encoding/pooling layer and loss function in the end-to-end speaker and language recognition system. First, a unified and interpretable end-to-end system for both speaker and language recognition is developed. It accepts variab ...
ConferenceProceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 · November 6, 2017
This paper addresses deep face recognition (FR) problem under open-set protocol, where ideal face features are expected to have smaller maximal intra-class distance than minimal inter-class distance under a suitably chosen metric space. However, few existi ...
ConferenceProceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017 · July 2, 2017
Electrolarynx (EL) is a speaking-aid device that helps laryngectomees who have their larynx removed to generate voice. However, the voice generated by EL is unnatural and unintelligible due to its flat pitch and strong vibration noise. Targeting these chal ...
Conference2017 7th International Conference on Affective Computing and Intelligent Interaction, ACII 2017 · July 2, 2017
In this paper, we propose a novel method to automatically construct emotional spoken language text corpus from written dialogs, and release a large scale Chinese emotional text dataset with short conversations extracted from thousands of fictions using the ...
Conference2017 7th International Conference on Affective Computing and Intelligent Interaction, ACII 2017 · July 2, 2017
In this paper, we propose a 'Response to Name Dataset' for autism spectrum disorder (ASD) study as well as a multimodal ASD auxiliary screening system based on machine learning. ASD children are characterized by their impaired interpersonal communication a ...
Journal ArticleIEEE Transactions on Smart Grid · July 1, 2017
With the integration of distributed generations and controllable loads, the power grid becomes geographically distributed with a time-varying topology. The operation conditions may change rapidly and frequently; thus, management and control of the smart gr ...
ConferenceProceedings of 2016 10th International Symposium on Chinese Spoken Language Processing, ISCSLP 2016 · May 2, 2017
In this paper, we introduce several methods to improve the performance of the speaker diarization system for autistic children's real-life audio data. This system serves as the frontend module for further speech analysis. Our objective is to detect the children' ...
Journal ArticleThe Journal of the Acoustical Society of America · February 2017
Ultrasonic Lamb waves are a widely used research tool for nondestructive structural health monitoring. They travel long distances with little attenuation, enabling the interrogation of large areas. To analyze Lamb wave propagation data, it is often importa ...
Conference2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016 · January 17, 2017
In this paper, we apply Locality Sensitive Discriminant Analysis (LSDA) to speaker verification system for intersession variability compensation. As opposed to LDA which fails to discover the local geometrical structure of the data manifold, LSDA finds a p ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2017
In this paper, we propose an annotated piano performance evaluation dataset with 185 audio pieces and a method to evaluate the performance of piano beginners based on their audio recordings. The proposed framework includes three parts: piano key posterior ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2017
The ongoing ASVspoof 2017 challenge aims to detect replay attacks for text dependent speaker verification. In this paper, we propose multiple replay spoofing countermeasure systems, with some of them boosting the CQCC-GMM baseline system after score level ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2017
In this paper, we propose an end-to-end deep learning framework to detect speech paralinguistics using perception aware spectrum as input. Existing studies show that speech under cold has distinct variations of energy distribution on low frequency componen ...
ConferenceIEEE International Ultrasonics Symposium, IUS · November 1, 2016
The drilled shaft is an important substructure foundation in building construction. A drilled shaft needs to be placed precisely with high accuracy and satisfy the diameter precision requirement. In order to measure the verticality and the diameter of a shaft, t ...
Journal ArticleAutism research : official journal of the International Society for Autism Research · August 2016
The atypical face scanning patterns in individuals with Autism Spectrum Disorder (ASD) have been repeatedly discovered by previous research. The present study examined whether their face scanning patterns could be potentially useful to identify children wit ...
Journal ArticleComputer Speech and Language · March 1, 2016
We propose a practical, feature-level and score-level fusion approach by combining acoustic and estimated articulatory information for both text independent and text dependent speaker verification. From a practical point of view, we study how to improve sp ...
Conference2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015 · February 19, 2016
This paper presents an automatic non-native accent assessment approach using phonetic level posterior and duration features. In this method, instead of using conventional MFCC trained Gaussian Mixture Models (GMM), we use phonetic phoneme states as tokens ...
Conference2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015 · February 19, 2016
Many existing speaker verification systems are reported to be vulnerable to different spoofing attacks, for example speech synthesis, voice conversion, playback, etc. In order to detect these spoofed speech signals as a countermeasure, we propose a s ...
Journal ArticleJournal of Signal Processing Systems · February 1, 2016
This paper presents a generalized i-vector representation framework with phonetic tokenization and tandem features for text independent as well as text dependent speaker verification. In the conventional i-vector framework, the tokens for calculating the z ...
ConferenceCoNLL 2016 - 20th SIGNLL Conference on Computational Natural Language Learning, Proceedings · January 1, 2016
For most entity disambiguation systems, the secret recipes are feature representations for mentions and entities, most of which are based on Bag-of-Words (BoW) representations. Commonly, BoW has several drawbacks: (1) It ignores the intrinsic meaning of wo ...
ConferenceProceedings - International Conference on Pattern Recognition · January 1, 2016
This paper presents a phonetically-aware joint density Gaussian mixture model (JD-GMM) framework for voice conversion that no longer requires parallel data from source speaker at the training stage. Considering that the phonetic level features contain text ...
Conference30th AAAI Conference on Artificial Intelligence, AAAI 2016 · January 1, 2016
We consider the problem of approximating order-constrained transitive distance (OCTD) and its clustering applications. Given any pairwise data, transitive distance (TD) is defined as the smallest possible "gap" on the set of paths connecting them. While su ...
ConferenceCommunications in Computer and Information Science · January 1, 2016
In this paper, we propose a multimodal emotion recognition system that combines the information from the facial, text and speech data. First, we propose a residual network architecture within the convolutional neural networks (CNN) framework to improve the ...
Conference2015 International Conference on Affective Computing and Intelligent Interaction, ACII 2015 · December 2, 2015
We propose an autism spectrum disorder (ASD) prediction system based on machine learning techniques. Our work features the novel development and application of machine learning methods over traditional ASD evaluation protocols. Specifically, we are interes ...
Journal ArticleCurrent obesity reports · December 2015
New and emerging mobile technologies are providing unprecedented possibilities for understanding and intervening on obesity-related behaviors in real time. However, the mobile health (mHealth) field has yet to catch up with the fast-paced development of te ...
Journal ArticleTianjin Daxue Xuebao (Ziran Kexue yu Gongcheng Jishu Ban)/Journal of Tianjin University Science and Technology · August 15, 2015
To reduce the negative impact on the performance of speaker recognition systems due to the duration mismatch between enrollment utterance and test utterance, a modified-prior PLDA method is proposed. The probability distribution function of i-vector was mo ...
ConferenceICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · August 4, 2015
This paper presents a generalized i-vector representation framework using the mixture of Gaussian (MoG) factor analysis for speaker verification. Conventionally, a single standard factor analysis is adopted to generate a low rank total variability subspace ...
Journal ArticleShengxue Xuebao/Acta Acustica · March 1, 2015
A method based on harmonic salience is proposed for extracting the fundamental frequency from the speech signal. It first calculates the harmonic salience spectrum with an inhibiting factor, and sums the weighted salience of every harmonic partial. Finally ...
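The harmonic-salience idea — score each candidate f0 by the spectral energy at its harmonic partials and pick the best-supported candidate — can be sketched on a toy line spectrum. The geometric decay weighting and candidate list are illustrative assumptions; the paper's inhibiting factor, which suppresses sub-octave errors, is omitted here.

```python
def harmonic_salience(spectrum, f0, n_harmonics=5, decay=0.8):
    # Sum the weighted magnitudes at integer multiples of the candidate
    # f0; higher partials get geometrically smaller weights.
    return sum((decay ** (h - 1)) * spectrum.get(h * f0, 0.0)
               for h in range(1, n_harmonics + 1))

def estimate_pitch(spectrum, candidates):
    # Pick the candidate f0 with the largest harmonic salience.
    return max(candidates, key=lambda f0: harmonic_salience(spectrum, f0))

# Toy line spectrum with partials at 100, 200, 300 Hz.
spec = {100: 1.0, 200: 0.8, 300: 0.6}
print(estimate_pitch(spec, [50, 100, 200]))  # → 100
```

Note that the sub-octave candidate (50 Hz) still collects salience from the true partials, which is exactly the error mode an inhibiting factor is designed to penalize.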
Journal ArticleComputer Speech and Language · January 1, 2015
Pathological speech usually refers to the condition of speech distortion resulting from atypicalities in voice and/or in the articulatory mechanisms owing to disease, illness or other physical or biological insult to the production system. Although automat ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2015
The idea of developing unsupervised learning methods has received significant attention in recent years. An important application is whether one can train a high quality speaker verification model given large quantities of unlabeled speech data. Unsupervis ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2015
This paper proposes a new speech bandwidth expansion method, which uses Deep Neural Networks (DNNs) to build high-order eigenspaces between the low frequency components and the high frequency components of the speech signal. A four-layer DNN is trained lay ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2015
To deal with the performance degradation of speaker recognition due to duration mismatch between enrollment and test utterances, a novel strategy to modify the standard normal prior distribution of the i-vector during probabilistic linear discriminant anal ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2015
In this paper, we present a covariance regularized probabilistic linear discriminant analysis (CR-PLDA) model for text independent speaker verification. In the conventional simplified PLDA modeling, the covariance matrix used to capture the residual energi ...
ConferenceProceedings - 2014 10th International Conference on Intelligent Information Hiding and Multimedia Signal Processing, IIH-MSP 2014 · December 24, 2014
In order to automatically extract the main melody contours from polyphonic music especially vocal melody songs, we present an effective approach based on a Bayesian framework. According to various information from the music signals, we use a pitch evolutio ...
ConferenceProceedings of the 9th International Symposium on Chinese Spoken Language Processing, ISCSLP 2014 · October 24, 2014
We present an iterative and unsupervised learning approach for the speaker verification task. In conventional speaker verification, Probabilistic Linear Discriminant Analysis (PLDA) has been widely used as a supervised backend. However, PLDA requires fully ...
Journal ArticleComputer Speech and Language · March 1, 2014
Segmental and suprasegmental speech signal modulations offer information about paralinguistic content such as affect, age and gender, pathology, and speaker state. Speaker state encompasses medium-term, temporary physiological phenomena influenced by inter ...
ConferenceICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2014
We propose an ECG based robust human verification system for both healthy and cardiac irregular conditions using the heartbeat level and segment level information fusion. At the heartbeat level, we first propose a novel beat normalization and outlier remov ...
Journal ArticleComputer Speech and Language · January 1, 2014
This paper presents a simplified and supervised i-vector modeling approach with applications to robust and efficient language identification and speaker verification. First, by concatenating the label vector and the linear regression matrix at the end of t ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2014
This paper presents a generalized i-vector framework with phonetic tokenizations and tandem features for speaker verification as well as language identification. First, the tokens for calculating the zero-order statistics are extended from the MFCC trained ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2014
This paper presents an automatic speaker physical load recognition approach using posterior probability based features from acoustic and phonetic tokens. In this method, the tokens for calculating the posterior probability or zero-order statistics are exte ...
ConferenceICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · January 1, 2014
We propose a simplified and supervised i-vector modeling scheme for the speaker age regression task. The supervised i-vector is obtained by concatenating the label vector and the linear regression matrix at the end of the mean super-vector and the i-vector ...
ConferenceProceedings - 9th International Conference on Computational Intelligence and Security, CIS 2013 · December 1, 2013
We propose a technique for automatic vocal segment detection in an acoustical polyphonic music signal. We use a combination of several characteristics specific to the singing voice as the feature and employ a Gaussian Mixture Model (GMM) classifier for vo ...
Journal ArticleElectronics Letters · November 7, 2013
The comb structure formed by the fundamental frequency and its harmonic partials in the spectrum is the important distinction between the pitch and the white noise or other coloured noises. A pitch estimation method based on harmonic salience is proposed w ...
ConferenceICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · October 18, 2013
This paper presents a simplified and supervised i-vector modeling framework that is applied in the task of robust and efficient speaker verification (SRE). First, by concatenating the mean supervector and the i-vector factor loading matrix with respectivel ...
Journal ArticleComputer Speech and Language · January 1, 2013
The paper presents a novel automatic speaker age and gender identification approach which combines seven different methods at both acoustic and prosodic levels to improve the baseline performance. The three baseline subsystems are (1) Gaussian mixture mode ...
ConferenceProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2013
We propose a practical, feature-level fusion approach for combining acoustic and articulatory information in speaker verification task. We find that concatenating articulation features obtained from the measured speech production data with conventional Mel ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2013
In this paper, we propose robust features for the problem of voice activity detection (VAD). In particular, we extend the long term signal variability (LTSV) feature to accommodate multiple spectral bands. The motivation of the multi-band approach stems fr ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2013
Automatic language identification or detection of audio data has become an important preprocessing step for speech/speaker recognition and audio data mining. In many surveillance applications, language detection has to be performed on highly degraded a ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · January 1, 2013
Speech and spoken language cues offer a valuable means to measure and model human behavior. Computational models of speech behavior have the potential to support health care through assistive technologies, informed intervention, and efficient long-term mon ...
Conference · 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012 · December 1, 2012
Pathological speech usually refers to the condition of speech distortion resulting from atypicalities in voice and/or in the articulatory mechanisms owing to disease, illness or other physical or biological insult to the production system. While automatic ...
Conference · 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012 · December 1, 2012
Automatic classification of human personality along the Big Five dimensions is an interesting problem with several practical applications. This paper makes some contributions in this regard. First, we propose a few automatically-derived personality-discri ...
Conference · 2012 Conference Handbook - Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2012 · December 1, 2012
In this paper, we propose a Lasso based framework to generate the sparse total variability supervectors (s-vectors). Rather than the factor analysis framework, which uses a low dimensional Eigenvoice subspace to represent the mean supervector, the proposed ...
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · October 23, 2012
This paper presents an automatic speaker state recognition approach which models the factor vectors in the latent factor analysis framework, improving upon the Gaussian Mixture Model (GMM) baseline performance. We investigate both intoxicated and affective ...
Journal Article · Transactions on Embedded Computing Systems · August 1, 2012
The use of biometric sensors for monitoring an individual's health and related behaviors, continuously and in real time, promises to revolutionize healthcare in the near future. In an effort to better understand the complex interplay between one's medical c ...
Journal Article · IEEE Communications Magazine · May 16, 2012
Wireless body area sensing networks have the potential to revolutionize health care in the near term. The coupling of biosensors with a wireless infrastructure enables the real-time monitoring of an individual's health and related behaviors continuously, a ...
Journal Article · Journal of Physical Activity and Health · January 1, 2012
Background: KNOWME Networks is a wireless body area network with 2 triaxial accelerometers, a heart rate monitor, and a mobile phone that acts as the data collection hub. One function of KNOWME Networks is to detect physical activity (PA) in overweight Hispa ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · December 1, 2011
In this paper, the sparse representation computed by l1-minimization with quadratic constraints is employed to model the i-vectors in the low-dimensional total variability space after performing the Within-Class Covariance Normalization and Linear Discri ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · December 1, 2011
Speaker state recognition is a challenging problem due to speaker and context variability. Intoxication detection is an important area of paralinguistic speech research with potential real-world applications. In this work, we build upon a base set of vario ...
Conference · Proceedings of the 12th International Society for Music Information Retrieval Conference, ISMIR 2011 · December 1, 2011
We propose a novel model for music structural segmentation aiming at combining harmonic and timbral information. We use two-level clustering with splitting initialization and random turbulence to produce segment labels using chroma and MFCC separately as f ...
Conference · BODYNETS 2009 - 4th International ICST Conference on Body Area Networks · November 29, 2011
The optimal allocation of measurements for activity-level detection in a wireless body area network (WBAN) for health-monitoring applications is considered. The WBAN with heterogeneous sensors is deployed in a simple star topology with the fusion center re ...
Conference · ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings · August 18, 2011
It has been previously demonstrated that systems based on block-wise local features and Gaussian mixture models (GMM) are suitable for video-based talking face verification due to the best trade-off in terms of complexity, robustness and performance. In th ...
Journal Article · IEEE Transactions on Signal Processing · April 1, 2011
The optimal allocation of samples for physical activity detection in a wireless body area network for health-monitoring is considered. The number of biometric samples collected at the mobile device fusion center, from both device-internal and external Blue ...
Conference · Annual International Conference of the IEEE Engineering in Medicine and Biology Society · January 2011
We propose a new methodology to model high-level descriptions of physical activities using multimodal sensor signals (ambulatory electrocardiogram (ECG) and accelerometer signals) obtained by a wearable wireless sensor network. We introduce a two-step stra ...
Conference · Proceedings - International Conference on Pattern Recognition · November 18, 2010
The use of vital signs as a biometric is a potentially viable approach in a variety of application scenarios such as security and personalized health care. In this paper, a novel robust Electrocardiogram (ECG) biometric algorithm based on both temporal and ...
Journal Article · IEEE Transactions on Neural Systems and Rehabilitation Engineering · August 2010
A physical activity (PA) recognition algorithm for a wearable wireless sensor network using both ambulatory electrocardiogram (ECG) and accelerometer signals is proposed. First, in the time domain, the cardiac activity mean and the motion artifact noise of ...
Conference · 2nd International Symposium on Information Science and Engineering, ISISE 2009 · May 28, 2010
This paper proposes a novel feature set for robust speaker recognition, which is based on the harmonic structure of speech signals. Channel modulation effects are expected to be weakened in the harmonic structure features, and furthermore the influence int ...
Journal Article · Shengxue Xuebao/Acta Acustica · March 1, 2010
In this paper, we first give an introduction about speaker recognition techniques. Then a novel speaker verification method based on long span prosodic features is proposed. After speech is pre-processed by a voice activity detection module, and basic pros ...
Journal Article · Shengxue Xuebao/Acta Acustica · March 1, 2010
Music structure is not only an important form through which musical works express artists' ideas, but also an effective way for listeners to understand the meaning of the music. This paper proposes a timbre unit modeling method based on musical features, usin ...
Conference · Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010 · January 1, 2010
This paper presents a novel automatic speaker age and gender identification approach which combines five different methods at the acoustic level to improve the baseline performance. The five subsystems are (1) Gaussian mixture model (GMM) system based on m ...
Conference · 2009 International Conference on Information and Multimedia Technology, ICIMT 2009 · December 1, 2009
This paper summarizes the applications and the state of the art of objective music structure analysis. Two principal types of methods, namely "state" and "sequence" approaches are reviewed after applications are presented. Two kinds of objective features, ...
Conference · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · August 20, 2009
The optimal allocation of samples for activity-level detection in a wireless body area network for health-monitoring applications is considered. A wireless body area network with heterogeneous sensors is deployed in a simple star topology with the fusion c ...
Conference · Annual International Conference of the IEEE Engineering in Medicine and Biology Society · January 2009
Multi-hypothesis activity-detection using a wireless body area network is considered. A fusion center receives samples of biometric signals from heterogeneous sensors. Due to the different discrimination capabilities of each sensor, an optimized allocation ...
Journal Article · IEICE Transactions on Information and Systems · January 1, 2009
In this letter, we present an automatic approach to objective singing performance evaluation for untrained singers by relating acoustic measurements to perceptual ratings of singing voice quality. Several acoustic parameters and their combination features ...
Journal Article · Eurasip Journal on Audio, Speech, and Music Processing · December 12, 2008
Robust automatic language identification (LID) is the task of identifying the language from a short utterance spoken by an unknown speaker. One of the mainstream approaches, named parallel phone recognition language modeling (PPRLM), has achieved a very good p ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · December 1, 2008
In this paper, a new cochannel speech separation algorithm using multi-pitch extraction and speaker-model-based sequential grouping is proposed. After auditory segmentation based on onset and offset analysis, a robust multi-pitch estimation algorithm is perf ...
Conference · Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH · December 1, 2008
This paper presents an objective singing quality evaluation approach based on a study of the relationship between acoustic measurements and perceptual ratings of singing voice quality. Individual perceptual criteria's contributions to the overall rating ar ...
Conference · International Conference on Signal Processing Proceedings, ICSP · December 1, 2008
This paper describes a study of subjective criteria for evaluating untrained singers' singing voice quality, focusing on the perceptual aspects that have relatively strong acoustic implications. The correlation among the individual perceptual criteria ...
Journal Article · IEICE Transactions on Information and Systems · January 1, 2008
Robust automatic language identification (LID) is the task of identifying the language from a short utterance spoken by an unknown speaker. The mainstream approaches include parallel phone recognition language modeling (PPRLM), support vector machine (SVM) ...
Journal Article · IEICE Transactions on Information and Systems · January 1, 2008
In this letter, we focus on the task of selecting the melody track from a polyphonic MIDI file. Based on the intuition that music and language are similar in many aspects, we solve the selection problem by introducing an n-gram language model to learn the m ...
Conference · International Speech Communication Association - 8th Annual Conference of the International Speech Communication Association, Interspeech 2007 · December 1, 2007
The support vector machine (SVM) framework based on the generalized linear discriminant sequence (GLDS) kernel has been shown to be effective and is widely used in language identification tasks. In this paper, in order to compensate for the distortions due to inter-speaker ...
Conference · Proceedings - 3rd International Conference on Intelligent Information Hiding and Multimedia Signal Processing, IIHMSP 2007 · December 1, 2007
A method based on audio watermarking techniques for authentication and monitoring broadcasting quality of existing analog amplitude modulation (AM) shortwave radio is presented. The content and number of extracted messages can be useful to authenticate the ...
Conference · Proceedings of the 8th International Conference on Music Information Retrieval, ISMIR 2007 · December 1, 2007
This paper proposes an effective method for automatic melody extraction in polyphonic music, especially vocal melody songs. The method is based on subharmonic summation spectrum and harmonic structure tracking strategy. Performance of the method is evaluat ...
Conference · Proceedings - Third International Conference on Natural Computation, ICNC 2007 · December 1, 2007
This paper demonstrates a design approach for classifying the backend features of the PPRLM (Parallel Phone Recognition and Language Modeling) system. A variety of features and their combinations extracted by language-dependent recognizers were eva ...
Conference · Proceedings - 2006 International Conference on Intelligent Information Hiding and Multimedia Signal Processing, IIH-MSP 2006 · December 1, 2006
A novel approach is proposed for robust audio watermarking in the wavelet domain. It emphasizes enhancing security by dynamically modifying the embedding strategy. The modification is based on real-time changes of the watermark information and host audio. Witho ...
Journal Article · Journal of Beijing Institute of Technology (English Edition) · September 1, 2006
An iterative demodulation and decoding scheme is analyzed, and modulation labeling is considered to be one of the crucial factors in this scheme. By analyzing the existing mapping design criterion, four aspects are identified as key techniques for choosing a la ...