Improving Spoofing Capability for End-to-end Any-to-many Voice Conversion
Audio deep synthesis techniques can now generate high-quality speech whose authenticity is difficult for humans to judge. Meanwhile, many anti-spoofing systems have been developed to capture artifacts in synthesized speech that are imperceptible to human hearing, and a continuously escalating race between attack and defense in voice deepfakes has begun. To further improve the probability of successfully deceiving anti-spoofing systems, we propose a fully end-to-end, any-to-many voice conversion method based on a non-autoregressive structure, together with two lightweight but effective post-processing strategies: silence replacement and global noise perturbation. Experimental results show that the proposed method outperforms current baselines in fooling several state-of-the-art anti-spoofing systems. It also achieves better naturalness and speaker similarity, and therefore shows high deception performance against human listeners as well.
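The abstract names the two post-processing strategies without specifying how they are computed. As an illustration only, the sketch below shows one plausible reading: silence replacement is taken to mean overwriting low-energy frames of the converted waveform with time-aligned frames from a reference recording, and global noise perturbation is taken to mean adding Gaussian noise to the whole waveform at a fixed signal-to-noise ratio. The function names, the energy-threshold silence detector, the frame length, and the default threshold and SNR values are all assumptions made for this example, not details of the proposed method.

```python
import math
import random


def frame_energy_db(frame):
    """Mean power of a frame in dB, floored to avoid log10(0)."""
    power = sum(s * s for s in frame) / max(len(frame), 1)
    return 10.0 * math.log10(max(power, 1e-12))


def silence_replacement(converted, donor, frame_len=160, threshold_db=-50.0):
    """Overwrite low-energy frames of `converted` with the aligned frames of `donor`.

    Assumption: `donor` is a time-aligned reference waveform (for example,
    genuine speech), and silence is detected with a simple energy threshold.
    """
    n = min(len(converted), len(donor))
    out = list(converted)
    for start in range(0, n, frame_len):
        end = min(start + frame_len, n)
        if frame_energy_db(converted[start:end]) < threshold_db:
            out[start:end] = donor[start:end]
    return out


def global_noise_perturbation(wave, snr_db=40.0, seed=0):
    """Add zero-mean Gaussian noise to every sample at the target SNR (in dB).

    Assumption: the noise is white and scaled from the utterance-level power.
    """
    rng = random.Random(seed)
    signal_power = sum(s * s for s in wave) / max(len(wave), 1)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in wave]


def make_example():
    """Build a toy pair: one silent frame followed by one voiced frame."""
    voiced = [0.5 * math.sin(2.0 * math.pi * 5.0 * i / 160.0) for i in range(160)]
    converted = [0.0] * 160 + voiced
    donor = [0.001] * 320
    return converted, donor


if __name__ == "__main__":
    converted, donor = make_example()
    patched = silence_replacement(converted, donor)
    perturbed = global_noise_perturbation(patched, snr_db=40.0, seed=0)
    print(len(perturbed), patched[0], patched[160] == converted[160])
```

In this toy case the first frame is digitally silent and is replaced by the donor frame, while the voiced frame is left untouched; the noise is then applied to the entire waveform rather than to selected regions, which is what "global" is assumed to mean here.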