Zero-Shot Face-Based Voice Conversion: Bottleneck-Free Speech Disentanglement in the Real-World Scenario

Shao En Weng, Hong Han Shuai, Wen Huang Cheng

Research output: Conference contribution, peer-reviewed

Abstract

A face often comes with a voice, and a person's appearance can be strongly related to how they sound. In this work, we study how a face can be converted to a voice, a task known as face-based voice conversion. Because no clean dataset pairs faces with speech, models are hard to train and produce low-quality output due to background noise or echo in the available data. Excess redundant information in the face-to-voice mapping also causes the model to synthesize a generic style of speech. Previous work tried to disentangle speech by adjusting a bottleneck, but the appropriate bottleneck size is hard to determine. We therefore propose a bottleneck-free strategy for speech disentanglement. To avoid synthesizing a generic style of speech, we use framewise facial embeddings. We also apply adversarial learning with a multi-scale discriminator to improve output quality, and add a self-attention module that focuses on content-related features for in-the-wild data. Quantitative experiments show that our method outperforms previous work.

Original language: English
Title of host publication: AAAI-23 Technical Tracks 11
Editors: Brian Williams, Yiling Chen, Jennifer Neville
Publisher: AAAI Press
Pages: 13718-13726
Number of pages: 9
ISBN (electronic): 9781577358800
Publication status: Published - 27 Jun 2023
Event: 37th AAAI Conference on Artificial Intelligence, AAAI 2023 - Washington, United States
Duration: 7 Feb 2023 to 14 Feb 2023

Publication series

Name: Proceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI 2023
Volume: 37

Conference

Conference: 37th AAAI Conference on Artificial Intelligence, AAAI 2023
Country/Territory: United States
City: Washington
Period: 7/02/23 to 14/02/23
