TY - GEN
T1 - Face-based Voice Conversion
T2 - 29th ACM International Conference on Multimedia, MM 2021
AU - Lu, Hsiao-Han
AU - Weng, Shao-En
AU - Yen, Ya-Fan
AU - Shuai, Hong-Han
AU - Cheng, Wen-Huang
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/10/17
Y1 - 2021/10/17
AB - Zero-shot voice conversion (VC) trained on non-parallel data has gained considerable attention in recent years. Previous methods usually extract speaker embeddings from audio and use them to convert voices into different voice styles. Since there is a strong relationship between human faces and voices, a promising approach is to synthesize various voice characteristics from face representations. We therefore introduce a novel idea of generating different voice styles from different human face photos, which can facilitate new applications, e.g., personalized voice assistants. However, the audio-visual relationship is implicit. Moreover, existing VC models are trained on laboratory-collected datasets without speaker photos, whereas datasets containing both photos and audio are collected in the wild. Directly replacing the target audio with the target photo and training on an in-the-wild dataset leads to noisy results. To address these issues, we propose a novel many-to-many voice conversion network, Face-based Voice Conversion (FaceVC), with a three-stage training strategy. Quantitative and qualitative experiments on the LRS3-TED dataset show that the proposed FaceVC successfully performs voice conversion according to the target face photos. Audio samples are available on the demo website at https://facevc.github.io/.
KW - face-voice relationship
KW - visual-audio generation
KW - voice conversion
UR - http://www.scopus.com/inward/record.url?scp=85119353981&partnerID=8YFLogxK
DO - 10.1145/3474085.3475198
M3 - Conference contribution
AN - SCOPUS:85119353981
T3 - MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia
SP - 496
EP - 505
BT - MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 20 October 2021 through 24 October 2021
ER -