Face-based Voice Conversion: Learning the Voice behind a Face

Hsiao Han Lu, Shao En Weng, Ya Fan Yen, Hong-Han Shuai, Wen-Huang Cheng

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

6 Scopus citations

Abstract

Zero-shot voice conversion (VC) trained by non-parallel data has gained a lot of attention in recent years. Previous methods usually extract speaker embeddings from audios and use them for converting the voices into different voice styles. Since there is a strong relationship between human faces and voices, a promising approach would be to synthesize various voice characteristics from face representation. Therefore, we introduce a novel idea of generating different voice styles from different human face photos, which can facilitate new applications, e.g., personalized voice assistants. However, the audio-visual relationship is implicit. Moreover, the existing VCs are trained on laboratory-collected datasets without speaker photos, while the datasets with both photos and audios are in-the-wild datasets. Directly replacing the target audio with the target photo and training on the in-the-wild dataset leads to noisy results. To address these issues, we propose a novel many-to-many voice conversion network, namely Face-based Voice Conversion (FaceVC), with a 3-stage training strategy. Quantitative and qualitative experiments on the LRS3-Ted dataset show that the proposed FaceVC successfully performs voice conversion according to the target face photos. Audio samples can be found on the demo website at https://facevc.github.io/.

Original languageEnglish
Title of host publicationMM 2021 - Proceedings of the 29th ACM International Conference on Multimedia
PublisherAssociation for Computing Machinery, Inc
Pages496-505
Number of pages10
ISBN (Electronic)9781450386517
DOIs
StatePublished - 17 Oct 2021
Event29th ACM International Conference on Multimedia, MM 2021 - Virtual, Online, China
Duration: 20 Oct 202124 Oct 2021

Publication series

NameMM 2021 - Proceedings of the 29th ACM International Conference on Multimedia

Conference

Conference29th ACM International Conference on Multimedia, MM 2021
Country/TerritoryChina
CityVirtual, Online
Period20/10/2124/10/21

Keywords

  • face-voice relationship
  • visual-audio generation
  • voice conversion

Fingerprint

Dive into the research topics of 'Face-based Voice Conversion: Learning the Voice behind a Face'. Together they form a unique fingerprint.

Cite this