Audio-Visual Mandarin Electrolaryngeal Speech Voice Conversion

Yung Lun Chien, Hsin Hao Chen, Ming Chi Yen, Shu Wei Tsai, Hsin Min Wang, Yu Tsao, Tai Shih Chi

Research output: Contribution to journal › Conference article › peer-review

Abstract

The electrolarynx is a commonly used assistive device that helps patients whose vocal cords have been removed regain the ability to speak. Although the electrolarynx can generate excitation signals in place of the vocal cords, electrolaryngeal (EL) speech differs markedly from natural (NL) speech in both naturalness and intelligibility. Many deep-learning-based models have been applied to electrolaryngeal speech voice conversion (ELVC), which converts EL speech to NL speech. In this study, we propose a multimodal voice conversion (VC) model that integrates acoustic and visual information into a unified network. We compared different pre-trained models as visual feature extractors and evaluated the effectiveness of the extracted features for the ELVC task. The experimental results show that the proposed multimodal VC model outperforms single-modal models on both objective and subjective metrics, suggesting that integrating visual information can significantly improve the quality of ELVC.
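
The sketch below illustrates, in PyTorch, the general kind of audio-visual fusion the abstract describes: EL-speech acoustic features and per-frame lip-image features are embedded separately, concatenated frame by frame, and decoded into converted speech features. All layer sizes, the use of mel-spectrogram inputs, the choice of fusion by concatenation, and the bidirectional LSTM are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class MultimodalELVC(nn.Module):
    """Hypothetical audio-visual ELVC model (not the paper's design)."""

    def __init__(self, n_mels=80, visual_dim=512, hidden=256):
        super().__init__()
        # Acoustic branch: encode EL-speech mel-spectrogram frames.
        self.audio_enc = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        # Visual branch: project per-frame lip embeddings (e.g., taken
        # from a pre-trained visual feature extractor) into the same space.
        self.visual_enc = nn.Sequential(
            nn.Linear(visual_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        # Fusion by concatenation, temporal modeling, then decoding
        # back to NL-speech mel-spectrogram frames.
        self.rnn = nn.LSTM(2 * hidden, hidden, batch_first=True,
                           bidirectional=True)
        self.decoder = nn.Linear(2 * hidden, n_mels)

    def forward(self, el_mels, lip_feats):
        # el_mels:   (batch, frames, n_mels)     EL-speech acoustic features
        # lip_feats: (batch, frames, visual_dim) time-aligned lip features
        a = self.audio_enc(el_mels)
        v = self.visual_enc(lip_feats)
        fused, _ = self.rnn(torch.cat([a, v], dim=-1))
        return self.decoder(fused)  # predicted NL-speech mels


# Toy usage: 2 utterances, 100 frames each.
model = MultimodalELVC()
el = torch.randn(2, 100, 80)
lips = torch.randn(2, 100, 512)
print(model(el, lips).shape)  # torch.Size([2, 100, 80])
```

A single-modal baseline corresponds to dropping the visual branch and feeding only the acoustic embedding to the recurrent layer; the comparison the abstract reports is between such variants and the fused model.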

Original language: English
Pages (from-to): 5023-5026
Number of pages: 4
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2023-August
DOIs
State: Published - 2023
Event: 24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 - 24 Aug 2023

Keywords

  • Electrolaryngeal speech
  • feature extractor
  • lip images
  • multimodal learning
  • voice conversion
