Modality Translation Learning for Joint Speech-Text Model

Pin-Yen Liu, Jen-Tzung Chien

Research output: Contribution to journal › Conference article › peer-review

3 Scopus citations

Abstract

Recent research on speech models that are jointly pre-trained with text has revealed their promising potential to enhance speech representations by encoding both speech and text within a shared space. However, these models often suffer from interference between the speech and text modalities, which makes cross-modality alignment difficult to achieve. Furthermore, previous evaluations of these models have focused on neutral speech scenarios; their effectiveness on domain-shifted speech, notably emotional speech, remains largely unexplored. In this study, a modality translation model is proposed that aligns the speech and text modalities in a shared space through speech-to-text translation, and this shared representation is harnessed to address the challenge of emotional speech recognition. Experimental results show that the proposed method achieves an absolute improvement of about 3% in word error rate compared with existing speech models.
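
As a rough illustration of the idea described in the abstract, the sketch below shows a shared encoder that consumes both speech features and text tokens, plus a translation head that maps the shared speech states to text-token logits. This is only a minimal conceptual sketch, not the paper's architecture: the module names, dimensions, and the PyTorch formulation are all assumptions.

```python
# Minimal, illustrative sketch (not the authors' implementation) of encoding
# speech and text in a shared space and translating speech states to text.
# All sizes and names below are assumptions made for this example.
import torch
import torch.nn as nn

class SharedSpeechTextEncoder(nn.Module):
    def __init__(self, vocab_size=1000, feat_dim=80, d_model=256):
        super().__init__()
        # Separate front-ends project each modality into the shared dimension.
        self.speech_proj = nn.Linear(feat_dim, d_model)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # A single Transformer encoder is shared by both modalities.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Translation head: maps shared speech states to text-token logits.
        self.to_text = nn.Linear(d_model, vocab_size)

    def encode_speech(self, speech_feats):   # (B, T, feat_dim)
        return self.shared_encoder(self.speech_proj(speech_feats))

    def encode_text(self, token_ids):        # (B, L)
        return self.shared_encoder(self.text_embed(token_ids))

    def translate(self, speech_feats):
        # Speech-to-text translation through the shared space.
        return self.to_text(self.encode_speech(speech_feats))  # (B, T, vocab)

# Toy forward pass with random features, just to show the tensor shapes involved.
model = SharedSpeechTextEncoder()
speech = torch.randn(2, 120, 80)      # batch of 2 utterances, 120 frames each
logits = model.translate(speech)      # (2, 120, 1000) text-token logits
print(logits.shape)
```

In such a setup, the shared encoder would be what allows text supervision to shape the speech representation; a sequence-level loss (for example CTC or cross-entropy against the transcript) applied to the translation head would be one plausible training objective, though the paper's actual objective is not specified here.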

Original language: English
Pages (from-to): 772-776
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOIs
State: Published - 2024
Event: 25th Interspeech Conference 2024 - Kos Island, Greece
Duration: 1 Sep 2024 - 5 Sep 2024

Keywords

  • cross-modality representation
  • speech recognition
  • speech representations
  • speech-text pre-training
