TY - JOUR
T1 - Modality Translation Learning for Joint Speech-Text Model
AU - Liu, Pin-Yen
AU - Chien, Jen-Tzung
N1 - Publisher Copyright:
© 2024 International Speech Communication Association. All rights reserved.
PY - 2024
Y1 - 2024
N2 - Recent research on speech models that are jointly pre-trained with text has revealed their promising potential to enhance speech representations by encoding both speech and text within a shared space. However, these models often struggle with interference between the speech and text modalities and hardly achieve cross-modality alignment. Furthermore, previous evaluations of these models have focused on neutral speech scenarios; their effectiveness on domain-shifted speech, notably emotional speech, has remained largely unexplored in existing works. In this study, a modality translation model is proposed to align the speech and text modalities in a shared space for speech-to-text translation, with the aim of harnessing this shared representation to address the challenge of emotional speech recognition. Experimental results show that the proposed method achieves about a 3% absolute improvement in word error rate compared with speech models.
AB - Recent research on speech models that are jointly pre-trained with text has revealed their promising potential to enhance speech representations by encoding both speech and text within a shared space. However, these models often struggle with interference between the speech and text modalities and hardly achieve cross-modality alignment. Furthermore, previous evaluations of these models have focused on neutral speech scenarios; their effectiveness on domain-shifted speech, notably emotional speech, has remained largely unexplored in existing works. In this study, a modality translation model is proposed to align the speech and text modalities in a shared space for speech-to-text translation, with the aim of harnessing this shared representation to address the challenge of emotional speech recognition. Experimental results show that the proposed method achieves about a 3% absolute improvement in word error rate compared with speech models.
KW - cross-modality representation
KW - speech recognition
KW - speech representations
KW - speech-text pre-training
UR - http://www.scopus.com/inward/record.url?scp=85210682359&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2024-675
DO - 10.21437/Interspeech.2024-675
M3 - Conference article
AN - SCOPUS:85210682359
SN - 2308-457X
SP - 772
EP - 776
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 25th Interspeech Conference 2024
Y2 - 1 September 2024 through 5 September 2024
ER -