TY - GEN
T1 - Learning Contrastive Emotional Nuances in Speech Synthesis
AU - Ngo, Bryan Gautama
AU - Rohmatillah, Mahdin
AU - Chien, Jen-Tzung
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Prosody is a crucial speech feature in emotional text-to-speech (TTS), as different emotions have distinct prosodic characteristics. Existing works in emotional TTS have primarily utilized the emotion labels in a dataset by applying an auxiliary emotion classification loss to enhance emotional nuances in the model. However, this approach may only partially leverage the potential of emotion labels. Accordingly, this paper proposes a supervised contrastive approach to effectively utilize emotion labels and enable the model to distinguish the prosody of different emotions. Furthermore, this work also explores unsupervised contrastive learning for the case where emotion labels are missing in emotional TTS. In particular, the proposed TTS architecture enables cross-speaker emotion transfer, allowing for accurate speech generation even without specific prosody from a target speaker. The experimental results on emotional datasets demonstrate the effectiveness of the proposed method.
AB - Prosody is a crucial speech feature in emotional text-to-speech (TTS), as different emotions have distinct prosodic characteristics. Existing works in emotional TTS have primarily utilized the emotion labels in a dataset by applying an auxiliary emotion classification loss to enhance emotional nuances in the model. However, this approach may only partially leverage the potential of emotion labels. Accordingly, this paper proposes a supervised contrastive approach to effectively utilize emotion labels and enable the model to distinguish the prosody of different emotions. Furthermore, this work also explores unsupervised contrastive learning for the case where emotion labels are missing in emotional TTS. In particular, the proposed TTS architecture enables cross-speaker emotion transfer, allowing for accurate speech generation even without specific prosody from a target speaker. The experimental results on emotional datasets demonstrate the effectiveness of the proposed method.
KW - Emotional text-to-speech
KW - contrastive learning
KW - cross-speaker speech synthesis
UR - http://www.scopus.com/inward/record.url?scp=85215705264&partnerID=8YFLogxK
U2 - 10.1109/O-COCOSDA64382.2024.10800372
DO - 10.1109/O-COCOSDA64382.2024.10800372
M3 - Conference contribution
AN - SCOPUS:85215705264
T3 - 2024 27th Conference on the Oriental COCOSDA International Committee for the Co-Ordination and Standardisation of Speech Databases and Assessment Techniques, O-COCOSDA 2024 - Proceedings
BT - 2024 27th Conference on the Oriental COCOSDA International Committee for the Co-Ordination and Standardisation of Speech Databases and Assessment Techniques, O-COCOSDA 2024 - Proceedings
A2 - Su, Ming-Hsiang
A2 - Yeh, Jui-Feng
A2 - Liao, Yuan-Fu
A2 - Lee, Chi-Chun
A2 - Tsao, Yu
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 27th Conference on the Oriental COCOSDA International Committee for the Co-Ordination and Standardisation of Speech Databases and Assessment Techniques, O-COCOSDA 2024
Y2 - 17 October 2024 through 19 October 2024
ER -