Spectro-Temporal Modulations Incorporated Two-Stream Robust Speech Emotion Recognition

Yih Liang Shen*, Pei Chin Hsieh, Tai Shih Chi

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Deep learning based speech emotion recognition (SER) models have shown impressive results in controlled environments, but their performance degrades significantly in noisy conditions. This paper proposes a robust two-stream SER model that combines spectro-temporal modulation features with conventional acoustic features. Experiments were conducted on German (EMODB) and English (RAVDESS) datasets using the clean-train, noisy-test paradigm. The results demonstrate that spectro-temporal modulation features offer superior robustness in noisy conditions compared with conventional acoustic features such as MFCCs and time-frequency features derived from Mel-spectrograms. Additionally, we analyze the weights assigned to the modulation features and demonstrate that, for robust SER, the model emphasizes the contours of formants and harmonics, which are crucial cues for speech perception in noise. Incorporating the stream of spectro-temporal modulations not only enhances the robustness of the model but also provides deeper insight into the task of SER in noise.
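
To make the two-stream design concrete, here is a minimal sketch in Python. It is not the authors' implementation: the modulation stream is approximated by a 2-D FFT of the log-Mel spectrogram (the paper derives its spectro-temporal modulation features from an auditory model), the fusion network ToyTwoStreamSER is a hypothetical stand-in for the actual architecture, and the seven output classes simply mirror EMODB's emotion categories.

    import numpy as np
    import librosa
    import torch
    import torch.nn as nn

    def modulation_features(y, sr, n_mels=64):
        # Log-Mel spectrogram: (frequency, time).
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        log_S = np.log(S + 1e-8)
        # 2-D Fourier transform over (frequency, time) yields a joint
        # spectral-modulation x temporal-modulation representation.
        M = np.abs(np.fft.fft2(log_S))
        # Crop to low-order modulations; the 2-D spectrum of a real-valued
        # input is conjugate-symmetric, so roughly half of it is redundant.
        return M[: n_mels // 2, : M.shape[1] // 2]

    def mfcc_features(y, sr, n_mfcc=40):
        # Conventional acoustic stream: MFCCs, shape (n_mfcc, time).
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    class ToyTwoStreamSER(nn.Module):
        # Hypothetical late-fusion network: two parallel encoders whose
        # embeddings are concatenated before a shared classifier head.
        def __init__(self, mod_dim, mfcc_dim, n_classes=7):  # 7 = EMODB emotion categories
            super().__init__()
            self.mod_stream = nn.Sequential(nn.Linear(mod_dim, 128), nn.ReLU())
            self.mfcc_stream = nn.Sequential(nn.Linear(mfcc_dim, 128), nn.ReLU())
            self.head = nn.Linear(256, n_classes)

        def forward(self, mod_x, mfcc_x):
            z = torch.cat([self.mod_stream(mod_x), self.mfcc_stream(mfcc_x)], dim=-1)
            return self.head(z)

    if __name__ == "__main__":
        sr = 16000
        y = np.random.randn(sr).astype(np.float32)      # 1 s of noise as a stand-in utterance
        mod = modulation_features(y, sr).mean(axis=1)   # pool over the temporal-modulation axis
        mfcc = mfcc_features(y, sr).mean(axis=1)        # pool over time
        model = ToyTwoStreamSER(mod_dim=mod.size, mfcc_dim=mfcc.size)
        logits = model(torch.from_numpy(mod).float().unsqueeze(0),
                       torch.from_numpy(mfcc).float().unsqueeze(0))
        print(logits.shape)                             # torch.Size([1, 7])

Under the clean-train, noisy-test paradigm described in the abstract, such a model would be trained on clean utterances only and evaluated on noise-corrupted ones; the paper's weight analysis indicates that robustness comes from the modulation stream emphasizing the contours of formants and harmonics.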

Original language: English
Journal: IEEE Transactions on Affective Computing
State: Accepted/In press - 2025

Keywords

  • Speech emotion recognition
  • auditory model
  • spectro-temporal modulation
