Abstract
Deep learning based speech emotion recognition (SER) models have shown impressive results in controlled environments, but their performance degrades significantly in noisy conditions. This paper proposes a robust two-stream SER model that combines spectro-temporal modulation features with conventional acoustic features. Experiments were conducted on German (EMODB) and English (RAVDESS) datasets using the clean-train, noisy-test paradigm. The results demonstrate that spectro-temporal modulation features offer superior robustness in noisy conditions compared with conventional acoustic features such as MFCCs and time-frequency features from Mel-spectrograms. Additionally, we analyze the weights assigned to modulation features and show that, for robust SER, the model emphasizes the contours of formants and harmonics, which are crucial cues for speech perception in noise. Incorporating the spectro-temporal modulation stream not only enhances the robustness of the model but also provides deeper insights into the task of SER in noise.
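The abstract names a two-stream architecture but does not spell out its layers, so the following is only a minimal PyTorch sketch of one plausible late-fusion design, not the paper's actual model. The feature dimensions (`n_acoustic=40`, `n_modulation=128`), the layer sizes, and the mean-pooling over time are illustrative assumptions; the seven-class output matches EMODB's seven emotion categories.

```python
import torch
import torch.nn as nn

class TwoStreamSER(nn.Module):
    """Hypothetical late-fusion two-stream SER classifier: one stream for
    conventional acoustic features (e.g. MFCCs), one for spectro-temporal
    modulation features. All sizes are illustrative placeholders."""

    def __init__(self, n_acoustic=40, n_modulation=128, n_emotions=7):
        super().__init__()
        # Stream 1: frame-level conventional acoustic features
        self.acoustic = nn.Sequential(
            nn.Linear(n_acoustic, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        # Stream 2: frame-level spectro-temporal modulation features
        self.modulation = nn.Sequential(
            nn.Linear(n_modulation, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        # Fusion head over the concatenated stream embeddings
        self.head = nn.Linear(64 + 64, n_emotions)

    def forward(self, acoustic_feats, modulation_feats):
        # Mean-pool over the time axis for utterance-level embeddings
        a = self.acoustic(acoustic_feats).mean(dim=1)
        m = self.modulation(modulation_feats).mean(dim=1)
        return self.head(torch.cat([a, m], dim=-1))

# Toy usage: a batch of 8 utterances, 200 frames each
model = TwoStreamSER()
logits = model(torch.randn(8, 200, 40), torch.randn(8, 200, 128))
print(logits.shape)  # torch.Size([8, 7])
```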
Original language | English
---|---
Journal | IEEE Transactions on Affective Computing
DOIs | 
Publication status | Accepted/In press - 2025