TY - GEN
T1 - A Deep Learning Based Approach to Synthesize Intelligible Speech with Limited Temporal Envelope Information
AU - Hsiao, Ching Ju
AU - Chen, Fei
AU - Han, Ji Yan
AU - Zheng, Wei Zhong
AU - Lai, Ying Hui
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
AB - Envelope waveforms can be extracted from multiple frequency bands of a speech signal, and they carry important intelligibility information for human speech communication. This study aimed to investigate whether a deep learning-based model using temporal envelope features could synthesize intelligible speech, and to study the effect of reducing the number of temporal envelopes (from 8 to 2 in this work) on the intelligibility of the synthesized speech. The objective evaluation metric of short-time objective intelligibility (STOI) showed that, on average, the speech synthesized by the proposed approach provided higher STOI scores (i.e., 0.8) in each test condition, and the human listening test showed that the average word correct rate of eight listeners was higher than 97.5%. These findings indicate that the proposed deep learning-based system is a potential approach for synthesizing highly intelligible speech from limited envelope information in the future.
UR - http://www.scopus.com/inward/record.url?scp=85138127551&partnerID=8YFLogxK
U2 - 10.1109/EMBC48229.2022.9871247
DO - 10.1109/EMBC48229.2022.9871247
M3 - Conference contribution
C2 - 36086160
AN - SCOPUS:85138127551
T3 - Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS
SP - 1972
EP - 1976
BT - 44th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 44th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC 2022
Y2 - 11 July 2022 through 15 July 2022
ER -