TY - JOUR
T1 - Comparing the performance of classic voice-driven assistive systems for dysarthric speech
AU - Zheng, Wei Zhong
AU - Han, Ji Yan
AU - Cheng, Hsiu Lien
AU - Chu, Wei Chung
AU - Chen, Ko Chiang
AU - Lai, Ying Hui
N1 - Publisher Copyright: © 2022 Elsevier Ltd
PY - 2023/3
Y1 - 2023/3
AB - Voice-driven communication assistive systems, namely speech enhancement (SE), voice conversion (VC), and automatic speech recognition with text-to-speech (ASR-TTS), are recognized approaches for improving the speech intelligibility of dysarthric speakers. However, it is unclear which approach performs better for patients with moderate dysarthria. This study compared the benefits of these three classic, different types of voice-driven assistive systems for the speech intelligibility of dysarthric patients under identical test conditions. Fourteen patients with mild-to-severe dysarthria and five speakers with normal speech were invited to record the training sets for these systems. Five patients with moderate dysarthria were selected to record two additional testing sets, which were used to evaluate the systems' benefits. Evaluation metrics from Google Automatic Speech Recognition (Google ASR) and listening tests were used to assess each system's speech intelligibility and quality. The speech intelligibility results produced by Google ASR were 7.0%, 22.9%, and 93.8% for the SE, VC, and ASR-TTS systems, respectively. In the listening test, speech intelligibility was 38.7%, 40.5%, and 95.5%, and speech quality scores were 1.81, 2.18, and 4.56, for the SE, VC, and ASR-TTS systems, respectively. The ASR-TTS system performed better than the SE and VC systems. Furthermore, t-distributed stochastic neighbor embedding (t-SNE) analysis was used to further compare the differences between the systems. The t-SNE results indicated that the phonetic posteriorgram features of the ASR-TTS system provided more stable performance than the other speech features (log-power spectrum and spectra) used in the SE and VC systems. These results suggest that ASR-TTS is a promising system for improving the speech intelligibility and quality of patients with moderate dysarthria in future applications.
KW - Deep learning
KW - Dysarthria
KW - Speech intelligibility
KW - Voice-driven assistive
UR - http://www.scopus.com/inward/record.url?scp=85144088019&partnerID=8YFLogxK
U2 - 10.1016/j.bspc.2022.104447
DO - 10.1016/j.bspc.2022.104447
M3 - Article
AN - SCOPUS:85144088019
SN - 1746-8094
VL - 81
JO - Biomedical Signal Processing and Control
JF - Biomedical Signal Processing and Control
M1 - 104447
ER -