TY - JOUR
T1 - Consonant classification in Mandarin based on the depth image feature
T2 - 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019
AU - Hsieh, Han Chi
AU - Zheng, Wei Zhong
AU - Chen, Ko Chiang
AU - Lai, Ying Hui
N1 - Publisher Copyright: Copyright © 2019 ISCA
PY - 2019
Y1 - 2019
N2 - Consonants are an important element of Mandarin, and different categories of consonants produce different facial movements. Specifically, the facial muscles change during speech, and these changes are closely related to pronunciation; the facial muscles are linked to the hidden articulators, and the resulting facial changes are three-dimensional. However, most studies use 2D images to analyze facial features during speech. 2D images provide information in only two dimensions (x- and y-axes), so subtle depth motions (z-axis changes) of the facial muscles during speech can be difficult to detect accurately. Hence, this study used depth features of the face (point cloud features), recorded by a time-of-flight 3D camera, to investigate their potential for consonant recognition. We propose an algorithm that recognizes the seven categories of Mandarin consonants from the depth features of the speaker's face. The proposed system yielded suitable classification accuracy across the seven categories. This result implies that depth features can be used in speech-processing applications.
AB - Consonants are an important element of Mandarin, and different categories of consonants produce different facial movements. Specifically, the facial muscles change during speech, and these changes are closely related to pronunciation; the facial muscles are linked to the hidden articulators, and the resulting facial changes are three-dimensional. However, most studies use 2D images to analyze facial features during speech. 2D images provide information in only two dimensions (x- and y-axes), so subtle depth motions (z-axis changes) of the facial muscles during speech can be difficult to detect accurately. Hence, this study used depth features of the face (point cloud features), recorded by a time-of-flight 3D camera, to investigate their potential for consonant recognition. We propose an algorithm that recognizes the seven categories of Mandarin consonants from the depth features of the speaker's face. The proposed system yielded suitable classification accuracy across the seven categories. This result implies that depth features can be used in speech-processing applications.
KW - Consonant classification
KW - Deep learning
KW - Depth image
KW - Point cloud
UR - http://www.scopus.com/inward/record.url?scp=85074712577&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2019-1893
DO - 10.21437/Interspeech.2019-1893
M3 - Conference article
AN - SCOPUS:85074712577
SN - 2308-457X
VL - 2019-September
SP - 2300
EP - 2304
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Y2 - 15 September 2019 through 19 September 2019
ER -