TY - JOUR
T1 - Prosody modeling of spontaneous Mandarin speech and its application to automatic speech recognition
AU - Lin, Cheng Hsien
AU - Wu, Meng Chian
AU - You, Chung Long
AU - Chiang, Chen Yu
AU - Wang, Yih-Ru
AU - Chen, Sin-Horng
N1 - Publisher Copyright: © 2016, International Speech Communications Association. All rights reserved.
PY - 2016
Y1 - 2016
N2 - A prosody-assisted ASR approach for spontaneous Mandarin speech is proposed. It employs a previously proposed joint prosody labeling and modeling algorithm to construct a hierarchical prosodic model (HPM), which is then used in two-stage speech recognition. A word lattice is first generated by the HMM method using a tri-phone AM and a bigram LM. The lattice is then extended by replacing the bigram LM with a trigram model. In the second stage, a rescoring process sequentially adds the factor POS and PM LMs and the HPM. The method is evaluated on the MCDC database, which comprises 8 dialogues by 16 speakers with a total length of 9.09 hours. Syllable/character/word error rates were reduced from 35.6/40.2/45.1% for the baseline trigram HMM method to 32.4/36.5/41.8% for the proposed method. The improvement is reasonably good considering the WER upper bound of 13.4% for the word lattice owing to the high OOV rate of the database. Error analysis shows that many tone recognition errors and word segmentation errors were corrected. In addition, the recognizer outputs further information about each test utterance: the POS of each word, the PMs, the tone of each syllable, the break type of each syllable juncture, and the prosodic state of each syllable.
AB - A prosody-assisted ASR approach for spontaneous Mandarin speech is proposed. It employs a previously proposed joint prosody labeling and modeling algorithm to construct a hierarchical prosodic model (HPM), which is then used in two-stage speech recognition. A word lattice is first generated by the HMM method using a tri-phone AM and a bigram LM. The lattice is then extended by replacing the bigram LM with a trigram model. In the second stage, a rescoring process sequentially adds the factor POS and PM LMs and the HPM. The method is evaluated on the MCDC database, which comprises 8 dialogues by 16 speakers with a total length of 9.09 hours. Syllable/character/word error rates were reduced from 35.6/40.2/45.1% for the baseline trigram HMM method to 32.4/36.5/41.8% for the proposed method. The improvement is reasonably good considering the WER upper bound of 13.4% for the word lattice owing to the high OOV rate of the database. Error analysis shows that many tone recognition errors and word segmentation errors were corrected. In addition, the recognizer outputs further information about each test utterance: the POS of each word, the PMs, the tone of each syllable, the break type of each syllable juncture, and the prosodic state of each syllable.
KW - Prosody modeling
KW - Prosody-assisted ASR
KW - Spontaneous Mandarin speech
UR - http://www.scopus.com/inward/record.url?scp=84982890648&partnerID=8YFLogxK
U2 - 10.21437/SpeechProsody.2016-212
DO - 10.21437/SpeechProsody.2016-212
M3 - Conference article
AN - SCOPUS:84982890648
SN - 2333-2042
VL - 2016-January
SP - 1034
EP - 1037
JO - Proceedings of the International Conference on Speech Prosody
JF - Proceedings of the International Conference on Speech Prosody
T2 - 8th Speech Prosody 2016
Y2 - 31 May 2016 through 3 June 2016
ER -