TY - GEN
T1 - Speech Reconstruction from the Larynx Vibration Feature Captured by Laser-Doppler Vibrometer Sensor
AU - Lin, Yi Chieh
AU - Han, Ji Yan
AU - Lin, Yu Min
AU - Zheng, Wei Zhong
AU - Young, Shuenn Tsong
AU - Lai, Ying Hui
N1 - Publisher Copyright: © 2021 APSIPA.
PY - 2021
Y1 - 2021
AB - Many deep learning (DL)-based models use contact sensors (e.g., a throat microphone, TM) to reconstruct speech from the vibration signals of the larynx. The TM can obtain more robust speech information than an air-conducted microphone (ACM) sensor in noisy environments. However, it needs tight contact with the user's skin, which causes discomfort. Therefore, we assume that a non-contact sensor gives users a better experience. Following this concept, DL-based models with a non-contact sensor, a laser-Doppler vibrometer (LDV), are proposed to reconstruct speech from the vibration signals of the larynx. Notably, recognition and speech synthesis modules were adopted in the proposed system. The experimental results showed that, on average, the word error rate (WER) of the recognition module in the proposed system was similar to that obtained with the TM in both quiet and noisy testing conditions. Furthermore, the listening test showed that the synthesis module's reconstructed speech had a higher preference rate and naturalness than the original speech recorded by the LDV sensor. These results suggest that the proposed system, which applies DL technology to laryngeal vibration signals captured by a non-contact LDV sensor, is a potential approach to speech reconstruction.
UR - http://www.scopus.com/inward/record.url?scp=85126702404&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85126702404
T3 - 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021 - Proceedings
SP - 829
EP - 835
BT - 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021
Y2 - 14 December 2021 through 17 December 2021
ER -