An NN-based approach to prosodic information generation for synthesizing English words embedded in Chinese text

Wei Chih Kuo, Li Feng Lin, Yih-Ru Wang, Sin-Horng Chen*

*Corresponding author for this work

Research output: Contribution to conference (Paper, peer-reviewed)

1 Scopus citation

Abstract

This paper presents a neural-network-based approach to generating prosodic information for spelling or reading English words embedded in Chinese text. It extends an existing RNN-based prosodic information generator for Mandarin TTS into an RNN-MLP scheme for mixed Mandarin-English TTS. The scheme first treats each English word as a Chinese word and uses the RNN, trained for Mandarin TTS, to generate initial prosodic information for each syllable of the English word. It then refines this initial prosodic information with additional MLPs. The resulting prosodic information is expected to be appropriate for English-word synthesis and to match that of the surrounding Mandarin speech. In open tests, the proposed RNN-MLP scheme achieved RMSEs, for English word spelling/reading respectively, of 41.8/78.2 ms for synthesized syllable duration, 30.8/26 ms for inter-syllable pause duration, 0.65/0.45 ms/frame for pitch contour, and 3.06/4.9 dB for energy level. These results suggest that the approach is promising.
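The two-stage flow described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function and class names, the placeholder first stage standing in for the Mandarin-trained RNN, the MLP size and random weights, the four-value feature layout, and the choice to treat the MLP output as an additive correction. Only the overall structure (initial per-syllable estimate, MLP refinement, RMSE evaluation) follows the abstract.

```python
import math
import random


def initial_prosody(syllables):
    """Stage 1 stand-in (hypothetical): the paper uses an RNN trained for
    Mandarin TTS. Here each syllable gets a fixed placeholder vector of
    [duration_ms, pause_ms, pitch, energy_db]."""
    return [[200.0, 30.0, 5.0, 60.0] for _ in syllables]


class TinyMLP:
    """One-hidden-layer MLP with tanh units. Sizes and random weights are
    illustrative only; a real system would train them on speech data."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                   for _ in range(n_out)]

    def forward(self, x):
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x)))
                  for row in self.w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]


def refine(initial, mlp):
    """Stage 2: add the MLP output to the initial estimate.
    The additive form is an assumption; the abstract only says 'refines'."""
    return [[a + d for a, d in zip(vec, mlp.forward(vec))] for vec in initial]


def rmse(predicted, target):
    """Root-mean-square error between two equal-length sequences."""
    squared = sum((p - t) ** 2 for p, t in zip(predicted, target))
    return math.sqrt(squared / len(predicted))


if __name__ == "__main__":
    syllables = ["ei", "bi", "si"]  # hypothetical spelling of "ABC"
    refined = refine(initial_prosody(syllables), TinyMLP(4, 3, 4))
    durations = [vec[0] for vec in refined]
    # Compare against made-up reference durations, purely to show the metric.
    print(rmse(durations, [210.0, 190.0, 205.0]))
```

In this sketch the same `rmse` helper would be applied separately to each prosodic quantity, mirroring how the abstract reports one error figure each for duration, pause, pitch, and energy.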

Original language: English
Pages: 3109-3112
Number of pages: 4
State: Published - Sep 2003
Event: 8th European Conference on Speech Communication and Technology, EUROSPEECH 2003 - Geneva, Switzerland
Duration: 1 Sep 2003 - 4 Sep 2003

Conference

Conference: 8th European Conference on Speech Communication and Technology, EUROSPEECH 2003
Country/Territory: Switzerland
City: Geneva
Period: 1/09/03 - 4/09/03
