A Preliminary Study on Taiwanese OCR for Assisting Textual Database Construction from Historical Documents

Yuan Fu Liao*, Yu Hsuan Huang, Matus Pleva, Daniel Hladek, Ming Hsiang Su

*Corresponding author for this work

Research output: Conference contribution, peer-reviewed

Abstract

Currently, there is not enough Taiwanese text available to build a proper language model (LM) to support the construction of emerging Taiwanese automatic speech recognition (ASR) and text-to-speech (TTS) systems. Therefore, this paper reports the first Taiwanese optical character recognition (OCR) [1, 2, 3] system to assist human annotators in converting a vast collection of scanned images of Taiwanese historical documents preserved in the 'Memory of the Written Taiwanese' (MoWT) website [4] into a usable textual database for building state-of-the-art Taiwanese ASR and TTS systems in the future. Supplementary information and replication materials are available on GitHub [5].

Original language: English
Title of host publication: 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022
Editors: Kong Aik Lee, Hung-yi Lee, Yanfeng Lu, Minghui Dong
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 270-274
Number of pages: 5
ISBN (electronic): 9798350397963
Publication status: Published - 2022
Event: 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022 - Singapore, Singapore
Duration: 11 Dec 2022 to 14 Dec 2022

Publication series

Name: 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022

Conference

Conference: 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022
Country/Territory: Singapore
City: Singapore
Period: 11/12/22 to 14/12/22
