A Preliminary Study on Taiwanese OCR for Assisting Textual Database Construction from Historical Documents

Yuan Fu Liao*, Yu Hsuan Huang, Matus Pleva, Daniel Hladek, Ming Hsiang Su

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Currently, there is not enough Taiwanese text available to build a proper language model (LM) to support the construction of emerging Taiwanese automatic speech recognition (ASR) and text-to-speech (TTS) systems. Therefore, this paper reports the first Taiwanese optical character recognition (OCR) [1, 2, 3] system to assist human annotators in converting a vast collection of scanned images of Taiwanese historical documents preserved in the 'Memory of the Written Taiwanese' (MoWT) website [4] into a usable textual database for building state-of-the-art Taiwanese ASR and TTS systems in the future. Supplementary information and replication materials are available on GitHub [5].

Original language: English
Title of host publication: 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022
Editors: Kong Aik Lee, Hung-yi Lee, Yanfeng Lu, Minghui Dong
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 270-274
Number of pages: 5
ISBN (Electronic): 9798350397963
State: Published - 2022
Event: 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022 - Singapore, Singapore
Duration: 11 Dec 2022 → 14 Dec 2022

Publication series

Name: 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022

Conference

Conference: 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022
Country/Territory: Singapore
City: Singapore
Period: 11/12/22 → 14/12/22

Keywords

  • Optical Character Recognition
  • Taiwanese Text Corpus
  • Written Taiwanese
