TY - GEN
T1 - MENTOR
T2 - 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2023
AU - Lin, Hsin-Ju
AU - Chung, Tsu-Chun
AU - Hsiao, Ching-Chun
AU - Chen, Pin-Yu
AU - Chiu, Wei-Chen
AU - Huang, Ching-Chun
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
AB - Text detection is frequently used by vision-based mobile robots that must interpret text in their surroundings to perform a given task. For instance, delivery robots in multilingual cities must be capable of multilingual text detection so that they can read traffic signs and road markings. Moreover, the target languages change from region to region, implying the need to efficiently re-train models to recognize novel languages. However, collecting and labeling training data for novel languages is cumbersome, and the effort required to re-train an existing text detector is considerable. Even worse, this routine repeats whenever a novel language appears. This motivates us to propose a new problem setting that tackles these challenges more efficiently: we ask for a generalizable multilingual text detection framework that detects and identifies both seen- and unseen-language regions in scene images, without requiring supervised training data for unseen languages or model re-training. To this end, we propose 'MENTOR', the first work to realize a learning strategy between zero-shot and few-shot learning for multilingual scene text detection. During the training phase, we leverage 'zero-cost' synthesized printed texts and the available seen languages to learn a meta-mapping from printed texts to language-specific kernel weights. Meanwhile, dynamic convolution networks guided by the language-specific kernels are trained to realize a detection-by-feature-matching scheme. In the inference phase, 'zero-cost' printed texts are synthesized for a new target language. By utilizing the learned meta-mapping and the matching network, MENTOR can identify the text regions of the new language without re-training. Experiments show that our model achieves results comparable to supervised methods on seen languages and outperforms other methods in detecting unseen languages.
UR - http://www.scopus.com/inward/record.url?scp=85182523358&partnerID=8YFLogxK
U2 - 10.1109/IROS55552.2023.10342419
DO - 10.1109/IROS55552.2023.10342419
M3 - Conference contribution
AN - SCOPUS:85182523358
T3 - IEEE International Conference on Intelligent Robots and Systems
SP - 3248
EP - 3255
BT - 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2023
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 1 October 2023 through 5 October 2023
ER -