TY - GEN
T1 - Anime Character Recognition using Intermediate Features Aggregation
AU - Rios, Edwin Arkel
AU - Hu, Min Chun
AU - Lai, Bo Cheng
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - In this work we study the problem of anime character recognition. Anime refers to animation produced within Japan and to works derived from or inspired by it. We propose a novel Intermediate Features Aggregation classification head, which helps smooth the optimization landscape of Vision Transformers (ViTs) by adding skip connections between intermediate layers and the classification head, thereby improving relative classification accuracy by up to 28%. The proposed model, named Animesion, is the first end-to-end framework for large-scale anime character recognition. We conduct extensive experiments using a variety of classification models, including CNNs and self-attention-based ViTs. We also adapt the multimodal variant, the Vision-Language Transformer (ViLT), to incorporate external tag data for classification without additional multimodal pre-training. Through our results we obtain new insights into how hyperparameters such as input sequence length, mini-batch size, and architectural variations affect the transfer learning performance of Vi(L)Ts.
AB - In this work we study the problem of anime character recognition. Anime refers to animation produced within Japan and to works derived from or inspired by it. We propose a novel Intermediate Features Aggregation classification head, which helps smooth the optimization landscape of Vision Transformers (ViTs) by adding skip connections between intermediate layers and the classification head, thereby improving relative classification accuracy by up to 28%. The proposed model, named Animesion, is the first end-to-end framework for large-scale anime character recognition. We conduct extensive experiments using a variety of classification models, including CNNs and self-attention-based ViTs. We also adapt the multimodal variant, the Vision-Language Transformer (ViLT), to incorporate external tag data for classification without additional multimodal pre-training. Through our results we obtain new insights into how hyperparameters such as input sequence length, mini-batch size, and architectural variations affect the transfer learning performance of Vi(L)Ts.
UR - http://www.scopus.com/inward/record.url?scp=85142491208&partnerID=8YFLogxK
U2 - 10.1109/ISCAS48785.2022.9937519
DO - 10.1109/ISCAS48785.2022.9937519
M3 - Conference contribution
AN - SCOPUS:85142491208
T3 - Proceedings - IEEE International Symposium on Circuits and Systems
SP - 424
EP - 428
BT - IEEE International Symposium on Circuits and Systems, ISCAS 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 IEEE International Symposium on Circuits and Systems, ISCAS 2022
Y2 - 27 May 2022 through 1 June 2022
ER -