TY - GEN
T1 - Hardware Accelerator for MobileViT Vision Transformer with Reconfigurable Computation
AU - Hsiao, Shen Fu
AU - Chao, Tzu Hsien
AU - Yuan, Yen Che
AU - Chen, Kun Chih
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - With the great success of the Transformer model in Natural Language Processing (NLP), the Vision Transformer (ViT) was proposed, achieving performance comparable to traditional Convolutional Neural Network (CNN) models in tasks such as image classification and object detection. This paper focuses on the acceleration of a new lightweight hybrid model, named MobileViT, which has lower computational complexity and higher accuracy than ViT and CNN-based lightweight models such as MobileNets. We introduce an adaptive systolic array (SA) design with a flexible array shape, called LEGO SA, that improves the efficiency of hardware utilization and memory accesses during standard convolution, Depth-wise Separable Convolution (DWC), and self-attention operations. Furthermore, the matrix transpose in self-attention is implemented efficiently, significantly reducing wasted execution time, memory buffers, and power consumption. The proposed MobileViT hardware accelerator with 112 KB of on-chip buffers occupies an area of just 1.64 mm^2 in a TSMC 40 nm process and achieves a performance of 1.2 TOPS at 600 MHz with an energy efficiency of 5.34 TOPS/W.
AB - With the great success of the Transformer model in Natural Language Processing (NLP), the Vision Transformer (ViT) was proposed, achieving performance comparable to traditional Convolutional Neural Network (CNN) models in tasks such as image classification and object detection. This paper focuses on the acceleration of a new lightweight hybrid model, named MobileViT, which has lower computational complexity and higher accuracy than ViT and CNN-based lightweight models such as MobileNets. We introduce an adaptive systolic array (SA) design with a flexible array shape, called LEGO SA, that improves the efficiency of hardware utilization and memory accesses during standard convolution, Depth-wise Separable Convolution (DWC), and self-attention operations. Furthermore, the matrix transpose in self-attention is implemented efficiently, significantly reducing wasted execution time, memory buffers, and power consumption. The proposed MobileViT hardware accelerator with 112 KB of on-chip buffers occupies an area of just 1.64 mm^2 in a TSMC 40 nm process and achieves a performance of 1.2 TOPS at 600 MHz with an energy efficiency of 5.34 TOPS/W.
KW - Convolution
KW - deep neural network hardware accelerator
KW - MobileViT
KW - Self-attention
KW - Systolic Array
KW - Vision Transformer (ViT)
UR - http://www.scopus.com/inward/record.url?scp=85198503134&partnerID=8YFLogxK
U2 - 10.1109/ISCAS58744.2024.10558190
DO - 10.1109/ISCAS58744.2024.10558190
M3 - Conference contribution
AN - SCOPUS:85198503134
T3 - Proceedings - IEEE International Symposium on Circuits and Systems
BT - ISCAS 2024 - IEEE International Symposium on Circuits and Systems
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE International Symposium on Circuits and Systems, ISCAS 2024
Y2 - 19 May 2024 through 22 May 2024
ER -