TY - GEN
T1 - A 28nm 343.5fps/W Vision Transformer Accelerator with Integer-Only Quantized Attention Block
AU - Lin, Cheng Chen
AU - Lu, Wei
AU - Huang, Po Tsang
AU - Chen, Hung Ming
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Vision Transformers (ViT) have achieved state-of-the-art performance on various computer vision tasks. For mobile/edge devices, energy efficiency is the most important issue. However, ViT requires huge computation and storage, which makes it difficult to deploy on mobile/edge devices. In this work, we improve the efficiency of ViT inference at both the algorithm level and the hardware level. At the algorithm level, we propose an energy-efficient ViT model that adopts 4-bit quantization and low-rank approximation to convert all the non-linear functions with floating-point (FP) values in Multi-Head Attention (MHA) into linear functions with integer (INT) values, reducing the computation and storage overhead. The accuracy drop compared with full precision is small (<1.5%). At the hardware level, we design an energy-efficient row-based pipelined ViT accelerator for on-device inference. The proposed accelerator consists of an integer-only quantizer, an integer MAC PE array that executes quantization and matrix operations, and an approximated linear block that executes the low-rank approximation. To the best of our knowledge, this is the first ViT accelerator that uses 4-bit quantization and a dedicated quantizer to perform integer-only quantization for on-device inference. This work achieves an energy efficiency of 343.5 fps/W, improving energy efficiency by up to 8x compared to state-of-the-art works.
AB - Vision Transformers (ViT) have achieved state-of-the-art performance on various computer vision tasks. For mobile/edge devices, energy efficiency is the most important issue. However, ViT requires huge computation and storage, which makes it difficult to deploy on mobile/edge devices. In this work, we improve the efficiency of ViT inference at both the algorithm level and the hardware level. At the algorithm level, we propose an energy-efficient ViT model that adopts 4-bit quantization and low-rank approximation to convert all the non-linear functions with floating-point (FP) values in Multi-Head Attention (MHA) into linear functions with integer (INT) values, reducing the computation and storage overhead. The accuracy drop compared with full precision is small (<1.5%). At the hardware level, we design an energy-efficient row-based pipelined ViT accelerator for on-device inference. The proposed accelerator consists of an integer-only quantizer, an integer MAC PE array that executes quantization and matrix operations, and an approximated linear block that executes the low-rank approximation. To the best of our knowledge, this is the first ViT accelerator that uses 4-bit quantization and a dedicated quantizer to perform integer-only quantization for on-device inference. This work achieves an energy efficiency of 343.5 fps/W, improving energy efficiency by up to 8x compared to state-of-the-art works.
KW - Integer-Only Quantization
KW - Low-Rank Approximation
KW - On-Device Inference
KW - Vision Transformer (ViT)
UR - http://www.scopus.com/inward/record.url?scp=85199860406&partnerID=8YFLogxK
U2 - 10.1109/AICAS59952.2024.10595969
DO - 10.1109/AICAS59952.2024.10595969
M3 - Conference contribution
AN - SCOPUS:85199860406
T3 - 2024 IEEE 6th International Conference on AI Circuits and Systems, AICAS 2024 - Proceedings
SP - 80
EP - 84
BT - 2024 IEEE 6th International Conference on AI Circuits and Systems, AICAS 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 6th IEEE International Conference on AI Circuits and Systems, AICAS 2024
Y2 - 22 April 2024 through 25 April 2024
ER -