TY - GEN
T1 - A 28nm Energy-Area-Efficient Row-Based Pipelined Training Accelerator with Mixed FXP4/FP16 for On-Device Transfer Learning
AU - Lu, Wei
AU - Pei, Han Hsiang
AU - Yu, Jheng Rong
AU - Chen, Hung Ming
AU - Huang, Po Tsang
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Training deep neural networks (DNNs) requires significantly more computational capacity, more complex dataflow, more memory accesses and data movement among processing elements (PEs), and higher bit precision for backpropagation (BP) than DNN inference, which demands greater power and area overhead. For mobile/edge devices, energy and area efficiency are critical concerns. This research proposes a row-based pipelined DNN training accelerator that employs three techniques to improve energy and area efficiency for resource-constrained edge/mobile devices. The first technique freezes weight updates in the convolutional and batch normalization layers. The second technique decomposes the simulated quantization of convolutional layers and reorganizes the operations of batch normalization layers: a mathematical derivation shows that floating-point (FP) convolution operations can be completed with fixed-point (FXP) arithmetic, so FXP MACs followed by a dequantizer can replace the original FP MACs in the convolutional layers. In addition, a row-based FXP/FP pipelined training accelerator is designed to pipeline convolutional and batch normalization layers, increasing the utilization of both FXP and FP resources. The third technique uses multi-bank buffer management to prevent data conflicts and reduces on-chip buffer requirements by up to 3.5 times. The proposed accelerator was implemented in a TSMC 28nm CMOS process and achieves an energy efficiency of 2.19 TFLOPS/W and an area efficiency of 85.32 GFLOPS/mm², outperforming state-of-the-art works by 6.8 times in area efficiency and 3.7 times in energy efficiency.
AB - Training deep neural networks (DNNs) requires significantly more computational capacity, more complex dataflow, more memory accesses and data movement among processing elements (PEs), and higher bit precision for backpropagation (BP) than DNN inference, which demands greater power and area overhead. For mobile/edge devices, energy and area efficiency are critical concerns. This research proposes a row-based pipelined DNN training accelerator that employs three techniques to improve energy and area efficiency for resource-constrained edge/mobile devices. The first technique freezes weight updates in the convolutional and batch normalization layers. The second technique decomposes the simulated quantization of convolutional layers and reorganizes the operations of batch normalization layers: a mathematical derivation shows that floating-point (FP) convolution operations can be completed with fixed-point (FXP) arithmetic, so FXP MACs followed by a dequantizer can replace the original FP MACs in the convolutional layers. In addition, a row-based FXP/FP pipelined training accelerator is designed to pipeline convolutional and batch normalization layers, increasing the utilization of both FXP and FP resources. The third technique uses multi-bank buffer management to prevent data conflicts and reduces on-chip buffer requirements by up to 3.5 times. The proposed accelerator was implemented in a TSMC 28nm CMOS process and achieves an energy efficiency of 2.19 TFLOPS/W and an area efficiency of 85.32 GFLOPS/mm², outperforming state-of-the-art works by 6.8 times in area efficiency and 3.7 times in energy efficiency.
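N1 - Illustrative note: the abstract's second technique replaces FP convolution MACs with FXP MACs followed by a dequantizer. The Python sketch below is a minimal illustration of that decomposition, FP_conv(w, x) ≈ (s_w · s_x) · FXP_conv(q_w, q_x); it assumes a symmetric per-tensor quantizer and uses hypothetical names, not the paper's exact scheme.
  # Minimal sketch: FP MAC replaced by integer (FXP) MAC plus one FP
  # dequantization multiply. Scales and the quantizer are assumptions.
  import numpy as np

  def quantize(x, bits=4):
      # Symmetric per-tensor quantization to signed FXP with 'bits' bits.
      qmax = 2 ** (bits - 1) - 1
      peak = np.max(np.abs(x))
      scale = peak / qmax if peak > 0 else 1.0
      q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
      return q, scale

  def conv_fxp_then_dequant(w, x, bits=4):
      qw, sw = quantize(w, bits)
      qx, sx = quantize(x, bits)
      # FXP MACs: integer multiply-accumulate only (1-D dot as a conv stand-in).
      acc = np.dot(qw.astype(np.int64), qx.astype(np.int64))
      # Dequantizer: a single FP multiply restores the FP-domain result.
      return acc * (sw * sx)

  w = np.random.randn(16).astype(np.float32)
  x = np.random.randn(16).astype(np.float32)
  print(conv_fxp_then_dequant(w, x))  # approximates np.dot(w, x)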
KW - DNN training accelerator
KW - Frozen weights
KW - Multi-bank buffer management
KW - On-device transfer learning
KW - Simulated quantization
UR - http://www.scopus.com/inward/record.url?scp=85198555503&partnerID=8YFLogxK
U2 - 10.1109/ISCAS58744.2024.10558053
DO - 10.1109/ISCAS58744.2024.10558053
M3 - Conference contribution
AN - SCOPUS:85198555503
T3 - Proceedings - IEEE International Symposium on Circuits and Systems
BT - ISCAS 2024 - IEEE International Symposium on Circuits and Systems
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE International Symposium on Circuits and Systems, ISCAS 2024
Y2 - 19 May 2024 through 22 May 2024
ER -