TY - GEN
T1 - An SoC Integration Ready VLIW-Driven CNN Accelerator with High Utilization and Scalability
AU - Hu, Chia Heng
AU - Tseng, I. Hao
AU - Kuo, Pei Hsuan
AU - Huang, Juinn Dar
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
AB - In this paper, a highly scalable VLIW-driven CNN accelerator architecture is proposed. A new VLIW instruction, which specifies all settings of an entire convolution layer and natively supports layer concatenation, is defined. A multi-mode input aligner (MMIA) is developed to efficiently organize input data for various convolution modes. A zero-initial-latency (ZIL) buffer is created to further boost performance. A strip-based dataflow is adopted to drastically minimize external DRAM accesses. The accelerator is also equipped with an AXI4 on-chip bus interface, an instruction queue, and ping-pong DRAM I/O buffers, and is thus ready for efficient and easy SoC integration. An accelerator instance with 576 MACs has been implemented using a TSMC 40nm process. The core logic requires only 490K gates and the total internal memory size is merely 286KB. The peak performance is 1440 GOPS at 1.25GHz and the core power efficiency is 8.71 TOPS/W. Moreover, the proposed accelerator has also enabled a real-time image semantic segmentation system for autonomous driving on an FPGA platform.
KW - convolutional neural network (CNN)
KW - hardware accelerator
KW - high performance
KW - low power
KW - SoC integration ready
KW - very long instruction word (VLIW)
UR - http://www.scopus.com/inward/record.url?scp=85139036374&partnerID=8YFLogxK
U2 - 10.1109/AICAS54282.2022.9870010
DO - 10.1109/AICAS54282.2022.9870010
M3 - Conference contribution
AN - SCOPUS:85139036374
T3 - Proceeding - IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2022
SP - 246
EP - 249
BT - Proceeding - IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 4th IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2022
Y2 - 13 June 2022 through 15 June 2022
ER -