TY - GEN
T1 - Design Exploration of An Energy-Efficient Acceleration System for CNNs on Low-Cost Resource-Constraint SoC-FPGAs
AU - Wen, Shao Cheng
AU - Huang, Po Tsang
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Deep convolutional neural networks (CNNs) require enormous computation capacity, large numbers of memory accesses, and substantial data movement among parallel processing elements (PEs). From an energy perspective, CNNs are difficult to deploy fully on low-cost, resource-constrained edge devices because their workloads are both memory-intensive and computation-intensive. In this paper, energy-efficient software/hardware co-design is explored for CNN acceleration on a resource-constrained Xilinx SoC-FPGA device. The acceleration system is optimized under the constraints of DRAM bandwidth, BRAM resources, computing resources, optimal frequency, and wire-routing complexity. Moreover, efficient workload distribution and dataflow control are implemented in both software and hardware to maximize resource utilization. On a low-cost Xilinx Zynq XC7Z020 SoC-FPGA device, the proposed acceleration system achieves throughputs of 4.3 frames/s for VGG16 and 21 frames/s for YOLOv3-tiny, with energy efficiencies of 34 GOPS/W and 38.9 GOPS/W, respectively. Compared with other state-of-the-art designs on resource-constrained SoC-FPGA devices, the proposed acceleration system achieves the best energy efficiency with high resource utilization.
AB - Deep convolutional neural networks (CNNs) require enormous computation capacity, large numbers of memory accesses, and substantial data movement among parallel processing elements (PEs). From an energy perspective, CNNs are difficult to deploy fully on low-cost, resource-constrained edge devices because their workloads are both memory-intensive and computation-intensive. In this paper, energy-efficient software/hardware co-design is explored for CNN acceleration on a resource-constrained Xilinx SoC-FPGA device. The acceleration system is optimized under the constraints of DRAM bandwidth, BRAM resources, computing resources, optimal frequency, and wire-routing complexity. Moreover, efficient workload distribution and dataflow control are implemented in both software and hardware to maximize resource utilization. On a low-cost Xilinx Zynq XC7Z020 SoC-FPGA device, the proposed acceleration system achieves throughputs of 4.3 frames/s for VGG16 and 21 frames/s for YOLOv3-tiny, with energy efficiencies of 34 GOPS/W and 38.9 GOPS/W, respectively. Compared with other state-of-the-art designs on resource-constrained SoC-FPGA devices, the proposed acceleration system achieves the best energy efficiency with high resource utilization.
UR - http://www.scopus.com/inward/record.url?scp=85139051420&partnerID=8YFLogxK
U2 - 10.1109/AICAS54282.2022.9869955
DO - 10.1109/AICAS54282.2022.9869955
M3 - Conference contribution
AN - SCOPUS:85139051420
T3 - Proceeding - IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2022
SP - 234
EP - 237
BT - Proceeding - IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 4th IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2022
Y2 - 13 June 2022 through 15 June 2022
ER -