TY - CPAPER
T1 - Energy-Efficient Accelerator Design with 3D-SRAM and Hierarchical Interconnection Architecture for Compact Sparse CNNs
AU - Lo, Chin Yang
AU - Huang, Po Tsang
AU - Hwang, Wei
PY - 2020/8
Y1 - 2020/8
AB - Deep learning applications are deployed to resource- and energy-constrained edge devices via compact and sparse CNN models. However, sparsity, feature sizes, and filter shapes vary widely across deep networks, resulting in inefficient resource utilization and data movement. In this paper, an energy-efficient accelerator for compact sparse CNNs is proposed, based on a flexible hierarchical on-chip interconnection architecture, 32 PE tiles, and 3D-SRAM. The 3D-SRAM is utilized as distributed memory for the PE tiles to hold intermediate data between layers, reducing the energy consumption of off-chip DRAM accesses. Based on the distributed 3D-SRAM, an output-stationary dataflow is adopted, avoiding movement of partial sums among PEs. The 32 PE tiles are connected through a configurable ring-based unicast global network with micro-routers, which lowers implementation cost compared to a typical router for a mesh network. Each PE tile is implemented with an all-to-all local network to support the widely varying sizes, shapes, and non-zero activation computations of compact sparse CNNs. Overall, the proposed accelerator achieves 509.8 inference/sec, 1860.5 inference/J, and 383.3 GOPS/W with MobileNetV2, and improves energy efficiency by a factor of 1.43x over a dense architecture.
UR - http://www.scopus.com/inward/record.url?scp=85084998487&partnerID=8YFLogxK
DO - 10.1109/AICAS48895.2020.9073944
M3 - Conference contribution
AN - SCOPUS:85084998487
T3 - Proceedings - 2020 IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2020
SP - 320
EP - 323
BT - Proceedings - 2020 IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2020
Y2 - 31 August 2020 through 2 September 2020
ER -