TY - JOUR
T1 - ACNPU: A 4.75TOPS/W 1080P@30FPS Super Resolution Accelerator With Decoupled Asymmetric Convolution
T2 - IEEE Transactions on Circuits and Systems I: Regular Papers
AU - Yang, Tun Hao
AU - Chang, Tian Sheuan
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2024/2/1
Y1 - 2024/2/1
N2 - Deep learning-driven super resolution (SR) outperforms traditional techniques but also faces the challenge of high complexity and memory bandwidth. This challenge leads many accelerators to opt for simpler and shallower models like FSRCNN, compromising performance for real-time needs, especially on resource-limited edge devices. This paper proposes an energy-efficient SR accelerator, ACNPU, to tackle this challenge. The ACNPU enhances image quality by 0.34 dB with a 27-layer model, yet requires 36% less complexity than FSRCNN while maintaining a similar model size, thanks to the decoupled asymmetric convolution and split-bypass structure. The hardware-friendly 17K-parameter model enables holistic model fusion instead of localized layer fusion to remove external DRAM access of intermediate feature maps. On-chip memory bandwidth and power consumption are further reduced with the input-stationary flow and parallel-layer execution. The hardware is regular and easy to control, supporting different layers with processing element (PE) clusters that have reconfigurable inputs and a uniform data flow. The implementation in a 40 nm CMOS process consumes 2333K gate counts and 198 KB of SRAM. The ACNPU achieves 31.7 FPS and 124.4 FPS for ×2 and ×4 scale Full HD generation, respectively, and attains 4.75 TOPS/W energy efficiency.
AB - Deep learning-driven super resolution (SR) outperforms traditional techniques but also faces the challenge of high complexity and memory bandwidth. This challenge leads many accelerators to opt for simpler and shallower models like FSRCNN, compromising performance for real-time needs, especially on resource-limited edge devices. This paper proposes an energy-efficient SR accelerator, ACNPU, to tackle this challenge. The ACNPU enhances image quality by 0.34 dB with a 27-layer model, yet requires 36% less complexity than FSRCNN while maintaining a similar model size, thanks to the decoupled asymmetric convolution and split-bypass structure. The hardware-friendly 17K-parameter model enables holistic model fusion instead of localized layer fusion to remove external DRAM access of intermediate feature maps. On-chip memory bandwidth and power consumption are further reduced with the input-stationary flow and parallel-layer execution. The hardware is regular and easy to control, supporting different layers with processing element (PE) clusters that have reconfigurable inputs and a uniform data flow. The implementation in a 40 nm CMOS process consumes 2333K gate counts and 198 KB of SRAM. The ACNPU achieves 31.7 FPS and 124.4 FPS for ×2 and ×4 scale Full HD generation, respectively, and attains 4.75 TOPS/W energy efficiency.
KW - AI accelerator
KW - asymmetric convolution neural network
KW - convolution neural network
KW - super resolution
UR - http://www.scopus.com/inward/record.url?scp=85179815122&partnerID=8YFLogxK
U2 - 10.1109/TCSI.2023.3336468
DO - 10.1109/TCSI.2023.3336468
M3 - Article
AN - SCOPUS:85179815122
SN - 1549-8328
VL - 71
SP - 670
EP - 679
JO - IEEE Transactions on Circuits and Systems I: Regular Papers
JF - IEEE Transactions on Circuits and Systems I: Regular Papers
IS - 2
ER -