A Real-Time 1280 × 720 Object Detection Chip With 585 MB/s Memory Traffic

Kuo Wei Chang, Hsu Tung Shih, Tian Sheuan Chang, Shang Hong Tsai, Chih Chyau Yang, Chien Ming Wu, Chun Ming Huang

Research output: Contribution to journal › Article › peer-review

Abstract

Memory bandwidth has become the real-time bottleneck of current deep learning accelerators (DLAs), particularly for high definition (HD) object detection. Under resource constraints, this article proposes a low memory traffic DLA chip with joint hardware and software optimization. To maximize hardware utilization under limited memory bandwidth, we morph and fuse the object detection model into a group fusion-ready model to reduce intermediate data access. This reduces YOLOv2's feature memory traffic from 2.9 to 0.15 GB/s. To support group fusion, the hardware, based on our previous DLA, employs a unified buffer with write masking for simple layer-by-layer processing in a fusion group. Compared to our previous DLA with the same number of processing elements (PEs), the chip, implemented in a 40-nm process, supports 1280 × 720 object detection at 30 frames per second (FPS) and consumes 7.9× less external dynamic random access memory (DRAM) access energy, reduced from 2607 to 327.6 mJ.
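The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable and function names are hypothetical, the inputs are the numbers stated in the title and abstract, and the per-frame value assumes the 585 MB/s in the title is spread evenly over 30 frames per second.

```python
# Figures taken from the title and abstract; names are illustrative only.
FPS = 30                             # frames per second
TOTAL_TRAFFIC_MB_S = 585.0           # memory traffic from the title, MB/s
FEATURE_TRAFFIC_BEFORE_GB_S = 2.9    # YOLOv2 feature traffic before fusion
FEATURE_TRAFFIC_AFTER_GB_S = 0.15    # YOLOv2 feature traffic after fusion
DRAM_ENERGY_BEFORE_MJ = 2607.0       # previous DLA, external DRAM access energy
DRAM_ENERGY_AFTER_MJ = 327.6         # this chip, external DRAM access energy


def reduction_factor(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before`."""
    return before / after


# Reduction in feature-map memory traffic due to group fusion.
feature_reduction = reduction_factor(
    FEATURE_TRAFFIC_BEFORE_GB_S, FEATURE_TRAFFIC_AFTER_GB_S
)

# Reduction in external DRAM access energy (reported as 7.9x in the abstract).
energy_reduction = reduction_factor(DRAM_ENERGY_BEFORE_MJ, DRAM_ENERGY_AFTER_MJ)

# Average memory traffic per frame, assuming an even spread across frames.
traffic_per_frame_mb = TOTAL_TRAFFIC_MB_S / FPS

print(f"feature traffic reduction: {feature_reduction:.1f}x")
print(f"DRAM energy reduction:     {energy_reduction:.2f}x")
print(f"traffic per frame:         {traffic_per_frame_mb:.1f} MB")
```

Under these assumptions the feature traffic drops by roughly 19×, the energy ratio comes out slightly under 8 (consistent with the reported 7.9×), and the chip moves about 19.5 MB per frame on average.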

Keywords

  • Data models
  • Deep learning
  • Deep learning accelerator (DLA)
  • Hardware
  • high definition (HD)
  • layer fusion
  • Object detection
  • Random access memory
  • Real-time systems
  • Training
