Deep learning applications are deployed to resource- and energy-constrained edge devices via compact, sparse CNN models. However, sparsity, feature map sizes, and filter shapes vary widely across deep networks, resulting in inefficient resource utilization and data movement. In this paper, an energy-efficient accelerator for compact sparse CNNs is proposed, built on a flexible hierarchical on-chip interconnection architecture, 32 processing element (PE) tiles, and 3D-SRAM. The 3D-SRAM serves as distributed memory for the PE tiles, holding intermediate data between layers to reduce the energy consumed by off-chip DRAM accesses. On top of this distributed 3D-SRAM, an output-stationary dataflow is adopted, eliminating the movement of partial sums among PEs. The 32 PE tiles are therefore connected through a configurable ring-based unicast global network with micro-routers, which lowers implementation cost compared to a typical router for a mesh network. Within each PE tile, an all-to-all local network supports the widely varying feature sizes, filter shapes, and non-zero activation computations of compact sparse CNNs. Overall, the proposed accelerator achieves 509.8 inferences/s, 1860.5 inferences/J, and 383.3 GOPS/W on MobileNetV2, improving energy efficiency by a factor of 1.43x over a dense architecture.
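
To make the output-stationary claim concrete, the sketch below gives a minimal convolution loop nest in C. This is our own illustration under simplifying assumptions, not the paper's tiling or RTL: each output position stands in for one PE's assigned work, and its partial sum accumulates in a local register until complete, so partial sums never cross the interconnect. All dimensions (H, W, C, K) and the dense layout are hypothetical; the actual accelerator operates on sparse, compressed data.

```c
/* Minimal sketch of an output-stationary convolution schedule.
 * Each output element is accumulated in place until complete,
 * so no partial sums move between "PEs". Dimensions are
 * illustrative assumptions, not the paper's configuration. */
#include <stdio.h>

#define H 8   /* output height (assumed) */
#define W 8   /* output width (assumed) */
#define C 4   /* input channels (assumed) */
#define K 3   /* filter size (assumed) */

float in[C][H + K - 1][W + K - 1];
float wgt[C][K][K];
float out[H][W];

int main(void) {
    /* Fill inputs and weights with simple deterministic values. */
    for (int c = 0; c < C; c++) {
        for (int y = 0; y < H + K - 1; y++)
            for (int x = 0; x < W + K - 1; x++)
                in[c][y][x] = (float)((c + y + x) % 5);
        for (int ky = 0; ky < K; ky++)
            for (int kx = 0; kx < K; kx++)
                wgt[c][ky][kx] = 0.1f * (float)(ky + kx + 1);
    }

    /* Output-stationary loop nest: the two outer loops pick one
     * output position (one PE's work); the inner loops stream
     * inputs and weights past the locally held partial sum. */
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            float psum = 0.0f;  /* stays local to this PE */
            for (int c = 0; c < C; c++)
                for (int ky = 0; ky < K; ky++)
                    for (int kx = 0; kx < K; kx++)
                        psum += in[c][y + ky][x + kx] * wgt[c][ky][kx];
            out[y][x] = psum;   /* written once, fully accumulated */
        }
    }

    printf("out[0][0] = %f\n", out[0][0]);
    return 0;
}
```

In this schedule only inputs and weights stream through the network, which is what allows the global interconnect to be a low-cost unicast ring rather than a full mesh.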