Literature Notes (2)


Notes on the paper A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks.

1 introduction

  • A spatial architecture based on a new CNN dataflow, called row stationary (RS), which is optimized for throughput and energy efficiency
  • An analysis framework that can quantify the energy efficiency of different CNN dataflows

spatial architecture

Spatial architectures (SAs) are a class of accelerators that exploit high compute parallelism using direct communication between an array of relatively simple processing engines (PEs).

  • coarse-grained SAs: tiled arrays of ALU-style PEs connected via an on-chip network (the type targeted in this paper)
  • fine-grained SAs: e.g., FPGAs


2 CNN background

2.1 challenges in CNN processing

2.1.1 data handling

The MAC operations can run at high parallelism (see the loop-nest sketch below), which benefits throughput, but:

  • reading inputs for the MACs directly from DRAM requires high bandwidth and incurs high energy consumption
  • a significant amount of intermediate data (psums) is generated by the parallel MACs simultaneously, which puts pressure on storage
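For reference, a minimal (unoptimized) loop-nest sketch of a CONV layer. The shape names (N ifmaps, C channels, M filters, R×R filter, E×E ofmap) follow the paper; stride 1 and no padding are simplifying assumptions of this sketch:

```python
import numpy as np

def conv_layer(ifmaps, filters):
    """Naive CONV layer. Every MAC across (n, m, x, y) is independent,
    which is the parallelism a spatial architecture exploits."""
    N, C, H, _ = ifmaps.shape           # N ifmaps, C channels, H x H pixels
    M, _, R, _ = filters.shape          # M filters, C channels, R x R weights
    E = H - R + 1                       # ofmap size (stride 1, no padding)
    ofmaps = np.zeros((N, M, E, E))     # psums live here until final
    for n in range(N):                  # batch
        for m in range(M):              # filters / output channels
            for x in range(E):          # ofmap row
                for y in range(E):      # ofmap column
                    for c in range(C):  # input channels
                        for i in range(R):
                            for j in range(R):
                                ofmaps[n, m, x, y] += (
                                    ifmaps[n, c, x + i, y + j]
                                    * filters[m, c, i, j])
    return ofmaps
```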

input data reuse

  • convolutional reuse: each filter weight is reused across sliding windows within an ifmap plane, and each ifmap pixel across filter positions (CONV layers only)
  • filter reuse: each filter is reused across the batch of N ifmaps (both CONV and FC layers)
  • ifmap reuse: each ifmap pixel is reused across the M filters (both CONV and FC layers)

Exploiting this reuse requires proper operation scheduling; the sketch below puts rough numbers on the reuse factors.
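A back-of-the-envelope sketch of the per-datum reuse counts implied by the loop nest above; the function name and the example shape are illustrative, not taken from the paper:

```python
def reuse_factors(N, M, E, R):
    """Rough per-datum reuse counts for one CONV layer (stride 1)."""
    return {
        "weight used at ofmap positions (convolutional reuse)": E * E,
        "interior ifmap pixel used at filter positions":        R * R,
        "weight reused across the batch (filter reuse)":        N,
        "ifmap pixel reused across filters (ifmap reuse)":      M,
    }

# An AlexNet-CONV1-like shape, for illustration: 11x11 filter, 55x55 ofmap
print(reuse_factors(N=4, M=96, E=55, R=11))
```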

2.1.2 adaptive processing

Each layer has a distinct shape configuration (filter size, number of channels and filters, fmap size), so the hardware mapping must adapt per layer.

2.2 CNN vs image processing

  • filter weights in CNNs are not fixed; they are learned and differ across layers and models
  • image signal processing (ISP) techniques are developed mainly for 2D convolutions

3 existing CNN dataflows

3.1 weight stationary (WS) dataflow

Each filter weight remains stationary in the RF to maximize convolutional reuse and filter reuse.
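A 1D, single-channel sketch of the WS ordering, assuming one "PE" per weight; the point the comments mark is that the psums are the data that must move every cycle:

```python
def weight_stationary_1d(ifmap, filt):
    """WS sketch, 1D and single channel: each "PE" pins one weight in
    its RF while ifmap pixels stream past; each psum is read, updated,
    and written back on every MAC."""
    E = len(ifmap) - len(filt) + 1
    psum = [0.0] * E                   # psums live outside the PEs
    for i, w in enumerate(filt):       # weight w is stationary in PE i
        for x in range(E):             # ifmap pixels stream through PE i
            psum[x] += w * ifmap[x + i]
    return psum
```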

3.2 output stationary (OS) dataflow

The accumulation of each ofmap pixel stays stationary in a PE. The psums are stored in the same RF for accumulation to minimize the psum accumulation cost.
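The same 1D setting rearranged into OS order; the arithmetic is identical to the WS sketch, and only the data residency (marked in the comments) changes:

```python
def output_stationary_1d(ifmap, filt):
    """OS sketch (1D): each "PE" owns one ofmap pixel, and its psum
    never leaves the local RF until it is final; inputs stream in."""
    R = len(filt)
    E = len(ifmap) - R + 1
    ofmap = []
    for x in range(E):                 # one PE per ofmap pixel
        acc = 0.0                      # psum pinned in this PE's RF
        for i in range(R):             # weights/pixels streamed in
            acc += filt[i] * ifmap[x + i]
        ofmap.append(acc)
    return ofmap
```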

3.3 no local reuse (NLR) dataflow

It does not exploit data reuse at the RF level, and uses inter-PE communication for ifmap reuse and psum accumulation.
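And the NLR ordering in the same 1D setting. Again only the residency comments differ from the sketches above, which is exactly the point: these dataflows differ in where data lives, not in the math:

```python
def no_local_reuse_1d(ifmap, filt):
    """NLR sketch (1D): no operand is kept in an RF; weights and
    pixels are fetched from the global buffer, and the psum is handed
    from PE to PE for accumulation."""
    R = len(filt)
    E = len(ifmap) - R + 1
    out = []
    for x in range(E):
        psum = 0.0                     # forwarded PE -> PE, never stored
        for i in range(R):             # operands come from the buffer
            psum += filt[i] * ifmap[x + i]
        out.append(psum)
    return out
```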

4 Row stationary dataflow

4.1 convolution primitives

It breaks the high-dimensional convolution down into 1D convolution primitives that can run in parallel: each primitive operates on one row of filter weights and one row of ifmap pixels, and generates one row of psums.
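A sketch of the 1D primitive as it would run inside one PE (stride 1 assumed; `conv1d_primitive` is an illustrative name, not from the paper):

```python
def conv1d_primitive(weight_row, ifmap_row):
    """One RS primitive inside one PE: a row of weights slides over a
    row of ifmap pixels and yields a row of psums."""
    R = len(weight_row)
    E = len(ifmap_row) - R + 1
    return [sum(weight_row[i] * ifmap_row[x + i] for i in range(R))
            for x in range(E)]

# Psum rows from vertically adjacent PEs are then summed across the
# array to complete each ofmap row.
print(conv1d_primitive([1, 2, 3], [1, 0, 0, 1, 0]))  # -> [1, 3, 2]
```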

4.2 two-step primitive mapping

4.2.1 logical mapping

The logical mapping first deploys the primitives into a logical PE array, which has the same size as the number of 1D convolution primitives and is usually much larger than the physical PE array in hardware.
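For a single 2D convolution this works out to an R×E logical array: each of the E columns produces one ofmap row, and the R PEs in a column each run one filter row. A sketch (stride 1 assumed):

```python
def logical_array_size(R, E):
    """Logical PE array for one 2D convolution: R * E primitives, laid
    out so each of the E columns computes one ofmap row and the R PEs
    in a column each handle one filter row (psums add down a column)."""
    return R, E                        # (rows, columns) of logical PEs

# e.g. a 5x5 filter and 27x27 ofmap need a 5x27 logical array, and it
# grows much larger once the C, M, and N dimensions are mapped as well.
print(logical_array_size(R=5, E=27))
```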

4.2.2 physical mapping

The physical mapping folds the logical PE array so that it fits into the physical PE array.
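A deliberately simplified folding sketch, assuming the only strategy is tiling the logical array into physical-array-sized pieces (the paper's mapping also folds and replicates across other dimensions):

```python
import math

def folding_passes(logical_rows, logical_cols, phys_rows, phys_cols):
    """How many processing passes the folded mapping needs if the only
    strategy is cutting the logical array into physical-sized tiles."""
    return (math.ceil(logical_rows / phys_rows)
            * math.ceil(logical_cols / phys_cols))

# e.g. folding a 5x55 logical array onto a 12x14 physical array:
print(folding_passes(5, 55, 12, 14))   # 1 * 4 = 4 passes
```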

4.3 energy-efficient data handling

Data reuse and psum accumulation are handled at three levels of the storage hierarchy:

  • RF (register file inside each PE)
  • array (inter-PE communication)
  • global buffer
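A toy cost model over these levels. The relative access energies (RF 1×, inter-PE array 2×, global buffer 6×, DRAM 200×, normalized to an RF access) are roughly the figures the paper's analysis uses; the access counts here are placeholders:

```python
# Relative energy per access, normalized to one RF access; DRAM is by
# far the most expensive level, so dataflows try to avoid it.
COST = {"RF": 1, "array": 2, "buffer": 6, "DRAM": 200}

def data_movement_energy(accesses):
    """accesses: dict mapping a hierarchy level to its access count."""
    return sum(COST[level] * n for level, n in accesses.items())

# A dataflow wins by shifting accesses toward the cheap levels:
print(data_movement_energy(
    {"RF": 1_000_000, "array": 50_000, "buffer": 20_000, "DRAM": 1_000}))
```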

4.4 support for different layer types

  • FC layer: runs the same dataflow, just without convolutional data reuse
  • pool layer: supported by swapping the MAC computation for a max-comparison function in the ALU of each PE (see the sketch below)
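A sketch of that per-PE operation swap (the function name and the `mode` switch are illustrative, not the paper's interface):

```python
def pe_alu_op(pixel, weight, acc, mode="conv"):
    """Inner PE operation: MAC for CONV/FC layers, max-compare for
    pooling (weight is ignored in pool mode)."""
    if mode == "pool":
        return max(acc, pixel)         # max pooling: keep the running max
    return acc + pixel * weight        # MAC: multiply-accumulate
```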

4.5 other architectural features


Reposted from blog.csdn.net/tiaozhanzhe1900/article/details/83069854