Literature Notes (2)


Notes on the paper A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks.

1 introduction

  • A spatial architecture based on a new CNN dataflow, called row stationary (RS), which is optimized for throughput and energy efficiency
  • An analysis framework that can quantify the energy efficiency of different CNN dataflows

spatial architecture

Spatial architectures (SAs) are a class of accelerators that exploit high compute parallelism using direct communication between an array of relatively simple processing engines (PEs).

  • coarse-grained SAs: tiled arrays of ALU-style PEs connected via an on-chip network (the type targeted in this paper)
  • fine-grained SAs: e.g., FPGAs


2 CNN background

2.1 challenges in CNN processing

2.1.1 data handling

The MAC operations can run at high parallelism (see the loop-nest sketch below), which benefits throughput, but:

  • reading inputs for the MACs directly from DRAM requires high bandwidth and incurs high energy consumption
  • a significant amount of intermediate data (psums) is generated by the parallel MACs simultaneously, which puts pressure on storage
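For reference, a minimal (unoptimized) loop-nest sketch of a CONV layer. The shape names (N ifmaps, C channels, M filters, R×R filter, E×E ofmap) follow the paper; stride 1 and no padding are simplifying assumptions of this sketch:

```python
import numpy as np

def conv_layer(ifmaps, filters):
    """Naive CONV layer. Every MAC across (n, m, x, y) is independent,
    which is the parallelism a spatial architecture exploits."""
    N, C, H, _ = ifmaps.shape           # N ifmaps, C channels, H x H pixels
    M, _, R, _ = filters.shape          # M filters, C channels, R x R weights
    E = H - R + 1                       # ofmap size (stride 1, no padding)
    ofmaps = np.zeros((N, M, E, E))     # psums live here until final
    for n in range(N):                  # batch
        for m in range(M):              # filters / output channels
            for x in range(E):          # ofmap row
                for y in range(E):      # ofmap column
                    for c in range(C):  # input channels
                        for i in range(R):
                            for j in range(R):
                                ofmaps[n, m, x, y] += (
                                    ifmaps[n, c, x + i, y + j]
                                    * filters[m, c, i, j])
    return ofmaps
```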

input data reuse

  • convolutional reuse: each filter weight is reused across sliding windows within an ifmap plane, and each ifmap pixel across filter positions (CONV layers only)
  • filter reuse: each filter is reused across the batch of N ifmaps (both CONV and FC layers)
  • ifmap reuse: each ifmap pixel is reused across the M filters (both CONV and FC layers)

Exploiting this reuse requires proper operation scheduling; the sketch below puts rough numbers on the reuse factors.
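A back-of-the-envelope sketch of the per-datum reuse counts implied by the loop nest above; the function name and the example shape are illustrative, not taken from the paper:

```python
def reuse_factors(N, M, E, R):
    """Rough per-datum reuse counts for one CONV layer (stride 1)."""
    return {
        "weight used at ofmap positions (convolutional reuse)": E * E,
        "interior ifmap pixel used at filter positions":        R * R,
        "weight reused across the batch (filter reuse)":        N,
        "ifmap pixel reused across filters (ifmap reuse)":      M,
    }

# An AlexNet-CONV1-like shape, for illustration: 11x11 filter, 55x55 ofmap
print(reuse_factors(N=4, M=96, E=55, R=11))
```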

2.1.2 adaptive processing

Each layer has a distinct shape configuration (filter size, number of channels and filters, fmap size), so the hardware mapping must adapt per layer.

2.2 CNN vs image processing

  • filter weights in CNNs are not fixed; they are learned and differ across layers and models
  • image signal processing (ISP) techniques are developed mainly for 2D convolutions

3 existing CNN dataflows

3.1 weight stationary (WS) dataflow

Each filter weight remains stationary in the RF to maximize convolutional reuse and filter reuse.
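A 1D, single-channel sketch of the WS ordering, assuming one "PE" per weight; the point the comments mark is that the psums are the data that must move every cycle:

```python
def weight_stationary_1d(ifmap, filt):
    """WS sketch, 1D and single channel: each "PE" pins one weight in
    its RF while ifmap pixels stream past; each psum is read, updated,
    and written back on every MAC."""
    E = len(ifmap) - len(filt) + 1
    psum = [0.0] * E                   # psums live outside the PEs
    for i, w in enumerate(filt):       # weight w is stationary in PE i
        for x in range(E):             # ifmap pixels stream through PE i
            psum[x] += w * ifmap[x + i]
    return psum
```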

3.2 output stationary (OS) dataflow

The accumulation of each ofmap pixel stays stationary in a PE. The psums are stored in the same RF for accumulation to minimize the psum accumulation cost.
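The same 1D setting rearranged into OS order; the arithmetic is identical to the WS sketch, and only the data residency (marked in the comments) changes:

```python
def output_stationary_1d(ifmap, filt):
    """OS sketch (1D): each "PE" owns one ofmap pixel, and its psum
    never leaves the local RF until it is final; inputs stream in."""
    R = len(filt)
    E = len(ifmap) - R + 1
    ofmap = []
    for x in range(E):                 # one PE per ofmap pixel
        acc = 0.0                      # psum pinned in this PE's RF
        for i in range(R):             # weights/pixels streamed in
            acc += filt[i] * ifmap[x + i]
        ofmap.append(acc)
    return ofmap
```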

3.3 no local reuse (NLR) dataflow

It does not exploit data reuse at the RF level, and uses inter-PE communication for ifmap reuse and psum accumulation.
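And the NLR ordering in the same 1D setting. Again only the residency comments differ from the sketches above, which is exactly the point: these dataflows differ in where data lives, not in the math:

```python
def no_local_reuse_1d(ifmap, filt):
    """NLR sketch (1D): no operand is kept in an RF; weights and
    pixels are fetched from the global buffer, and the psum is handed
    from PE to PE for accumulation."""
    R = len(filt)
    E = len(ifmap) - R + 1
    out = []
    for x in range(E):
        psum = 0.0                     # forwarded PE -> PE, never stored
        for i in range(R):             # operands come from the buffer
            psum += filt[i] * ifmap[x + i]
        out.append(psum)
    return out
```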

4 Row stationary dataflow

4.1 convolution primitives

It breaks the high-dimensional convolution down into 1D convolution primitives that can run in parallel: each primitive operates on one row of filter weights and one row of ifmap pixels, and generates one row of psums.
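A sketch of the 1D primitive as it would run inside one PE (stride 1 assumed; `conv1d_primitive` is an illustrative name, not from the paper):

```python
def conv1d_primitive(weight_row, ifmap_row):
    """One RS primitive inside one PE: a row of weights slides over a
    row of ifmap pixels and yields a row of psums."""
    R = len(weight_row)
    E = len(ifmap_row) - R + 1
    return [sum(weight_row[i] * ifmap_row[x + i] for i in range(R))
            for x in range(E)]

# Psum rows from vertically adjacent PEs are then summed across the
# array to complete each ofmap row.
print(conv1d_primitive([1, 2, 3], [1, 0, 0, 1, 0]))  # -> [1, 3, 2]
```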

4.2 two-step primitive mapping

4.2.1 logical mapping

The logical mapping first deploys the primitives into a logical PE array, which has the same size as the number of 1D convolution primitives and is usually much larger than the physical PE array in hardware.
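For a single 2D convolution this works out to an R×E logical array: each of the E columns produces one ofmap row, and the R PEs in a column each run one filter row. A sketch (stride 1 assumed):

```python
def logical_array_size(R, E):
    """Logical PE array for one 2D convolution: R * E primitives, laid
    out so each of the E columns computes one ofmap row and the R PEs
    in a column each handle one filter row (psums add down a column)."""
    return R, E                        # (rows, columns) of logical PEs

# e.g. a 5x5 filter and 27x27 ofmap need a 5x27 logical array, and it
# grows much larger once the C, M, and N dimensions are mapped as well.
print(logical_array_size(R=5, E=27))
```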

4.2.2 physical mapping

The physical mapping folds the logical PE array so that it fits into the physical PE array.
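A deliberately simplified folding sketch, assuming the only strategy is tiling the logical array into physical-array-sized pieces (the paper's mapping also folds and replicates across other dimensions):

```python
import math

def folding_passes(logical_rows, logical_cols, phys_rows, phys_cols):
    """How many processing passes the folded mapping needs if the only
    strategy is cutting the logical array into physical-sized tiles."""
    return (math.ceil(logical_rows / phys_rows)
            * math.ceil(logical_cols / phys_cols))

# e.g. folding a 5x55 logical array onto a 12x14 physical array:
print(folding_passes(5, 55, 12, 14))   # 1 * 4 = 4 passes
```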

4.3 energy-efficient data handling

Data reuse and psum accumulation are handled at three levels of the storage hierarchy:

  • RF (register file inside each PE)
  • array (inter-PE communication)
  • global buffer
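A toy cost model over these levels. The relative access energies (RF 1×, inter-PE array 2×, global buffer 6×, DRAM 200×, normalized to an RF access) are roughly the figures the paper's analysis uses; the access counts here are placeholders:

```python
# Relative energy per access, normalized to one RF access; DRAM is by
# far the most expensive level, so dataflows try to avoid it.
COST = {"RF": 1, "array": 2, "buffer": 6, "DRAM": 200}

def data_movement_energy(accesses):
    """accesses: dict mapping a hierarchy level to its access count."""
    return sum(COST[level] * n for level, n in accesses.items())

# A dataflow wins by shifting accesses toward the cheap levels:
print(data_movement_energy(
    {"RF": 1_000_000, "array": 50_000, "buffer": 20_000, "DRAM": 1_000}))
```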

4.4 support for different layer types

  • FC layer: runs the same dataflow, just without convolutional data reuse
  • pool layer: supported by swapping the MAC computation for a max-comparison function in the ALU of each PE (see the sketch below)
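A sketch of that per-PE operation swap (the function name and the `mode` switch are illustrative, not the paper's interface):

```python
def pe_alu_op(pixel, weight, acc, mode="conv"):
    """Inner PE operation: MAC for CONV/FC layers, max-compare for
    pooling (weight is ignored in pool mode)."""
    if mode == "pool":
        return max(acc, pixel)         # max pooling: keep the running max
    return acc + pixel * weight        # MAC: multiply-accumulate
```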

4.5 other architectural features


Reposted from blog.csdn.net/tiaozhanzhe1900/article/details/83069854