gpu的优势

ok先借由nvidia官网（https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#introduction）展示一下为啥用gpu，而不利用gpu，下图是每秒的浮点计算。

为啥gpu那么快？官网提到：

The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation - exactly what graphics rendering is about - and therefore designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 3.

CPU和GPU之间差异背后的原因在，浮点能力中，gpu是专门用于计算密集,高速并行计算,尤其图形渲染的,因此,设计更多的晶体管用于数据处理,而不是数据缓存和流控制（而cpu是）,如图3示意图所示。所以gpu数据处理的单元更多如下图。

ok，现在应该知道了gpu的优势了，那我们开始学习gpu的一些结构，方便以后的开发。

Kernels

官方：

CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.

CUDA C通过允许程序员定义名为内核的C函数来扩展C语言，当被调用时，它会被N个不同的CUDA线程并行执行N次，而不是像常规的C函数那样。

A kernel is defined using the __global__ declaration specifier and the number of CUDA threads that execute that kernel for a given kernel call is specified using a new <<<...>>>execution configuration syntax (see C Language Extensions). Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.

内核是使用__global__说明符来定义的，并且使用新的执行配置语法（参见C语言扩展）。执行内核的每一个线程都有一个独一无二的线程ID（thread ID)，它可以通过内置的threadIdx变量在内核中访问,（thradIdx 相当于指针索引thread id的感觉）

eg：

__global__ void VecAdd(float* A, float* B, float* C)
{...}
VecAdd<<<1, N>>>(A, B, C);

vecadd是一个kernel，通过__global__ 定义；
dg=1，代表网格中线程块数；
db=N，代表线程块中的线程数目；
A,B,C是threadIdx 索引，很明显，这个是三维的；

这里做一下说明：threadIDX可以是1-d，2-d，3-d。线程的索引及其线程ID以直接的方式相互关联：对于一维块，它们是相同的; 对于大小为二维的块（D x，D y），索引（x，y）的线程的线程ID 是（x + y D x） ; 对于大小为三维的块 （D x，D y，D z），索引（x，y，z）的线程的线程ID 是（x + y D x+ z D x D y）。

下图以二维的线程块作为实例：

a kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks.

内核可以由多个相等的线程块来执行（下图上半部分），这样线程的总数就等于每个块的线程数乘以块的数量。一个块又分为很多线程（二维时如下图）

ok，至此为止，线程与线程块应该有比较基本的了解了。

参考：《cuda高性能并行计算》

nvidia：https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#introduction

win10 cuda_小白之旅（3）：了解cuda

gpu的优势

Kernels

猜你喜欢