1. Introduction

Thrust是一个基于标准模板库（STL）的C++模板库，用于CUDA。Thrust通过与CUDA C完全互操作的高级接口，使您能够以最小的编程工作量实现高性能并行应用程序。

Thrust提供了丰富的数据并行原语，如扫描、排序和归约等，可以组合在一起以实现具有简洁、可读源代码的复杂算法。通过用这些高级抽象描述计算，您为Thrust提供了自动选择最有效实现的自由。因此，Thrust可用于CUDA应用程序的快速原型设计，其中程序员的生产力最为重要，也可用于生产中，其中健壮性和绝对性能至关重要。

本文档介绍了如何使用Thrust开发CUDA应用程序。即使您具有有限的C++或CUDA经验，本教程也旨在易于理解。这里结合了官方的文档以及自己的例子进行了完善

2. Vectors

Thrust提供了两种向量容器，host_vector和device_vector。正如名称所示，host_vector存储在主机内存中，而device_vector存储在GPU设备内存中。Thrust的向量容器就像C++ STL中的std::vector一样。与std::vector一样，host_vector和device_vector是通用容器（能够存储任何数据类型），可以动态调整大小。下面的源代码演示了如何使用Thrust的向量容器。

 #include <thrust/host_vector.h>
 #include <thrust/device_vector.h>

 #include <iostream>

int main(void)
{
    
    
  // H可以存储4个整数
  thrust::host_vector<int> H(4);

  // 初始化单个元素
  H[0] = 14;
  H[1] = 20;
  H[2] = 38;
  H[3] = 46;
    
  // H.size()返回向量H的大小
  std::cout << "H has size " << H.size() << std::endl;

  // 打印H的内容
  for(int i = 0; i < H.size(); i++)
  {
    
    
    std::cout << "H[" << i << "] = " << H[i] << std::endl;
  }

  // 调整H
  H.resize(2);
    
  std::cout << "H now has size " << H.size() << std::endl;

  // 复制主机向量H到设备向量D
  thrust::device_vector<int> D = H;
    
  // D中的元素可以修改
  D[0] = 99;
  D[1] = 88;
    
  // 打印D的内容
  for(int i = 0; i < D.size(); i++)
  {
    
    
    std::cout << "D[" << i << "] = " << D[i] << std::endl;
  }

  // H和D在函数返回时自动被销毁
  return 0;
}

正如这个例子所示，= 运算符可用于将 host_vector 复制到 device_vector（或反之亦然）。= 运算符也可用于将 host_vector 复制到 host_vector 或 device_vector 复制到 device_vector。此外，还应注意，可以使用标准括号表示法访问 device_vector 的各个元素。但是，由于每个访问都需要调用 cudaMemcpy，因此应该谨慎使用。我们将在后面看一些更有效的技术。

通常将向量的所有元素初始化为特定值或仅复制某个值集合以从一个向量复制到另一个向量很有用。Thrust 提供了一些执行这些类型操作的方式。

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

#include <thrust/copy.h>
#include <thrust/fill.h>
#include <thrust/sequence.h>

#include <iostream>

int main(void)
{
    
    
  // 初始化device_vector中的所有10个整数为1
  thrust::device_vector<int> D(10, 1);

  // 将向量的前7个元素设置为9
  thrust::fill(D.begin(), D.begin() + 7, 9);

  // 用D的前5个元素初始化host_vector
  thrust::host_vector<int> H(D.begin(), D.begin() + 5);

  // 设置H的元素为0,1,2,3，…sequence（顺序）
  thrust::sequence(H.begin(), H.end());

  // 复制所有H到D的开头
  thrust::copy(H.begin(), H.end(), D.begin());

  // 打印D
  for(int i = 0; i < D.size(); i++)
  {
    
    
    std::cout << "D[" << i << "] = " << D[i] << std::endl;
  }

  return 0;
}

这里我们展示了fill、copy和sequence函数的使用。copy函数可用于将一段host或device元素的范围复制到另一个host或device向量中。与相应的STL函数类似，thrust::fill只是将一段元素设置为特定值。Thrust的**sequence函数可用于创建一系列等间隔值**。

2.1 Thrust命名空间

您会注意到我们在示例中使用了thrust::host_vector或thrust::copy等内容。thrust::部分告诉C++编译器我们要查找thrust命名空间中的特定函数或类。命名空间是避免名称冲突的一种好方法。例如，thrust::copy与STL中提供的std::copy不同。C++命名空间使我们能够区分这两个copy函数。

2.2 迭代器和静态分派

在本节中，我们使用了类似H.begin()和H.end()的表达式或类似D.begin() + 7的偏移量。begin()和end()的结果在C++中称为迭代器。对于向量容器，它们实际上只是数组，迭代器可以被视为指向数组元素的指针。因此，H.begin()是一个迭代器，指向存储在H向量中的第一个元素。类似地，H.end()指向H向量的最后一个元素后面的一个元素。

尽管向量迭代器类似于指针，但它们携带更多的信息。请注意，我们不需要告诉thrust::fill它正在操作一个device_vector迭代器。这个信息被捕获在D.begin()返回的迭代器类型中，该类型与H.begin()返回的类型不同。当调用Thrust函数时，它检查迭代器的类型以确定是否使用host或device实现。这个过程被称为静态分派，因为host/device分派在编译时解决。请注意，这意味着分派过程没有运行时开销。

您可能会想知道当一个“原始”指针被用作Thrust函数的参数时会发生什么。像STL一样，Thrust允许这种用法，并将调度算法的host路径。如果所讨论的指针实际上是指向device内存的指针，则需要在调用函数之前使用thrust::device_ptr将其包装起来。例如：

size_t N = 10;

// 指向设备内存的原始指针
int * raw_ptr;
cudaMalloc((void **) &raw_ptr, N * sizeof(int));

// 用device_ptr包装原始指针
thrust::device_ptr<int> dev_ptr(raw_ptr);

// 在thrust算法中使用PTR设备
thrust::fill(dev_ptr, dev_ptr + N, (int) 0);

使用 raw_pointer_cast 函数可以从 device_ptr 中提取一个原始指针，如下所示：

size_t N = 10;

// 创建device_ptr
thrust::device_ptr<int> dev_ptr = thrust::device_malloc<int>(N);
     
// 从device_ptr中提取原始指针
int * raw_ptr = thrust::raw_pointer_cast(dev_ptr);

另一个将迭代器和指针区分开的原因是迭代器可用于遍历许多种数据结构。例如，STL提供了一个双向迭代器（但不支持随机访问）的链表容器（std::list）。尽管Thrust不提供此类容器的设备实现，但它与它们兼容。

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <list>
#include <vector>

int main(void)
{
    
    
  // 创建一个包含4个值的STL列表
  std::list<int> stl_list;

  stl_list.push_back(10);
  stl_list.push_back(20);
  stl_list.push_back(30);
  stl_list.push_back(40);

  // 用列表初始化一个设备向量
  thrust::device_vector<int> D(stl_list.begin(), stl_list.end());

  // 复制一个设备向量到一个STL向量
  std::vector<int> stl_vector(D.size());
  thrust::copy(D.begin(), D.end(), stl_vector.begin());

  return 0;
}

到目前为止，我们介绍的迭代器虽然有用，但相当基础。除了这些常规迭代器之外，Thrust还提供了一组名为counting_iterator和zip_iterator等花哨的迭代器。虽然它们看起来和感觉像普通迭代器，但是花哨的迭代器可以做更令人兴奋的事情。我们将在教程的后面重新讨论这个话题。

int main(void)
{
    
    
    // 这个例子计算序列中所有非零值的索引

    // 零和非零值的序列
    thrust::device_vector<int> stencil(8);
    stencil[0] = 0;
    stencil[1] = 1;
    stencil[2] = 1;
    stencil[3] = 0;
    stencil[4] = 0;
    stencil[5] = 1;
    stencil[6] = 0;
    stencil[7] = 1;

    // 非零索引的存储
    thrust::device_vector<int> indices(8);
    
    // 计数迭代器定义一个序列[0,8)
    thrust::counting_iterator<int> first(0);
    thrust::counting_iterator<int> last = first + 8;

    // 计算非零元素的索引
    typedef thrust::device_vector<int>::iterator IndexIterator;

    IndexIterator indices_end = thrust::copy_if(first, last,
                                                stencil.begin(),
                                                indices.begin(),
                                                thrust::identity<int>());
    // 索引现在包含[1,2,5,7]

    // 打印结果
    std::cout << "found " << (indices_end - indices.begin()) << " nonzero values at indices:\n";
    thrust::copy(indices.begin(), indices_end, std::ostream_iterator<int>(std::cout, "\n"));

    return 0;
}

3. Algorithms

Thrust提供了大量常见的并行算法。其中许多算法在C ++标准库中都有直接类比，当存在等效的标准库函数时，我们选择名称（例如thrust::sort和std::sort）。
Thrust中的所有算法都具有主机和设备的实现。具体来说，当使用主机迭代器调用Thrust算法时，将调度主机路径。类似地，当使用设备迭代器定义范围时，将调用设备实现。
除了thrust::copy可以在主机和设备之间复制数据之外，**Thrust算法的所有迭代器参数都应该位于同一位置：要么全部在主机上，要么全部在设备上。**违反此要求时，编译器将生成错误消息。

3.1 Transformations

转换是将操作应用于一组（零个或多个）输入范围中的每个元素，然后将结果存储在目标范围内的算法。我们已经看到的一个例子是thrust::fill，它将范围的所有元素设置为指定值。其他转变包括thrust::sequence，thrust::replace还有thrust::transform。有关完整列表，请参阅文档。
以下源代码演示了几种转换算法。请注意，thrust::negate与thrust::modulus已知为函子在C++术语。推力提供这些和其他常见的仿函数像plus和multiplies文件中thrust/functional.h。

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <thrust/fill.h>
#include <thrust/replace.h>
#include <thrust/functional.h>
#include <iostream>

int main(void)
{
    
    
  // 分配3个device_vector，每个device_vector包含10个元素
  thrust::device_vector<int> X(10);
  thrust::device_vector<int> Y(10);
  thrust::device_vector<int> Z(10);

  // 初始化X为0、1、2、3、....
  thrust::sequence(X.begin(), X.end());

  // 计算Y = -X
  thrust::transform(X.begin(), X.end(), Y.begin(), thrust::negate<int>());

  // 用2填充Z
  thrust::fill(Z.begin(), Z.end(), 2);

  // 计算Y = X取2的模
  thrust::transform(X.begin(), X.end(), Z.begin(), Y.begin(), thrust::modulus<int>());

  // 把Y中的1换成10
  thrust::replace(Y.begin(), Y.end(), 1, 10);

  // 打印Y
  thrust::copy(Y.begin(), Y.end(), std::ostream_iterator<int>(std::cout, "\n"));
   
  return 0;    
}

虽然 thrust/functional.h 中的函数对象涵盖了大多数内置的算术和比较操作，但我们经常需要做一些不同的事情。例如，考虑向量操作 y <- a * x + y，其中x 和 y 是向量，a 是标量常量。这是任何 BLAS 库提供的著名的 SAXPY 操作。

如果我们想要使用 Thrust 实现 SAXPY，有几种选择。第一种是使用两个转换（一个加法和一个乘法）和一个填充有值 a 的临时向量。更好的选择是使用一个单独的转换，并使用用户定义的函数对象执行我们想要的操作。我们在下面的源代码中演示了这两种方法。

struct saxpy_functor
{
    
    
  const float a;

  saxpy_functor(float _a) : a(_a) {
    
    }

  __host__ __device__
  float operator()(const float& x, const float& y) const
  {
    
     
    return a * x + y;
  }
};

void saxpy_fast(float A, thrust::device_vector<float>& X, thrust::device_vector<float>& Y)
{
    
    
  // Y <- A * X + Y
  thrust::transform(X.begin(), X.end(), Y.begin(), Y.begin(), saxpy_functor(A));
}

void saxpy_slow(float A, thrust::device_vector<float>& X, thrust::device_vector<float>& Y)
{
    
    
  thrust::device_vector<float> temp(X.size());
   
  // temp <- A
  thrust::fill(temp.begin(), temp.end(), A);
    
  // temp <- A * X
  thrust::transform(X.begin(), X.end(), temp.begin(), temp.begin(), thrust::multiplies<float>());

  // Y <- A * X + Y
  thrust::transform(temp.begin(), temp.end(), Y.begin(), Y.begin(), thrust::plus<float>());
}

saxpy_fast和saxpy_slow都是有效的SAXPY实现，但saxpy_fast比saxpy_slow快得多。忽略分配临时向量和算术运算的成本，我们有以下成本：

fast_saxpy：执行2N个读取和N个写入

slow_saxpy：执行4N个读取和3N个写入

由于SAXPY是内存限制型的（其性能受内存带宽限制，而不是浮点性能），更多的读写操作使saxpy_slow更加昂贵。相比之下，saxpy_fast将执行得与优化的BLAS实现中的SAXPY一样快。在像SAXPY这样的内存限制算法中，通常值得应用内核融合（将多个操作组合成单个内核）来最小化内存事务的数量。

thrust::transform仅支持具有一个或两个输入参数的转换（例如 $f(x)\rightarrow y$
和 $f(x,x)\rightarrow y$ ）。当转换使用超过两个的输入参数时，需要使用不同的方法。arbitrary_transformation示例演示了使用thrust::zip_iterator``thrust::for_each的解决方案。

3.2. Reductions

缩减算法使用二元操作将输入序列缩减为单个值。例如，使用加法操作将数字数组的总和减小为一个值。同样，使用一个接受两个输入并返回最大值的运算符来减少数组以获得最大值。使用thrust::reduce实现数组的总和如下：

int sum = thrust::reduce(D.begin(), D.end(), (int) 0, thrust::plus<int>());

reduce 的前两个参数定义了值的范围，第三个和第四个参数分别提供了初始值和约简运算符。实际上，当没有提供初始值或运算符时，这种类型的约简是如此常见，以至于它是默认选择。因此，以下三行代码等效：

int sum = thrust::reduce(D.begin(), D.end(), (int) 0, thrust::plus<int>());
int sum = thrust::reduce(D.begin(), D.end(), (int) 0);
int sum = thrust::reduce(D.begin(), D.end());

虽然thrust::reduce足以实现各种减少，但是Thrust提供了一些额外的函数以便使用（就像STL一样）。例如，thrust::count返回给定序列中特定值的实例数：

#include <thrust/count.h>
#include <thrust/device_vector.h>
...
// 在device_vector中放入三个1
thrust::device_vector<int> vec(5,0);
vec[1] = 1;
vec[3] = 1;
vec[4] = 1;

// 数1
int result = thrust::count(vec.begin(), vec.end(), 1);
// 返回3

其他的归约操作包括 thrust::count_if、thrust::min_element、thrust::max_element、thrust::is_sorted、thrust::inner_product以及其他一些。请参考文档以获得完整的列表。对应的函数查找可以这个网站

    // 输入尺寸
    size_t N = 10;

    // 一些定义
    typedef thrust::device_vector<int> Vector;

    // 为阵列分配存储空间
    Vector values(N);

    // 初始化数组为[0,1,2，…]］
    thrust::sequence(values.begin(), values.end());
    size_t N_odd = thrust::count_if(values.begin(), values.end(), is_odd<int>());
	Vector::iterator   h_min = thrust::min_element(values.begin(), values.end());
    Vector::iterator   h_max = thrust::max_element(values.begin(), values.end());
    thrust::sort(values.begin(), values.end());
    bool result = thrust::is_sorted(thrust::device, values.begin(), values.end());


	float vec1[3] = {
    
    1.0f, 2.0f, 5.0f};
	float vec2[3] = {
    
    4.0f, 1.0f, 5.0f};
    float result = thrust::inner_product(thrust::host, vec1, vec1 + 3, vec2, 0.0f);//两个向量的点积

在这里插入图片描述

Transformations 部分的 SAXPY 示例展示了如何使用 kernel fusion 来减少转换核使用的内存传输次数。对于 reduction kernels，我们也可以使用 thrust::transform_reduce 来应用 kernel fusion。考虑下面的示例，它计算向量的范数。

//随意定义想要的归约方式
struct variance: std::unary_function<float, float>
{
    
    
	variance(float m): mean(m){
    
     }
	const float mean;
	__host__ __device__ float operator()(float data) const
	{
    
    
		return ::pow(data - mean, 2.0f);
	}
};
//需要提前通过reduce函数求和，从而获得均值mean
float variance = thrust::transform_reduce(dev_ptr,dev_ptr + N,variance(mean),0.0f,thrust::plus<float>()) / N;

3.3 Prefix-Sums

并行前缀和（Parallel prefix-sums）或扫描操作是许多并行算法（如流压缩和基数排序）中的重要构建块。考虑以下源代码，它演示了使用默认加法运算符的包容扫描操作：

#include <thrust/scan.h>

int data[6] = {
    
    1, 0, 2, 2, 1, 3};

thrust::inclusive_scan(data, data + 6, data); // 就地扫描

// data is now {1, 1, 3, 5, 6, 9}

在一个包含的扫描中，输出的每个元素都是输入范围对应的部分和。例如，data[2] = data[0] + data[1] + data[2]。一个独占扫描类似，但向右移动一位：

#include <thrust/scan.h>

int data[6] = {
    
    1, 0, 2, 2, 1, 3};

thrust::exclusive_scan(data, data + 6, data); // 就地扫描

// data is now {0, 1, 1, 3, 5, 6}

因此，现在 data[2] = data[0] + data[1]。正如这些示例所示，inclusive_scan和exclusive_scan可以原地执行。Thrust还提供了transform_inclusive_scan和transform_exclusive_scan函数，在执行扫描之前将一元函数应用于输入序列。有关扫描变体的完整列表，请参阅文档。

#include <thrust/transform_scan.h>
int data[6] = {
    
    1, 0, 2, 2, 1, 3};
thrust::negate<int> unary_op;
thrust::plus<int> binary_op;
thrust::transform_inclusive_scan(data, data + 6, data, unary_op, binary_op); // in-place scan
// data is now {-1, -1, -3, -5, -6, -9}


#include <thrust/transform_scan.h>
int data[6] = {
    
    1, 0, 2, 2, 1, 3};
thrust::negate<int> unary_op;
thrust::plus<int> binary_op;
thrust::transform_exclusive_scan(data, data + 6, data, unary_op, 4, binary_op); // in-place scan
// data is now {4, 3, 3, 1, -1, -2}

3.4 Reordering

Thrust通过以下算法提供对分区和流压缩的支持：

copy_if：复制通过谓词测试的元素

#include <thrust/copy.h>
...
struct is_even
{
    
    
  __host__ __device__
  bool operator()(const int x)
  {
    
    
    return (x % 2) == 0;
  }
};
...
const int N = 6;
int V[N] = {
    
    -2, 0, -1, 0, 1, 2};
int result[4];
thrust::copy_if(V, V + N, result, is_even());
// V remains {-2, 0, -1, 0, 1, 2}
// result is now {-2, 0, 0, 2}

partition：根据谓词重新排列元素（真值排在假值之前），下面的代码片段演示了如何使用分区对序列进行重新排序，使偶数位于奇数之前。

#include <thrust/partition.h>
...
struct is_even
{
    
    
  __host__ __device__
  bool operator()(const int &x)
  {
    
    
    return (x % 2) == 0;
  }
};
...
int A[] = {
    
    0, 1, 0, 1, 0, 1, 0, 1, 0,  1};
int S[] = {
    
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
const int N = sizeof(A)/sizeof(int);
thrust::partition(A, A + N, S, is_even());
// A 修改后值{1, 1, 1, 1, 1, 0, 0, 0, 0, 0}
// S 未修改原值

remove和remove_if：删除未通过谓词测试的元素

#include <thrust/remove.h>
...
const int N = 6;
int A[N] = {
    
    3, 1, 4, 1, 5, 9};
int *new_end = thrust::remove(A, A + N, 1);
// The first four values of A are now {3, 4, 5, 9}
// Values beyond new_end are unspecified



#include <thrust/remove.h>
...
struct is_even
{
    
    
  __host__ __device__
  bool operator()(const int x)
  {
    
    
    return (x % 2) == 0;
  }
};
...
const int N = 6;
int A[N] = {
    
    1, 4, 2, 8, 5, 7};
int *new_end = thrust::remove_if(A, A + N, is_even());
// The first three values of A are now {1, 5, 7}
// Values beyond new_end are unspecified

unique：在序列内删除连续的重复项

#include <thrust/unique.h>
...
const int N = 7;
int A[N] = {
    
    1, 3, 3, 3, 2, 2, 1};
int *new_end = thrust::unique(A, A + N);
// The first four values of A are now {1, 3, 2, 1}
// Values beyond new_end are unspecified.

有关重新排序函数及其用法示例的完整列表，请参阅文档。

3.5 Sorting

Thrust提供了几个函数来根据给定的条件对数据进行排序或重新排列。thrust::sort和thrust::stable_sort函数是STL中sort和stable_sort的直接模拟。

#include <thrust/sort.h>
...
const int N = 6;
int A[N] = {
    
    1, 4, 2, 8, 5, 7};
thrust::sort(A, A + N);
// A is now {1, 2, 4, 5, 7, 8}


//下面的代码片段演示了如何使用sort对整数序列进行排序

#include <thrust/sort.h>
...
const int N = 6;
int A[N] = {
    
    1, 4, 2, 8, 5, 7};
thrust::stable_sort(A, A + N);
// A is now {1, 2, 4, 5, 7, 8}

此外，Thrust 还提供了 thrust::sort_by_key 和 thrust::stable_sort_by_key 函数，它们可以对分别存储在不同位置的键-值对进行排序。

#include <thrust/sort.h>
...
const int N = 6;
int    keys[N] = {
    
      1,   4,   2,   8,   5,   7};
char values[N] = {
    
    'a', 'b', 'c', 'd', 'e', 'f'};
thrust::sort_by_key(keys, keys + N, values);
// keys is now   {  1,   2,   4,   5,   7,   8}
// values is now {'a', 'c', 'b', 'e', 'f', 'd'}


#include <thrust/sort.h>
#include <thrust/execution_policy.h>
...
const int N = 6;
int    keys[N] = {
    
      1,   4,   2,   8,   5,   7};
char values[N] = {
    
    'a', 'b', 'c', 'd', 'e', 'f'};
thrust::stable_sort_by_key(thrust::host, keys, keys + N, values);
// keys is now   {  1,   2,   4,   5,   7,   8}
// values is now {'a', 'c', 'b', 'e', 'f', 'd'}

像STL中一样，排序函数也接受用户定义的比较运算符：

#include <thrust/sort.h>
#include <thrust/functional.h>
...
const int N = 6;
int A[N] = {
    
    1, 4, 2, 8, 5, 7};
thrust::stable_sort(A, A + N, thrust::greater<int>());
// A is now {8, 7, 5, 4, 2, 1}

4. Fancy Iterators

高级迭代器（fancy iterators）可以执行各种有价值的功能。在本节中，我们将展示如何使用高级迭代器使用标准Thrust算法攻击更广泛的问题类别。对于熟悉Boost C++库的人，请注意我们的高级迭代器的灵感来自于（并且通常源于）Boost迭代器库。

thrust工程化学习(一)----点云并行CUDA快速入门指南