1. CPU and GPU


2. Data Migration

Data can be moved between the CPU and the GPU.
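Method used for the transfer: the to() function

Objects that can be transferred: Tensor and Module

2.1 The to function

The to function converts the data type and/or the device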

tensor.to(*args, **kwargs)
module.to(*args, **kwargs)

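Difference: tensor.to() is not in-place, while module.to() is in-place
Calling to() on a tensor builds and returns a new tensor whenever a conversion is actually needed (if the tensor already has the requested dtype and device, the same tensor is returned), whereas calling to() on a module modifies the module itself and returns it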
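
The difference can be checked through object identity. The following is a minimal sketch that runs on the CPU and uses a dtype conversion in place of a device move, so it does not assume a GPU is present. The variable names are illustrative only, and it assumes the default dtype is float32.

```python
import torch
import torch.nn as nn

# Tensor: a dtype conversion returns a new tensor object.
x = torch.ones((3, 3))                # float32 by default
y = x.to(torch.float64)
tensor_is_same_object = y is x        # False: a new tensor was built

# If nothing needs to change, to() hands back the very same tensor.
z = x.to(torch.float32)
noop_is_same_object = z is x          # True

# Module: to() converts the parameters in place and returns the module itself.
linear = nn.Linear(2, 2)
returned = linear.to(torch.float64)
module_is_same_object = returned is linear   # True

print(tensor_is_same_object, noop_is_same_object, module_is_same_object)
print(x.dtype, y.dtype, linear.weight.dtype)
```

This is why the tensor examples below reassign the result (x = x.to(...)) while the module examples call to() without using the return value.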

2.1.1 tensor.to()

Using to() on a tensor:

# Convert the data type with to()
x = torch.ones((3, 3))
x = x.to(torch.float64)

# Move the tensor to a device with to()
x = torch.ones((3, 3))
x = x.to("cuda")

Code example:

# -*- coding: utf-8 -*-

import torch
import torch.nn as nn
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# ========================== tensor to cuda
# flag = 0
flag = 1
if flag:
    x_cpu = torch.ones((3, 3))
    print("x_cpu:\ndevice: {} is_cuda: {} id: {}".format(x_cpu.device, x_cpu.is_cuda, id(x_cpu)))

    x_gpu = x_cpu.to(device)
    print("x_gpu:\ndevice: {} is_cuda: {} id: {}".format(x_gpu.device, x_gpu.is_cuda, id(x_gpu)))


# x_gpu = x_cpu.cuda()  # older style; .to(device) is preferred because it also works when device is the CPU


2.1.2 module.to()

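Using to() on a module:

# Convert the data type with to()
linear = nn.Linear(2, 2)
linear.to(torch.double)

# Move the module to a device with to()
gpu1 = torch.device("cuda")
linear.to(gpu1)

Notes:
Converting the data type of a module converts all parameters of the network layers to the given type
Moving a module works the same way as moving a tensor: it is transferred to the given device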

# -*- coding: utf-8 -*-

import torch
import torch.nn as nn
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# ========================== module to cuda
# flag = 0
flag = 1
if flag:
    net = nn.Sequential(nn.Linear(3, 3))

    print("\nid:{} is_cuda: {}".format(id(net), next(net.parameters()).is_cuda))

    net.to(device)
    print("\nid:{} is_cuda: {}".format(id(net), next(net.parameters()).is_cuda))

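The printed output shows that the module's id (its memory address) is the same before and after the move, which confirms that module.to() operates in place

2.2 Common torch.cuda functions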


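torch.cuda.device_count(): returns the number of visible, available GPUs
torch.cuda.get_device_name(): returns the name of a GPU
torch.cuda.manual_seed(): sets the random seed for the current GPU
torch.cuda.manual_seed_all(): sets the random seed for all visible, available GPUs
torch.cuda.set_device(): sets which GPU is the main (current) GPU (not recommended)
Recommended instead: os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2,3")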
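
A short sketch of how these calls fit together. It assumes nothing about the machine: the call that needs an actual GPU is guarded, and the seed value 1 is arbitrary.

```python
import torch

n_gpu = torch.cuda.device_count()     # 0 on a CPU-only machine
print("visible GPUs:", n_gpu)

# Safe to call without CUDA; it is silently ignored in that case.
torch.cuda.manual_seed_all(1)

if torch.cuda.is_available():
    for idx in range(n_gpu):
        # idx is an index into the visible devices, not a physical slot number
        print(idx, torch.cuda.get_device_name(idx))
```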

2.2.1 Logical GPUs and physical GPUs

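Physical GPU: a GPU card that is actually installed in the machine
Logical GPU: a GPU made visible to the Python script; the number of logical GPUs is less than or equal to the number of physical GPUs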

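os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2,3")
After this call, the number of logical GPUs equals the number of GPUs made visible, here 2, and "2,3" specifies which physical GPUs they correspond to: logical GPU 0 maps to physical GPU 2, and logical GPU 1 maps to physical GPU 3. Note that setdefault() only takes effect if the variable is not already set, and the variable has to be set before CUDA is initialized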
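
The mapping can be spelled out in a few lines of plain Python. This is only an illustration of the index bookkeeping: logical_to_physical is a hypothetical helper, not part of PyTorch or CUDA, and it assumes the variable holds integer indices.

```python
def logical_to_physical(visible_devices):
    """Map logical GPU index -> physical GPU index for a CUDA_VISIBLE_DEVICES string."""
    ids = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    return dict(enumerate(ids))


mapping = logical_to_physical("2,3")
print(mapping)   # {0: 2, 1: 3}
```

Inside the script, "cuda:0" therefore refers to physical GPU 2 and "cuda:1" to physical GPU 3.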

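Among the logical GPUs, one is usually designated the main GPU, by default GPU 0. The role of the main GPU is tied to the distribute-and-parallelize mechanism used for multi-GPU computation

3. Multi-GPU Parallel Computation

3.1 The distribute-and-parallelize mechanism of multi-GPU computation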

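Multi-GPU parallel computation, by analogy:
Suppose Xiao Ming has 4 assignments to do, each taking 60 minutes
Single GPU: Xiao Ming does them all alone, which takes 240 minutes
Multiple GPUs: Xiao Ming spends 3 minutes finding classmates and handing out the assignments evenly, the assignments are done in parallel in 60 minutes, and he spends 3 minutes collecting the finished work, for a total of 66 minutes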
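
The arithmetic of the analogy, written out as a small sketch. The function names are hypothetical, and the parallel case assumes one assignment per helper.

```python
def serial_minutes(n_tasks, minutes_per_task):
    # one worker does every task, one after another
    return n_tasks * minutes_per_task


def parallel_minutes(minutes_per_task, scatter_minutes, gather_minutes):
    # one task per worker, all workers run at the same time
    return scatter_minutes + minutes_per_task + gather_minutes


single = serial_minutes(4, 60)
multi = parallel_minutes(60, 3, 3)
print(single, multi)   # 240 66
```

The speed-up is 240 / 66, roughly 3.6 rather than 4, because handing out and collecting the work is overhead that the single-worker case does not pay.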

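The multi-GPU distribute-and-parallelize mechanism in PyTorch:
Each batch of training data is first split evenly and distributed to the GPUs, each GPU then computes on its share in parallel, and finally the results are gathered back onto the main GPU, which by default is the first of the visible GPUs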

3.2 torch.nn.DataParallel

torch.nn.DataParallel(module, 
					  device_ids=None,
					  output_device=None, 
					  dim=0)

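Function: wraps a model to implement the distribute-and-parallelize mechanism

Main parameters:
• module: the model to wrap and distribute
• device_ids: the GPUs to distribute to; defaults to all visible, available GPUs
• output_device: the device on which the results are gathered; defaults to device_ids[0]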

# -*- coding: utf-8 -*-

import os
import numpy as np
import torch
import torch.nn as nn

# ============================ manually select GPUs
# flag = 0
flag = 1
if flag:

    gpu_list = [0]
    gpu_list_str = ','.join(map(str, gpu_list))
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# ============================ choose the main GPU automatically by free memory
# flag = 0
flag = 1
if flag:
    def get_gpu_memory():
        import platform
        if 'Windows' != platform.system():
            import os
            os.system('nvidia-smi -q -d Memory | grep -A4 GPU | grep Free > tmp.txt')
            memory_gpu = [int(x.split()[2]) for x in open('tmp.txt', 'r').readlines()]
            os.system('rm tmp.txt')
        else:
            memory_gpu = False
            print("The free-memory query is not supported on Windows yet")
        return memory_gpu


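    gpu_memory = get_gpu_memory()
    if gpu_memory:  # proceed only if the free-memory query returned values
        print("\ngpu free memory: {}".format(gpu_memory))
        gpu_list = np.argsort(gpu_memory)[::-1]  # GPU indices, most free memory first

        gpu_list_str = ','.join(map(str, gpu_list))
        # setdefault() keeps an existing value, e.g. one set by the manual block above
        os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")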


class FooNet(nn.Module):
    def __init__(self, neural_num, layers=3):
        super(FooNet, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])

    def forward(self, x):

        print("\nbatch size in forward: {}".format(x.size()[0]))

        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.relu(x)
        return x


if __name__ == "__main__":

    batch_size = 16

    # data
    inputs = torch.randn(batch_size, 3)
    labels = torch.randn(batch_size, 3)

    inputs, labels = inputs.to(device), labels.to(device)

    # model
    net = FooNet(neural_num=3, layers=3)
    net = nn.DataParallel(net)           # wrap the custom model with nn.DataParallel
    net.to(device)

    # training
    for epoch in range(1):

        outputs = net(inputs)

        print("model outputs.size: {}".format(outputs.size()))

    print("CUDA_VISIBLE_DEVICES :{}".format(os.environ["CUDA_VISIBLE_DEVICES"]))
    print("device_count :{}".format(torch.cuda.device_count()))


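With two GPUs:
Since the total batch_size is 16, each GPU receives a batch of 8

With four GPUs:
The script sorts the GPUs by free memory and lists them in that order in CUDA_VISIBLE_DEVICES, so the GPU with the most free memory becomes the main GPU
Since the total batch_size is 16, each GPU receives a batch of 4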
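
The per-GPU batch sizes follow from splitting the batch along dim 0. The helper below is a pure-Python sketch of that split, on the assumption that it behaves like torch.chunk, i.e. chunks of size ceil(batch_size / n_gpu) with a smaller last chunk if needed. split_sizes is a hypothetical name, not a PyTorch function.

```python
import math


def split_sizes(batch_size, n_gpu):
    """Sizes of the chunks a batch is cut into, torch.chunk style."""
    chunk = math.ceil(batch_size / n_gpu)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


print(split_sizes(16, 2))   # [8, 8]
print(split_sizes(16, 4))   # [4, 4, 4, 4]
print(split_sizes(10, 4))   # [3, 3, 3, 1]
```

The last line shows that a batch size that is not a multiple of the GPU count leads to an uneven load under this assumption.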

3.3 Querying the free GPU memory

def get_gpu_memory():
	import os
	os.system('nvidia-smi -q -d Memory | grep -A4 GPU | grep Free > tmp.txt')
	memory_gpu = [int(x.split()[2]) for x in open('tmp.txt', 'r').readlines()]
	os.system('rm tmp.txt')
	return memory_gpu

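Notes:
nvidia-smi: NVIDIA's query command
-q: query mode
-d: which information to display (here Memory)
grep: search
-A4: show the matching line and the 4 lines after it
grep Free: keep the lines that report free memory
> tmp.txt: redirect the result into a temporary txt file
rm tmp.txt: delete the temporary txt file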
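
An alternative sketch that avoids the temporary file and the grep pipeline by asking nvidia-smi for the free-memory column directly. query_free_memory and parse_free_memory are hypothetical helper names, and the sample string is stand-in text rather than real command output.

```python
import shutil
import subprocess


def parse_free_memory(text):
    """Turn lines such as '10362' into a list of ints (MiB per GPU)."""
    return [int(line) for line in text.splitlines() if line.strip()]


def query_free_memory():
    """Return free memory per GPU in MiB, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return parse_free_memory(result.stdout)


sample = "10362\n10058\n9990\n9990\n"   # stand-in for real command output
print(parse_free_memory(sample))
```

Returning an empty list when the command is missing lets the caller use a plain truthiness check before sorting.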

Example:
gpu_memory = get_gpu_memory()                               # get the free memory of each GPU
gpu_list = np.argsort(gpu_memory)[::-1]                     # sort GPU indices by free memory, largest first
gpu_list_str = ','.join(map(str, gpu_list))                 # turn the sorted indices into a string
os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str) # set the visible GPUs in that order
print("\ngpu free memory: {}".format(gpu_memory))
print("CUDA_VISIBLE_DEVICES :{}".format(os.environ["CUDA_VISIBLE_DEVICES"]))
>>> gpu free memory: [10362, 10058, 9990, 9990]
>>> CUDA_VISIBLE_DEVICES :0,1,3,2

3.4 Loading GPU models

3.4.1 Common error 1

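If a model is trained on a GPU and saved, loading it on a machine where no GPU is available raises an error. Passing map_location="cpu" to torch.load avoids it

Code example: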

# -*- coding: utf-8 -*-

import os
import numpy as np
import torch
import torch.nn as nn


class FooNet(nn.Module):
    def __init__(self, neural_num, layers=3):
        super(FooNet, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])

    def forward(self, x):

        print("\nbatch size in forward: {}".format(x.size()[0]))

        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.relu(x)
        return x


# =================================== load onto the cpu
# flag = 0
flag = 1
if flag:
    gpu_list = [0]
    gpu_list_str = ','.join(map(str, gpu_list))
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    net = FooNet(neural_num=3, layers=3)
    net.to(device)

    # save
    net_state_dict = net.state_dict()
    path_state_dict = "./model_in_gpu_0.pkl"
    torch.save(net_state_dict, path_state_dict)

    # load
    state_dict_load = torch.load(path_state_dict)
    # state_dict_load = torch.load(path_state_dict, map_location="cpu")
    print("state_dict_load:\n{}".format(state_dict_load))

When a GPU is available, the loaded tensors are shown as residing on the GPU device

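With state_dict_load = torch.load(path_state_dict, map_location="cpu"), the tensors are loaded onto the CPU and no device information is printed, because PyTorch only prints the device of a tensor when it is not on the CPU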
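
The loading pattern can be tried without a GPU. The sketch below runs entirely on the CPU, so it does not reproduce the error itself; it only shows map_location in use and checks where the loaded tensors end up. The in-memory buffer and the small nn.Linear model are stand-ins chosen for illustration.

```python
import io

import torch
import torch.nn as nn

net = nn.Linear(3, 3, bias=False)

# Save the state dict into an in-memory buffer instead of a file.
buffer = io.BytesIO()
torch.save(net.state_dict(), buffer)
buffer.seek(0)

# map_location="cpu" remaps every stored tensor onto the CPU while loading.
state_dict_load = torch.load(buffer, map_location="cpu")
devices = {name: t.device.type for name, t in state_dict_load.items()}
print(devices)

net.load_state_dict(state_dict_load)
```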

3.4.2 Common error 2

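During training, multi-GPU parallel computation was used, i.e. the model was wrapped with torch.nn.DataParallel. As a result, every layer name in the saved state_dict carries an extra "module." prefix, so the names do not match when the state_dict is loaded into an unwrapped model

Code example: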
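
The mismatch can be illustrated with plain lists of key names, without PyTorch. The key names below are what a FooNet with three bias-free linear layers in a ModuleList called linears would be expected to produce; the lists stand in for real state_dict keys.

```python
expected_keys = ["linears.0.weight", "linears.1.weight", "linears.2.weight"]
saved_keys = ["module." + k for k in expected_keys]   # as saved from a DataParallel wrapper

# Names the plain model looks for but cannot find, and names it does not recognise.
missing = [k for k in expected_keys if k not in saved_keys]
unexpected = [k for k in saved_keys if k not in expected_keys]
print("missing:", missing)
print("unexpected:", unexpected)

# Dropping the leading "module." makes the names line up again.
prefix = "module."
fixed_keys = [k[len(prefix):] if k.startswith(prefix) else k for k in saved_keys]
print(fixed_keys == expected_keys)
```

Every key ends up both missing and unexpected, which is why the load fails until the prefix is removed.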

# -*- coding: utf-8 -*-

import os
import numpy as np
import torch
import torch.nn as nn


class FooNet(nn.Module):
    def __init__(self, neural_num, layers=3):
        super(FooNet, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])

    def forward(self, x):

        print("\nbatch size in forward: {}".format(x.size()[0]))

        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.relu(x)
        return x


# =================================== multi-GPU save
flag = 0
# flag = 1
if flag:

    if torch.cuda.device_count() < 2:
        print("Not enough GPUs; please run this in a multi-GPU environment")
        import sys
        sys.exit(0)

    gpu_list = [0, 1, 2, 3]
    gpu_list_str = ','.join(map(str, gpu_list))
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    net = FooNet(neural_num=3, layers=3)
    net = nn.DataParallel(net)
    net.to(device)

    # save
    net_state_dict = net.state_dict()
    path_state_dict = "./model_in_multi_gpu.pkl"
    torch.save(net_state_dict, path_state_dict)

# =================================== multi-GPU load
flag = 0
# flag = 1
if flag:

    net = FooNet(neural_num=3, layers=3)

    path_state_dict = "./model_in_multi_gpu.pkl"
    state_dict_load = torch.load(path_state_dict, map_location="cpu")
    print("state_dict_load:\n{}".format(state_dict_load))

    # net.load_state_dict(state_dict_load)

    # remove module.
    from collections import OrderedDict
    new_state_dict = OrderedDict()
    for k, v in state_dict_load.items():
        namekey = k[7:] if k.startswith('module.') else k
        new_state_dict[namekey] = v
    print("new_state_dict:\n{}".format(new_state_dict))

    net.load_state_dict(new_state_dict)