Getting Started with PyTorch (Autograd and Defining Models): Notes Translated from the Official Documentation

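I was asked about some PyTorch basics in an interview a while ago, so I am writing them down here. Most of this is translated from the official PyTorch documentation.
If you use PyTorch, I recommend reading the official documentation regularly. It contains many examples, which makes it quick to get started.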

Tensor

Tensor: an n-dimensional array that can use a GPU to accelerate its computations.

torch.from_numpy converts a NumPy array into a Tensor.
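As a small illustration of the conversion, here is a minimal sketch. The array values and the variable names (arr, t, t_dev) are made up for this example. Note that torch.from_numpy returns a Tensor that shares memory with the NumPy array, so an in-place change to one is visible through the other.

```python
import numpy as np
import torch

arr = np.array([[1.0, 2.0], [3.0, 4.0]])  # float64 NumPy array (example data)
t = torch.from_numpy(arr)                 # Tensor sharing memory with arr

arr[0, 0] = 10.0                          # in-place change to the NumPy array
shared_value = t[0, 0].item()             # the change is visible through the Tensor

# Move the Tensor to a GPU only if one is available; otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t_dev = t.to(device)
print(t.dtype, shared_value, t_dev.device)
```

If you need an independent copy rather than a shared view, torch.tensor(arr) copies the data instead.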

Autograd

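Automatic differentiation: you write the forward pass, and the backward pass no longer has to be implemented by hand.

When you use autograd, the forward pass defines a computational graph. The nodes of the graph are Tensors, and the edges are the functions that produce output Tensors from input Tensors.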

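A Variable has three important attributes: data, grad and grad_fn. data holds the underlying values. grad_fn records the operation that produced the Variable, for example whether it came from an addition, a subtraction, a multiplication or a division. grad holds the gradient of the Variable computed by backpropagation. (Since PyTorch 0.4, Variable has been merged into Tensor, so these attributes live directly on Tensors.)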
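To make these attributes concrete, here is a minimal sketch on plain Tensors. The tensors a, b and c are hypothetical names chosen for this example.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)  # leaf tensor created by the user
b = a * 2                                               # produced by a multiplication
c = b.sum()                                             # produced by a sum, scalar result

# A leaf tensor has no grad_fn; results of operations record how they were made.
print(a.grad_fn)                  # None for a leaf tensor
print(type(b.grad_fn).__name__)   # name of the multiplication's backward node
print(type(c.grad_fn).__name__)   # name of the sum's backward node

c.backward()                      # fills in a.grad with dc/da
print(a.grad)                     # each element contributes with factor 2
```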

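import torch
x, y = torch.randn(64, 1000), torch.randn(64, 10)  # example inputs and targets (assumed shapes)
w1 = torch.randn(1000, 100, requires_grad=True)    # weights tracked by autograd
w2 = torch.randn(100, 10, requires_grad=True)
y_pred = x.mm(w1).clamp(min=0).mm(w2)              # forward pass: linear, ReLU, linear
loss = (y_pred - y).pow(2).sum()                   # scalar squared-error loss
print(loss.item())
loss.backward()                                    # fills w1.grad and w2.grad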

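loss.backward() performs the automatic differentiation. You do not need to spell out which function is differentiated with respect to which: this single call computes the gradient of loss with respect to every Tensor that requires a gradient.

After the call, the gradient of a Tensor is available in its .grad attribute, for example x.grad for a Tensor x created with requires_grad=True.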

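The example above differentiates a scalar. For a non-scalar Tensor y, pass backward() a Tensor with the same shape as y. It weights each element of y, and the result is a vector-Jacobian product: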
y.backward(torch.FloatTensor([1,0.1,0.01]))
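A minimal sketch of what that call computes, assuming a simple element-wise function y = 2x chosen only for illustration. The Jacobian of y with respect to x is then 2 times the identity, so the resulting gradient is 2 times the vector passed to backward().

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2                               # y is a vector, not a scalar

v = torch.tensor([1.0, 0.1, 0.01])      # one weight per element of y
y.backward(v)                           # computes v^T J, where J = dy/dx = 2 * I

print(x.grad)                           # equals 2 * v
```

Calling y.backward() without an argument would raise an error here, because an implicit gradient can only be created for scalar outputs.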

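You can also define your own autograd operations.

Defining new autograd functions

Every primitive autograd operator is really two functions that operate on Tensors: a forward function and a backward function.

The forward function computes output Tensors from input Tensors. The backward function receives the gradient of the loss with respect to the outputs and computes the gradient with respect to the inputs.

You can define your own autograd operator by subclassing torch.autograd.Function and implementing forward and backward.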

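import torch


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward
    passes, which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for the backward computation.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive the gradient of the loss with respect
        to the output and compute the gradient of the loss with respect to the
        input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input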
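A custom Function is invoked through its apply method rather than by instantiating the class. The sketch below repeats the MyReLU definition so that it runs on its own, then compares its gradient with the built-in torch.relu on a small made-up input (the input values are an assumption for illustration).

```python
import torch


class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)      # keep the input for the backward pass
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0         # no gradient flows where the input was negative
        return grad_input


x = torch.tensor([-1.0, 0.5, 2.0], requires_grad=True)
y = MyReLU.apply(x)                       # call the custom op via .apply
y.sum().backward()

# Reference: the built-in ReLU on an independent copy of the same values.
x_ref = x.detach().clone().requires_grad_()
torch.relu(x_ref).sum().backward()

print(y, x.grad, x_ref.grad)
```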

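Dynamic graphs

Unlike TensorFlow's static graphs, PyTorch uses dynamic graphs.

In TensorFlow you define the computational graph once and then execute that same graph over and over. In PyTorch, every forward pass defines a new computational graph.

One difference between dynamic and static graphs is control flow. For some models we want to perform a different computation for each data point. For example, an RNN may be unrolled for a different number of time steps for each data point, and this unrolling can be implemented as a loop. With a static graph, the loop construct has to be part of the graph, and TensorFlow provides operators such as tf.scan for embedding loops into it. With dynamic graphs this is simpler: since we build the graph on the fly for each example, we can use ordinary imperative flow control to perform a computation that differs for each input.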
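The sketch below illustrates this with an ordinary Python for loop over sequences of different lengths. It is only an illustration under assumed sizes: the encode helper, the sequence lengths and the layer sizes are hypothetical, and torch.nn.RNNCell stands in for whatever recurrent cell a real model would use.

```python
import torch

torch.manual_seed(0)
input_size, hidden_size = 3, 4
cell = torch.nn.RNNCell(input_size, hidden_size)


def encode(sequence):
    """Run the cell over one sequence of shape (time_steps, input_size)."""
    h = torch.zeros(1, hidden_size)        # initial hidden state, batch size 1
    for x_t in sequence:                   # plain Python loop; its length depends on the data
        h = cell(x_t.unsqueeze(0), h)      # same weights reused at every time step
    return h


short_seq = torch.randn(2, input_size)     # 2 time steps
long_seq = torch.randn(7, input_size)      # 7 time steps

h_short = encode(short_seq)                # builds a graph with 2 cell applications
h_long = encode(long_seq)                  # builds a graph with 7 cell applications

loss = h_short.sum() + h_long.sum()
loss.backward()                            # gradients flow through both unrolled loops
print(h_short.shape, h_long.shape, cell.weight_ih.grad.shape)
```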

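The nn module

In practice, defining everything with raw autograd operations quickly becomes tedious. Instead you can use the ready-made building blocks in the nn package.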

# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

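Before the backward pass, we reset the gradients to zero by calling model.zero_grad() (or .grad.zero_() on an individual Tensor). We need to do this because PyTorch accumulates gradients: the next time we call .backward() on the loss, the new gradient values are added to the existing ones, which can lead to unexpected results.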
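The accumulation is easy to observe on a single Tensor. This is a minimal sketch with made-up values; w and the expression w * 3 are hypothetical and chosen so that the gradient of each element is exactly 3.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()            # gradient after one backward pass

(w * 3).sum().backward()          # no reset: the new gradient is added to the old one
accumulated = w.grad.clone()

w.grad.zero_()                    # reset in place
(w * 3).sum().backward()
after_reset = w.grad.clone()      # back to the value of a single pass

print(first, accumulated, after_reset)
```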

optim

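So far we have updated the weights of the model by manually modifying the Tensors that hold the learnable parameters (using torch.no_grad() or .data to avoid tracking history in autograd).

This is not a big burden for simple algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers like AdaGrad, RMSProp, Adam, etc. For more complex models it is better to use the optimization algorithms provided by the optim package.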

# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Define an optimizer that will update the weights of the model for us.
# Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.

learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

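    # Before the backward pass, use the optimizer to zero the gradients of
    # all the variables it will update (the learnable weights of the model).
    # This is needed because, by default, gradients accumulate in buffers
    # whenever .backward() is called.
    optimizer.zero_grad()

    # Backward pass: compute the gradient of the loss with respect to the model parameters
    loss.backward()

    # Calling step() on the optimizer updates the weights
    optimizer.step()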

Custom nn modules

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred
        
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)
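A custom module is trained exactly like the built-in ones. The sketch below repeats a compact version of TwoLayerNet so that it runs on its own, and trains it with plain SGD. The choice of torch.optim.SGD, the learning rate, the fixed seed and the 100 iterations are assumptions made for this illustration.

```python
import torch


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        # linear -> ReLU (via clamp) -> linear
        return self.linear2(self.linear1(x).clamp(min=0))


torch.manual_seed(0)
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = TwoLayerNet(D_in, H, D_out)
loss_fn = torch.nn.MSELoss(reduction='sum')
# model.parameters() collects the weights and biases of both nn.Linear members.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

losses = []
for t in range(100):
    y_pred = model(x)              # forward pass through the custom module
    loss = loss_fn(y_pred, y)
    losses.append(loss.item())

    optimizer.zero_grad()          # clear gradients from the previous step
    loss.backward()                # compute new gradients
    optimizer.step()               # update the weights

print(losses[0], losses[-1])
```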

Defining a forward method

For example, this GRU-based encoder:

import torch
import torch.nn as nn


class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = embedding

        # Initialize GRU; the input_size and hidden_size params are both set to 'hidden_size'
        #   because our input size is a word embedding with number of features == hidden_size
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout), bidirectional=True)

    def forward(self, input_seq, input_lengths, hidden=None):
        # Convert word indexes to embeddings
        embedded = self.embedding(input_seq)
        # Pack padded batch of sequences for RNN module
        packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)
        # Forward pass through GRU
        outputs, hidden = self.gru(packed, hidden)
        # Unpack padding
        outputs, _ = torch.nn.utils.rnn.pad_packed_sequence(outputs)
        # Sum bidirectional GRU outputs
        outputs = outputs[:, :, :self.hidden_size] + outputs[:, :, self.hidden_size:]
        # Return output and final hidden state
        return outputs, hidden
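To see what the pack, GRU, unpack and sum steps of this forward method do to tensor shapes, here is a minimal standalone sketch. The vocabulary size, hidden size, token indices and sequence lengths are all made up for illustration. The input is time-major with shape (max_len, batch), padded with index 0, and the sequences are sorted by decreasing length, which pack_padded_sequence expects by default.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
hidden_size = 8
embedding = nn.Embedding(20, hidden_size, padding_idx=0)    # hypothetical vocabulary of 20 tokens
gru = nn.GRU(hidden_size, hidden_size, bidirectional=True)

# Two sequences of lengths 4 and 2, laid out as (max_len=4, batch=2).
input_seq = torch.tensor([[1, 4],
                          [2, 5],
                          [3, 0],
                          [6, 0]])
input_lengths = torch.tensor([4, 2])                         # CPU tensor, sorted descending

embedded = embedding(input_seq)                              # (4, 2, 8)
packed = pack_padded_sequence(embedded, input_lengths)       # padding steps are skipped
outputs, hidden = gru(packed)
outputs, out_lengths = pad_packed_sequence(outputs)          # (4, 2, 16): forward and backward halves
summed = outputs[:, :, :hidden_size] + outputs[:, :, hidden_size:]  # (4, 2, 8)

print(summed.shape, hidden.shape, out_lengths)
```

After unpacking, the positions past the end of the shorter sequence are filled with zeros, so the padded time steps do not contribute to the summed outputs.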

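Control flow and weight sharing

Suppose a model randomly uses between 1 and 4 hidden layers on each forward pass and reuses the same weights for the repeated layers. With PyTorch's dynamic graphs this is easy to implement: an ordinary Python loop is enough, which is simpler than the TensorFlow equivalent.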


import random
import torch


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred
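The following sketch repeats a compact DynamicNet so that it runs on its own and takes a few training steps. It shows that the number of parameters stays fixed no matter how many times the middle layer is applied. The seeds, the plain SGD optimizer, the learning rate and the 20 iterations are assumptions for this illustration.

```python
import math
import random
import torch


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        h_relu = self.input_linear(x).clamp(min=0)
        # Apply the same middle layer 0 to 3 times; the weights are shared.
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        return self.output_linear(h_relu)


random.seed(0)
torch.manual_seed(0)
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = DynamicNet(D_in, H, D_out)
# Three Linear layers, each with a weight and a bias, regardless of depth.
n_params = len(list(model.parameters()))

loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

losses = []
for t in range(20):
    y_pred = model(x)              # a new graph, possibly of different depth, each time
    loss = loss_fn(y_pred, y)
    losses.append(loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(n_params, y_pred.shape, losses[-1])
```

Because the depth changes from step to step, the loss curve is noisier than for a fixed architecture.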

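Training the model

Once the model is defined, training follows the same process as implementing gradient descent by hand:

Generate predictions
Compute the loss
Compute the gradients of the loss with respect to the weights and biases
Adjust the weights by subtracting a small quantity proportional to the gradient
Reset the gradients to zero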
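The five steps above can be sketched as a manual gradient descent loop on a small linear regression problem. The synthetic data, the true weights true_w, the learning rate and the number of epochs are all assumptions made for this illustration.

```python
import torch

torch.manual_seed(0)
x = torch.randn(100, 3)                          # synthetic inputs
true_w = torch.tensor([[2.0], [-1.0], [0.5]])    # weights used to generate the targets
y = x @ true_w + 0.3                             # targets with a bias of 0.3

w = torch.zeros(3, 1, requires_grad=True)        # learnable weights
b = torch.zeros(1, requires_grad=True)           # learnable bias
lr = 0.1

losses = []
for epoch in range(200):
    y_pred = x @ w + b                           # 1. generate predictions
    loss = ((y_pred - y) ** 2).mean()            # 2. compute the loss
    loss.backward()                              # 3. gradients w.r.t. weights and bias
    with torch.no_grad():                        # 4. subtract a small multiple of the gradient
        w -= lr * w.grad
        b -= lr * b.grad
    w.grad.zero_()                               # 5. reset the gradients to zero
    b.grad.zero_()
    losses.append(loss.item())

print(losses[0], losses[-1])
print(w.flatten(), b)
```

With an optimizer from torch.optim, step 4 becomes optimizer.step() and step 5 becomes optimizer.zero_grad().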