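I was asked about some PyTorch basics in a past interview, so I am writing them down here. Most of this is translated from the official PyTorch documentation.
PyTorch official documentation
If you use PyTorch, it is worth reading the official documentation: it contains many examples, which makes getting started much faster.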
Tensor
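A tensor is an n-dimensional array that can be placed on a GPU to accelerate computation.
torch.from_numpy converts a NumPy array into a tensor. The resulting tensor shares memory with the array, so an in-place change to one is visible in the other.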
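A minimal sketch of the conversion described above. The array values and variable names are made up for illustration, and the GPU step only runs on a GPU if one is available:

```python
import numpy as np
import torch

a = np.array([[1.0, 2.0], [3.0, 4.0]])   # float64 NumPy array
t = torch.from_numpy(a)                   # tensor sharing memory with `a`

a[0, 0] = 10.0                            # in-place change in NumPy shows up in the tensor
print(t[0, 0].item())

# Use the GPU only if there is one; .to() copies when the device differs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t_dev = t.to(device)
back = t_dev.cpu().numpy()                # a tensor must be on the CPU to convert back
```

Note that from_numpy keeps the NumPy dtype (float64 here), while most PyTorch layers default to float32, so a call such as t.float() is often needed before feeding the tensor to a model.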
Autograd
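Automatic differentiation: you write only the forward pass, and the backward pass does not have to be derived or coded by hand.
When autograd is used, the forward pass defines a computational graph. The nodes of the graph are Tensors, and the edges are the functions that produce output Tensors from input Tensors.
A Variable has three important attributes: data, grad and grad_fn. data holds the value, grad_fn records the operation that produced the Variable (for example an addition or a multiplication), and grad holds the gradient accumulated for it during backpropagation. Since PyTorch 0.4, Variable has been merged into Tensor, so these attributes are available directly on any Tensor.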
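import torch
x, y = torch.randn(64, 1000), torch.randn(64, 10)  # random inputs and targets
w1 = torch.randn(1000, 100, requires_grad=True)    # weights tracked by autograd
w2 = torch.randn(100, 10, requires_grad=True)
y_pred = x.mm(w1).clamp(min=0).mm(w2)              # forward pass: linear, ReLU, linear
loss = (y_pred - y).pow(2).sum()
print(loss.item())
loss.backward()                                    # fills w1.grad and w2.grad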
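loss.backward() is the automatic differentiation step. You do not have to state which function is differentiated with respect to which: this single call computes the gradient of loss with respect to every tensor that requires gradients.
Afterwards the gradient of such a tensor is available through its .grad attribute, for example w1.grad. x.grad is only populated if x itself was created with requires_grad=True.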
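To make the grad and grad_fn attributes concrete, here is a small sketch on a three-element tensor. The values are chosen only so that the gradient is easy to check by hand:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)  # leaf tensor created by the user
z = (x * 2).sum()                                       # z = 2*x1 + 2*x2 + 2*x3

print(x.grad_fn)   # None: x was not produced by an operation
print(z.grad_fn)   # a backward node for sum(): z was produced by an operation
print(x.grad)      # None before backward()

z.backward()       # dz/dx_i = 2 for every element
print(x.grad)      # tensor([2., 2., 2.])
```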
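The call above differentiates a scalar. backward() can also be called on a non-scalar tensor, but then it needs a gradient argument of the same shape as that tensor; the result is a vector-Jacobian product rather than a full Jacobian. Here y stands for a three-element output:
y.backward(torch.tensor([1.0, 0.1, 0.01]))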
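A small worked sketch of the non-scalar case, assuming an element-wise function y = x * x so that the expected result can be verified by hand (the gradient of y_i with respect to x_i is 2 * x_i):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * x                      # non-scalar output, dy_i/dx_i = 2 * x_i

v = torch.tensor([1.0, 0.1, 0.01])
y.backward(v)                  # computes the vector-Jacobian product v^T J

print(x.grad)                  # equals v * 2 * x
```

Calling y.backward() without an argument here would raise an error, because an implicit gradient can only be created for scalar outputs.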
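You can also define your own autograd operations.
Defining new autograd functions
Under the hood, each primitive autograd operator is really two functions that operate on Tensors: a forward function and a backward function.
The forward function computes output Tensors from input Tensors. The backward function receives the gradient of the loss with respect to the outputs and computes the gradient with respect to the inputs.
Subclassing torch.autograd.Function lets you define your own autograd operator.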
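class MyReLU(torch.autograd.Function):
    """
    We can implement custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward
    and backward passes.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return a
        Tensor containing the output. ctx is a context object that can be used to
        stash information for the backward computation.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive the gradient of the loss with respect to
        the output and return the gradient with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0  # ReLU passes no gradient where the input was negative
        return grad_input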
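A short usage sketch for the custom Function. The class is repeated without docstrings so the snippet runs on its own, and the input values are illustrative. Custom Functions are called through their .apply method rather than by instantiating the class:

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

x = torch.tensor([-1.0, 0.5, 2.0], requires_grad=True)
out = MyReLU.apply(x)      # forward() runs here
out.sum().backward()       # backward() runs here with grad_output of ones

print(out)                 # negative input clamped to 0
print(x.grad)              # 0 where the input was negative, 1 elsewhere
```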
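Dynamic graphs
Unlike the static graphs of TensorFlow 1.x, PyTorch uses dynamic graphs. (TensorFlow 2 executes eagerly by default.)
With a static graph, you define the computational graph once and then execute that same graph over and over. In PyTorch, every forward pass defines a new computational graph.
One place where the two approaches differ is control flow. For some models we want to perform a different computation for each data point. An RNN, for example, may be unrolled for a different number of time steps for each data point, and this unrolling can be implemented as a loop. With a static graph, the loop has to be part of the graph itself; TensorFlow provides operators such as tf.scan for embedding loops in the graph. With dynamic graphs this is simpler: because the graph is built on the fly for each example, ordinary imperative control flow can be used to perform a computation that differs for each input.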
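As a rough illustration of per-input control flow, the sketch below runs one RNN cell over sequences of different lengths with a plain Python loop. The sizes and the helper name encode are assumptions made for the example and do not come from the original notes:

```python
import torch

torch.manual_seed(0)
cell = torch.nn.RNNCell(input_size=3, hidden_size=5)

def encode(seq):
    """Run the cell over one sequence of shape (T, 3); T may differ per call."""
    h = torch.zeros(1, 5)
    for x_t in seq:                     # ordinary Python loop; the graph is built as it runs
        h = cell(x_t.unsqueeze(0), h)
    return h

short = encode(torch.randn(2, 3))       # 2 time steps
long = encode(torch.randn(7, 3))        # 7 time steps: same code, a different graph
print(short.shape, long.shape)
```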
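The nn module
In practice, building a network directly from raw autograd operations is tedious. The nn package provides ready-made Modules (layers) and loss functions that can be used instead.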
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
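Before each backward pass, the gradients are reset to zero by calling model.zero_grad() (for a single tensor, tensor.grad.zero_()). This is necessary because PyTorch accumulates gradients: the next time .backward() is called on the loss, the new gradient values are added to the ones already stored in .grad, which leads to wrong updates if the old values are not cleared.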
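The accumulation behaviour is easy to see on a single tensor. This is a small sketch with made-up values, where each backward() call contributes a gradient of 3 per element:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()       # tensor([3., 3.])

(w * 3).sum().backward()     # no reset in between
second = w.grad.clone()      # tensor([6., 6.]): the new gradient was added to the old one

w.grad.zero_()               # manual reset of this one tensor's gradient
(w * 3).sum().backward()
third = w.grad.clone()       # tensor([3., 3.]) again

print(first, second, third)
```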
optim
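Up to this point the model's weights have been updated by manually mutating the Tensors that hold the learnable parameters (inside torch.no_grad(), or through .data, so that the update itself is not tracked by autograd).
That is manageable for a simple algorithm such as stochastic gradient descent, but in practice neural networks are often trained with more sophisticated optimizers such as AdaGrad, RMSProp or Adam. The optim package provides implementations of these common optimization algorithms.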
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Define an optimizer that will update the weights of the model for us.
# Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer to zero the gradients of all
    # the variables it will update. This is needed because, by default, gradients
    # accumulate in buffers whenever .backward() is called.
    optimizer.zero_grad()

    # Backward pass: compute the gradient of the loss with respect to the
    # model parameters.
    loss.backward()

    # Calling step() on the optimizer updates the weights.
    optimizer.step()
Custom nn Modules
import torch


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)
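The core of a custom Module is its forward method, which can contain any computation on Tensors.
For example, this GRU-based encoder: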
import torch
import torch.nn as nn


class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = embedding

        # Initialize GRU; the input_size and hidden_size params are both set to 'hidden_size'
        # because our input size is a word embedding with number of features == hidden_size
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout), bidirectional=True)

    def forward(self, input_seq, input_lengths, hidden=None):
        # Convert word indexes to embeddings
        embedded = self.embedding(input_seq)
        # Pack padded batch of sequences for RNN module
        packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)
        # Forward pass through GRU
        outputs, hidden = self.gru(packed, hidden)
        # Unpack padding
        outputs, _ = torch.nn.utils.rnn.pad_packed_sequence(outputs)
        # Sum bidirectional GRU outputs
        outputs = outputs[:, :, :self.hidden_size] + outputs[:, :, self.hidden_size:]
        # Return output and final hidden state
        return outputs, hidden
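The pack / unpack steps in that forward method can be tried in isolation. The sketch below uses made-up sizes (a hypothetical vocabulary of 20 tokens, hidden size 8, a batch of 3 sequences) and follows the same sequence of calls. With the default enforce_sorted=True, pack_padded_sequence expects the lengths in decreasing order:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

hidden_size, max_len, batch = 8, 5, 3
embedding = nn.Embedding(20, hidden_size)            # hypothetical vocabulary of 20 tokens
gru = nn.GRU(hidden_size, hidden_size, bidirectional=True)

# Padded batch of token indexes, shape (max_len, batch); lengths in decreasing order
input_seq = torch.randint(1, 20, (max_len, batch))
input_lengths = torch.tensor([5, 3, 2])

packed = pack_padded_sequence(embedding(input_seq), input_lengths)
outputs, hidden = gru(packed)
outputs, out_lengths = pad_packed_sequence(outputs)  # back to (max_len, batch, 2 * hidden_size)

# Sum the forward and backward directions, as in EncoderRNN.forward
outputs = outputs[:, :, :hidden_size] + outputs[:, :, hidden_size:]
print(outputs.shape, hidden.shape)
```

Positions beyond a sequence's true length come back as zeros after unpacking, so the padding does not leak into the summed outputs.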
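Control flow and weight sharing
Suppose a model should use a randomly chosen number of hidden layers, between 1 and 4, on each forward pass (one input layer plus 0 to 3 applications of a middle layer), reusing the same weights for every middle layer. With PyTorch's dynamic graphs this is easy to implement with an ordinary Python loop, which is simpler than the static-graph approach in TensorFlow.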
import random
import torch


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred
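Weight sharing can also be checked on its own. In this sketch (layer size and depth are arbitrary choices for the example), one Linear layer is applied three times; the parameter count stays that of a single layer, and backward() accumulates the contributions of all three uses into the same weight.grad:

```python
import torch

torch.manual_seed(0)
shared = torch.nn.Linear(4, 4)           # one set of weights, reused below

def forward(x, depth):
    h = x
    for _ in range(depth):               # the same module is applied `depth` times
        h = shared(h).clamp(min=0)
    return h

n_params = sum(p.numel() for p in shared.parameters())   # 4*4 weights + 4 biases
loss = forward(torch.randn(2, 4), depth=3).sum()
loss.backward()                          # gradients from all three uses land in one .grad
print(n_params, shared.weight.grad.shape)
```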
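Training the model
Once the model is defined, training follows the same procedure as implementing gradient descent by hand:
- Generate predictions
- Compute the loss
- Compute the gradients of the loss with respect to the weights and biases
- Adjust the weights by subtracting a small quantity proportional to the gradient
- Reset the gradients to zero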
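The five steps above can be sketched as a single loop. The model here is a plain linear regression on random data, and the learning rate and step count are arbitrary values chosen for illustration:

```python
import torch

torch.manual_seed(0)
x = torch.randn(64, 10)
y = torch.randn(64, 1)
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(50):
    y_pred = model(x)               # 1. generate predictions
    loss = loss_fn(y_pred, y)       # 2. compute the loss
    loss.backward()                 # 3. compute gradients w.r.t. weights and biases
    optimizer.step()                # 4. subtract lr * gradient from each parameter
    optimizer.zero_grad()           # 5. reset the gradients to zero
    losses.append(loss.item())

print(losses[0], losses[-1])        # the loss should go down over the run
```

Resetting the gradients at the end of the iteration, as here, is equivalent to resetting them just before backward(), as in the earlier examples; what matters is that no stale gradient is left when backward() runs.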