While learning PyTorch I came across the topic in the title, so here is a quick note on my understanding.
Video link: here
Timestamp: 66:32
Assume the network has no bias term b; its structure is then:
$$\begin{aligned} hidden &= ReLU(x*w_1) \\ \hat y &= hidden*w_2 \\ loss &= (\hat y - y)^2 \end{aligned}$$
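These three steps can be sketched directly in NumPy (the shapes below are illustrative assumptions, not taken from the video):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal((1, 3))    # one sample with 3 input features
w1 = rng.standard_normal((3, 4))   # input -> hidden weights
w2 = rng.standard_normal((4, 2))   # hidden -> output weights
y = rng.standard_normal((1, 2))

hidden = np.maximum(x @ w1, 0)     # hidden = ReLU(x * w1)
y_hat = hidden @ w2                # y_hat = hidden * w2
loss = np.sum((y_hat - y) ** 2)    # loss = (y_hat - y)^2, summed to a scalar
print(loss)
```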
Rearranging slightly, the loss can be written as:
$$\begin{aligned} loss &= (ReLU(x*w_1)*w_2 - y)^2 \\ &= (ReLU(XW_1)W_2 - Y)^T(ReLU(XW_1)W_2 - Y) \end{aligned}$$
Writing $Z = ReLU(XW_1)W_2 - Y$, the partial derivative with respect to $w_1$ is:
$$\begin{aligned} \frac{\partial loss}{\partial w_1} &= \frac{\partial Z^TZ}{\partial w_1} \\ &= \frac{\partial Z^TZ}{\partial Z} \cdot \frac{\partial Z}{\partial W_1} \\ &= 2Z \cdot \frac{\partial (ReLU(XW_1)W_2 - Y)}{\partial W_1} \\ &= 2(\hat Y - Y) \cdot \frac{\partial (XW_1W_2 - Y)}{\partial W_1}, \quad \text{when } XW_1 \geqslant 0 \\ &= 2X^T(\hat Y - Y)W_2^T, \quad \text{when } XW_1 \geqslant 0 \end{aligned}$$
Similarly:
$$\frac{\partial loss}{\partial w_2} = 2 W_1^T X^T (\hat Y - Y)$$
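As a quick sanity check of these two formulas (in the fully linear case, with the ReLU dropped entirely so the simplified expressions are exact), a finite-difference comparison on one gradient entry:

```python
import numpy as np

# Small illustrative shapes; all data is random.
rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 3, 5, 2
X = rng.standard_normal((N, D_in))
Y = rng.standard_normal((N, D_out))
W1 = rng.standard_normal((D_in, H))
W2 = rng.standard_normal((H, D_out))

def loss(W1, W2):
    # Linear network: loss = (X W1 W2 - Y)^T (X W1 W2 - Y), summed to a scalar
    return np.sum((X @ W1 @ W2 - Y) ** 2)

# Analytic gradients from the derivation above
diff = X @ W1 @ W2 - Y            # (Y_hat - Y)
grad_W1 = 2 * X.T @ diff @ W2.T   # 2 X^T (Y_hat - Y) W2^T
grad_W2 = 2 * W1.T @ X.T @ diff   # 2 W1^T X^T (Y_hat - Y)

# Central finite difference on a single entry of W1
eps = 1e-5
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
num = (loss(W1p, W2) - loss(W1m, W2)) / (2 * eps)
print(np.isclose(num, grad_W1[0, 0], rtol=1e-3, atol=1e-6))
```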
Here the condition $XW_1 \geqslant 0$ is simply ignored: when computing the gradients, the piecewise $ReLU$ is not treated case by case but handled uniformly, as if the activation were not there at all, in order to simplify the problem.
Note, however, that the $ReLU$ is still applied when computing $(\hat Y - Y)$, i.e. the forward pass keeps the activation; in practice this speeds up convergence.
```python
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    # loss = (y_pred - y) ** 2
    grad_y_pred = 2.0 * (y_pred - y)

    # Exact backprop (with the ReLU mask) from the video:
    # grad_w2 = h_relu.T.dot(grad_y_pred)
    # grad_h_relu = grad_y_pred.dot(w2.T)
    # grad_h = grad_h_relu.copy()
    # grad_h[h < 0] = 0
    # grad_w1 = x.T.dot(grad_h)

    # Simplified gradients derived above (ReLU ignored in the backward pass)
    grad_w1 = 2 * x.T.dot(y_pred - y).dot(w2.T)
    grad_w2 = 2 * w1.T.dot(x.T).dot(y_pred - y)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
```
[Note] The code above comes from the video linked at the top; the only part I modified is the gradient computation. This is just a simple note for my own understanding.
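For reference, a small sketch comparing the exact backprop gradient (the ReLU-masked version shown in the commented-out lines) with the simplified one on random data; the two disagree wherever the mask zeroes a hidden unit. The shapes here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D_in, H, D_out = 8, 4, 6, 3
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

# Forward pass (ReLU kept, as in the note above)
h = x @ w1
h_relu = np.maximum(h, 0)
y_pred = h_relu @ w2

# Exact gradient of w1: the ReLU mask zeroes entries where h < 0
grad_y_pred = 2.0 * (y_pred - y)
grad_h = grad_y_pred @ w2.T
grad_h[h < 0] = 0
grad_w1_exact = x.T @ grad_h

# Simplified gradient: ReLU ignored in the backward pass only
grad_w1_simple = 2 * x.T @ (y_pred - y) @ w2.T

# Same shape, but the values differ because h has negative entries
print(grad_w1_exact.shape == grad_w1_simple.shape)
print(np.allclose(grad_w1_exact, grad_w1_simple))
```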