In the previous post, Revisiting the Backpropagation Algorithm (Derivation), we walked through the origin of the BP algorithm and its full derivation. In this post, we take handwritten-digit recognition as an example and implement backpropagation with two different cost functions, cross-entropy and squared error (based on Python 3.x).
1. Network Structure
The dataset and the network architecture both come from the week-5 programming assignment on coursera, on handwritten-digit recognition. The data matrix has shape $5000 \times 400$: 5000 samples (images), each of dimension 400. The input layer therefore has 400 neurons, and the other two layers have 25 neurons (hidden layer) and 10 neurons (output layer; the classes are 1 through 10, where 10 stands for the digit 0). The network structure is shown below:
where:

$$z^{(l+1)} = W^{(l)}a^{(l)} + b^{(l)}, \qquad a^{(l+1)} = \sigma\left(z^{(l+1)}\right)$$

That is, for the network above, $a^{(1)} = x$; $a^{(l)}_i$ denotes the activation of the $i$-th neuron in layer $l$, and $b^{(l)}$ denotes the bias of layer $l$.
Following the derivation in Revisiting the Backpropagation Algorithm (Derivation), the gradient formulas for a single sample (without the regularization term) are:

$$\delta^{(L)} = \frac{\partial J}{\partial a^{(L)}} \odot \sigma'\left(z^{(L)}\right), \qquad \delta^{(l)} = \left(W^{(l)}\right)^{T}\delta^{(l+1)} \odot \sigma'\left(z^{(l)}\right)$$

$$\frac{\partial J}{\partial W^{(l)}} = \delta^{(l+1)}\left(a^{(l)}\right)^{T}, \qquad \frac{\partial J}{\partial b^{(l)}} = \delta^{(l+1)}$$
2. Squared-Error Cost Function
We first build the network with the squared error as the cost function, and switch to cross-entropy afterwards. The squared-error cost is:

$$J = \frac{1}{2m}\sum_{i=1}^{m}\left\|a^{(L)}\left(x^{(i)}\right) - y^{(i)}\right\|^2 + \frac{\lambda}{2}\left(\left\|W^{(1)}\right\|_F^2 + \left\|W^{(2)}\right\|_F^2\right)$$

For the squared-error cost,

$$\frac{\partial J}{\partial a^{(L)}} = -\left(y - a^{(L)}\right),$$

and therefore:

$$\delta^{(L)} = -\left(y - a^{(L)}\right) \odot \sigma'\left(z^{(L)}\right)$$
Forward propagation:
m, n = np.shape(X) # m:samples, n: dimensions
a1 = X.T # 400 by 5000
z2 = np.dot(W1, a1) + b1 # 25 by 400 dot 400 by 5000 + 25 by 1= 25 by 5000
a2 = sigmoid(z2) # 25 by 5000
z3 = np.dot(W2, a2) + b2 # 10 by 25 dot 25 by 5000 + 10 by 1= 10 by 5000
a3 = sigmoid(z3) # 10 by 5000
cost = (1/(2*m))*np.sum((a3-y_label)**2)+(lambd/2)*(np.sum(W1**2)+np.sum(W2**2))  # the 1/2 factor makes the gradient code below exact
Tip: annotating each line of code with the array shapes makes it much easier to catch dimension errors.
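The forward pass above relies on `sigmoid` and `sigmoidGradient`, which are never listed in the post; a minimal NumPy implementation (an assumption about the author's helpers, matching how they are called) might be:

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoidGradient(z):
    """Derivative of the logistic function: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)
```

Both accept scalars or NumPy arrays, so they work unchanged on the 25-by-5000 and 10-by-5000 matrices used here.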
Backward propagation:
delta3 = -(y_label - a3)*sigmoidGradient(z3)  # 10 by 5000, residual (delta) of the output layer L
df_w2 = np.dot(delta3, a2.T)  # 10 by 5000 dot 5000 by 25 = 10 by 25, dJ/dW2
df_w2 = (1/m)*df_w2 + lambd*W2  # add the gradient of the regularization term
delta2 = np.dot(W2.T, delta3)*sigmoidGradient(z2)  # (25 by 10 dot 10 by 5000) * (25 by 5000) = 25 by 5000
df_w1 = np.dot(delta2, a1.T)  # 25 by 5000 dot 5000 by 400 = 25 by 400
df_w1 = (1/m)*df_w1 + lambd*W1
df_b1 = (1/m)*np.sum(delta2, axis=1).reshape(b1.shape)
df_b2 = (1/m)*np.sum(delta3, axis=1).reshape(b2.shape)
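A good way to sanity-check gradients like `df_w1` is numerical gradient checking. Below is a self-contained sketch on a tiny random network (the sizes, seed, and helper names are my own, not from the post); it compares the backpropagated gradient with a central-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoidGradient(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Tiny random network: 4 inputs, 3 hidden, 2 outputs, 5 samples, no regularization
rng = np.random.RandomState(0)
X = rng.randn(5, 4)
y_label = rng.randint(0, 2, size=(2, 5)).astype(float)
W1 = 0.1 * rng.randn(3, 4); b1 = 0.1 * rng.randn(3, 1)
W2 = 0.1 * rng.randn(2, 3); b2 = 0.1 * rng.randn(2, 1)
m = X.shape[0]

def squared_error_cost(W1):
    a2 = sigmoid(np.dot(W1, X.T) + b1)
    a3 = sigmoid(np.dot(W2, a2) + b2)
    return (1.0 / (2 * m)) * np.sum((a3 - y_label) ** 2)

# Analytic gradient of J w.r.t. W1, exactly as in the backprop code above
a1 = X.T
z2 = np.dot(W1, a1) + b1; a2 = sigmoid(z2)
z3 = np.dot(W2, a2) + b2; a3 = sigmoid(z3)
delta3 = -(y_label - a3) * sigmoidGradient(z3)
delta2 = np.dot(W2.T, delta3) * sigmoidGradient(z2)
df_w1 = (1.0 / m) * np.dot(delta2, a1.T)

# Numerical gradient via central differences
eps = 1e-5
num = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp = W1.copy(); Wp[i, j] += eps
        Wm = W1.copy(); Wm[i, j] -= eps
        num[i, j] = (squared_error_cost(Wp) - squared_error_cost(Wm)) / (2 * eps)

print(np.max(np.abs(num - df_w1)))  # should be vanishingly small
```

If the two disagree by more than roughly 1e-6, something in the backward pass is wrong.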
3. Cross-Entropy Cost Function
For the cross-entropy cost,

$$J = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)} \log a^{(L)} + \left(1 - y^{(i)}\right)\log\left(1 - a^{(L)}\right)\right] + \frac{\lambda}{2m}\left(\left\|W^{(1)}\right\|_F^2 + \left\|W^{(2)}\right\|_F^2\right),$$

the $\sigma'\left(z^{(L)}\right)$ factor cancels, and:

$$\delta^{(L)} = a^{(L)} - y$$
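That the sigmoid derivative really cancels for the cross-entropy cost can be confirmed numerically; the toy values below are only for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# confirm that dJ/dz = a - y for the cross-entropy + sigmoid pair
z = np.array([0.5, -1.2, 2.0])
y = np.array([1.0, 0.0, 1.0])
a = sigmoid(z)

def J(z):
    a = sigmoid(z)
    return np.sum(-y * np.log(a) - (1 - y) * np.log(1 - a))

# central-difference estimate of dJ/dz
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp = z.copy(); zp[i] += eps
    zm = z.copy(); zm[i] -= eps
    numeric[i] = (J(zp) - J(zm)) / (2 * eps)

print(np.max(np.abs(numeric - (a - y))))  # vanishingly small
```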
Forward propagation:
a1 = X.T  # 400 by 5000
z2 = np.dot(W1, a1) + b1  # 25 by 400 dot 400 by 5000 + 25 by 1 = 25 by 5000
a2 = sigmoid(z2)  # 25 by 5000
z3 = np.dot(W2, a2) + b2  # 10 by 25 dot 25 by 5000 + 10 by 1 = 10 by 5000
a3 = sigmoid(z3)  # 10 by 5000
cost = (1/m)*np.sum(np.sum(-y_label*np.log(a3)-(1-y_label)*np.log(1-a3)))
cost = cost+(lambd / (2*m)) * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
# apart from the cost function, nothing else changes
Backward propagation:
delta3 = a3 - y_label  # 10 by 5000; note: no sigmoidGradient factor for cross-entropy
df_w2 = np.dot(delta3, a2.T)  # 10 by 5000 dot 5000 by 25 = 10 by 25
df_w2 = (1/m)*df_w2 + (lambd/m)*W2
delta2 = np.dot(W2.T, delta3)*sigmoidGradient(z2)  # (25 by 10 dot 10 by 5000) * (25 by 5000) = 25 by 5000
df_w1 = np.dot(delta2, a1.T)  # 25 by 5000 dot 5000 by 400 = 25 by 400
df_w1 = (1/m)*df_w1 + (lambd/m)*W1
df_b1 = (1/m)*np.sum(delta2, axis=1).reshape(b1.shape)
df_b2 = (1/m)*np.sum(delta3, axis=1).reshape(b2.shape)
4. Notes on the Remaining Code
4.1 Loading the Dataset
def loadData():
    data = loadmat('ex4data1.mat')  # from scipy.io import loadmat
    X = data['X']  # 5000 by 400, samples by dimensions
    y = data['y'].reshape(5000)
    eye = np.eye(10)
    y_label = eye[:, y - 1]  # 10 by 5000, one-hot label matrix
    ss = StandardScaler()  # from sklearn.preprocessing import StandardScaler
    X = ss.fit_transform(X)  # standardize the features
    return X, y, y_label
For how the one-hot label matrix is built, see the companion post How to convert a dataset's outputs into label vectors containing only 0s and 1s?
Because Python indexes from 0, `y` is decremented by 1 in the `eye[:, y - 1]` line above.
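As a quick illustration of the `eye[:, y - 1]` trick, with made-up labels:

```python
import numpy as np

# labels are 1..10, where 10 stands for the digit 0
y = np.array([10, 1, 3])
eye = np.eye(10)
y_label = eye[:, y - 1]  # 10 by 3: one one-hot column per sample
print(y_label[:, 0])     # the "10" label becomes a 1 in the last row
```

Indexing the columns of an identity matrix with the (shifted) labels produces the one-hot columns in a single vectorized step.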
4.2 Training and Saving the Parameters
def train(iteration, learn_rate, lambd):  # hyperparameters: number of steps, step size, regularization strength
    X, y, y_label = loadData()
    m, n = np.shape(X)  # m: samples, n: dimensions
    input_layer_size = 400
    hidden_layer_size = 25
    output_layer_size = 10
    epsilon_init = 0.12
    # random initialization: every parameter starts with absolute value below 0.12
    W1 = np.random.rand(hidden_layer_size, input_layer_size)*2*epsilon_init - epsilon_init
    W2 = np.random.rand(output_layer_size, hidden_layer_size)*2*epsilon_init - epsilon_init
    b1 = np.random.rand(hidden_layer_size, 1)*2*epsilon_init - epsilon_init
    b2 = np.random.rand(output_layer_size, 1)*2*epsilon_init - epsilon_init
    cost = []
    for i in range(iteration):
        # randomly pick 500 of the 5000 samples to form one batch
        arr = np.arange(5000)  # indices 0 to 4999
        np.random.shuffle(arr)  # shuffle in place
        index = arr[:500]
        batch_X = X[index, :]
        batch_y = y_label[:, index]
        c, df_w1, df_w2, df_b1, df_b2 = costFandGradient(batch_X, batch_y, W1, b1, W2, b2, lambd)
        cost.append(round(c, 4))
        W1, b1, W2, b2 = gradientDescent(learn_rate, W1, b1, W2, b2, df_w1, df_w2, df_b1, df_b2)
    # save the parameters as a dict in a file named 'data':
    # after training once, later runs can simply load the saved model and predict
    p = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
    with open('data', 'wb') as temp:
        pickle.dump(p, temp)
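`costFandGradient` and `gradientDescent` are not listed in the post. `costFandGradient` is presumably the forward/backward code of sections 2 or 3 wrapped in a function; a plain gradient-descent step consistent with how `train` calls it might look like:

```python
def gradientDescent(learn_rate, W1, b1, W2, b2, df_w1, df_w2, df_b1, df_b2):
    # one vanilla gradient-descent step on every parameter;
    # returns the parameters in the order train() unpacks them
    W1 = W1 - learn_rate * df_w1
    b1 = b1 - learn_rate * df_b1
    W2 = W2 - learn_rate * df_w2
    b2 = b2 - learn_rate * df_b2
    return W1, b1, W2, b2
```

The arithmetic is elementwise, so the same function works whether the parameters are scalars or NumPy arrays.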
4.3 Loading the Model Parameters and Predicting
def prediction():
    X, y, y_label = loadData()
    # load the saved model parameters and assign them to the weight variables
    with open('data', 'rb') as p:
        data = pickle.load(p)
    W1 = data['W1']
    W2 = data['W2']
    b1 = data['b1']
    b2 = data['b2']
    a1 = X.T  # 400 by 5000
    z2 = np.dot(W1, a1) + b1  # 25 by 400 dot 400 by 5000 + 25 by 1 = 25 by 5000
    a2 = sigmoid(z2)  # 25 by 5000
    z3 = np.dot(W2, a2) + b2  # 10 by 25 dot 25 by 5000 + 10 by 1 = 10 by 5000
    a3 = sigmoid(z3)  # 10 by 5000
    y_pre = np.zeros(a3.shape[1], dtype=int)
    for i in range(a3.shape[1]):  # loop over the 5000 predicted columns
        col = a3[:, i]
        # take the index of the largest output; +1 because Python indexes from 0
        index = np.where(col == np.max(col))[0][0] + 1
        y_pre[i] = index
    print(accuracy_score(y, y_pre))  # from sklearn.metrics import accuracy_score
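The prediction loop can also be written in one line with `np.argmax`, which likewise returns the first index of the maximum; a toy example:

```python
import numpy as np

a3 = np.array([[0.1, 0.8],
               [0.7, 0.1],
               [0.2, 0.1]])        # 3 classes by 2 samples (made-up scores)
y_pre = np.argmax(a3, axis=0) + 1  # +1 maps 0-based row indices to labels 1..K
print(y_pre)  # [2 1]
```

On the full 10-by-5000 matrix this replaces the Python-level loop with a single vectorized call.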