Deep Learning | Andrew Ng's Sequence Models Course, Week 1 Programming Assignment, Lab 1 --- Building an RNN and an LSTM Step by Step with NumPy

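All labs in Andrew Ng's Deep Learning Specialization are implemented as iPython Notebooks; if you are not familiar with them, it is worth trying Notebooks out beforehand. In this lab we use NumPy to build a basic RNN and an LSTM network step by step.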

Complete code (Chinese-translated version, with answers)

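1. Lab Overview

2. Importing the Required Packages

3. Forward Propagation of the Basic RNN Cell

4. LSTM (Long Short-Term Memory) Network

5. RNN Backpropagation (Optional)

6. LSTM Backpropagation (Optional)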

1. Lab Overview

2. Importing the Required Packages

import numpy as np
# rnn_utils.py defines the helper functions needed for this lab
from rnn_utils import *
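
The helper module rnn_utils.py is not reproduced in this post. The code below only relies on two of its functions, softmax and sigmoid. As a rough sketch, assuming both act on arrays of shape (n, m) with one example per column, they could look like this; the actual course file may differ in its details.

```python
import numpy as np

def softmax(x):
    """Column-wise softmax for an array of shape (n, m)."""
    # Subtracting the per-column maximum does not change the result
    # but avoids overflow in np.exp.
    e_x = np.exp(x - np.max(x, axis=0, keepdims=True))
    return e_x / e_x.sum(axis=0, keepdims=True)

def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([[1.0, 2.0],
                   [1.0, 0.0]])
probs = softmax(scores)  # each column sums to 1
```

Normalizing over axis 0 matters here because every column is a separate example.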

3. Forward Propagation of the Basic RNN Cell

Basic RNN cell

# GRADED FUNCTION: rnn_cell_forward

def rnn_cell_forward(xt, a_prev, parameters):
    """
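    Implements a single time step of the basic RNN cell described in Figure (2).

    Arguments:
    xt -- input at time step t, shape (n_x, m); m examples are processed at once, and each example's input at one time step is a vector of shape (n_x, 1).
    a_prev -- hidden state output by time step "t-1", shape (n_a, m)
    parameters -- Python dictionary containing:
                        Wax -- weight matrix multiplying the input, shape (n_a, n_x)
                        Waa -- weight matrix multiplying the previous hidden state, shape (n_a, n_a)
                        Wya -- weight matrix multiplying the current hidden state to produce the output, shape (n_y, n_a)
                        ba --  bias used to compute the current hidden state, shape (n_a, 1)
                        by --  bias used to compute the current output, shape (n_y, 1)
    Returns:
    a_next -- current hidden state, shape (n_a, m)
    yt_pred -- current output (prediction), shape (n_y, m)
    cache -- tuple (a_next, a_prev, xt, parameters) holding the values needed by the backward pass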
    """
    
    # Retrieve the parameters
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Wya = parameters["Wya"]
    ba = parameters["ba"]
    by = parameters["by"]

    # Compute the hidden state for the current time step
    a_next = np.tanh(Wax.dot(xt) + Waa.dot(a_prev) + ba)
    # Compute the output for the current time step
    yt_pred = softmax(Wya.dot(a_next) + by)

    # Store the values needed by the backward pass
    cache = (a_next, a_prev, xt, parameters)
    
    return a_next, yt_pred, cache
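
A quick way to sanity-check the cell is to run a single step on random data and look at the output shapes. The sketch below restates the two formulas inline so that it runs on its own; the sizes (n_x=3, n_a=5, n_y=2, m=10), the random seed and the local softmax are assumptions chosen purely for illustration.

```python
import numpy as np

def softmax(z):
    # Column-wise softmax (assumed to match the helper in rnn_utils)
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(1)
n_x, n_a, n_y, m = 3, 5, 2, 10   # illustrative sizes
xt = rng.standard_normal((n_x, m))
a_prev = rng.standard_normal((n_a, m))
Wax = rng.standard_normal((n_a, n_x))
Waa = rng.standard_normal((n_a, n_a))
Wya = rng.standard_normal((n_y, n_a))
ba = rng.standard_normal((n_a, 1))   # broadcast across the m columns
by = rng.standard_normal((n_y, 1))

# One time step of the basic RNN cell
a_next = np.tanh(Wax @ xt + Waa @ a_prev + ba)
yt_pred = softmax(Wya @ a_next + by)

print(a_next.shape, yt_pred.shape)   # (5, 10) (2, 10)
```

The biases have shape (n, 1) and are broadcast over the m examples, which is why the same code works for any batch size.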

RNN forward propagation

# GRADED FUNCTION: rnn_forward

def rnn_forward(x, a0, parameters):
    """
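    Implements the forward pass of the RNN (with basic cells) described in Figure (3).

    Arguments:
    x -- input data for all time steps, shape (n_x, m, T_x); m examples (sequences) are processed at once, and each has T_x time steps (all sequences have the same length).
         Each example's input at one time step is a vector of shape (n_x, 1).
         In practice, to allow vectorization, the m sequences are brought to one fixed length: long sequences are truncated and short ones are padded.
         Alternatively, the fixed length is set to the length of the longest sequence and every shorter sequence is padded up to it.
    a0 -- initial hidden state for the m examples (sequences), shape (n_a, m)
    parameters -- Python dictionary containing:
                        Wax -- weight matrix multiplying the input, shape (n_a, n_x)
                        Waa -- weight matrix multiplying the previous hidden state, shape (n_a, n_a)
                        Wya -- weight matrix multiplying the current hidden state to produce the output, shape (n_y, n_a)
                        ba --  bias used to compute the current hidden state, shape (n_a, 1)
                        by --  bias used to compute the current output, shape (n_y, 1)

    Returns:
    a -- hidden states of the m examples (sequences) for all time steps, shape (n_a, m, T_x)
    y_pred -- predictions of the m examples (sequences) for all time steps, shape (n_y, m, T_x)
    caches -- tuple (list of caches, x) holding the values needed by the backward pass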
    """
    
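    # List that collects the cache from every time step
    caches = []

    # Retrieve dimensions from the shapes of x and Wya
    n_x, m, T_x = x.shape
    n_y, n_a = parameters["Wya"].shape

    # Initialize a and y_pred with zeros
    a = np.zeros((n_a,m,T_x))
    y_pred = np.zeros((n_y,m,T_x))

    # Initialize a_next
    a_next = a0

    # Loop over all time steps (the m examples are processed together)
    for t in range(T_x):
        # Update the "next" hidden state, compute the prediction at this time step, and get the cache
        a_next, yt_pred, cache = rnn_cell_forward(x[:,:,t],a_next,parameters)
        # Save the hidden state computed at this time step (the "next" hidden state)
        a[:,:,t] = a_next
        # Save the prediction computed at this time step
        y_pred[:,:,t] = yt_pred

        caches.append(cache)

    # Bundle the per-step caches with x for the backward pass
    caches = (caches, x)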
    
    return a, y_pred, caches
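
The docstring above mentions that sequences are truncated or padded to a common length before they are stacked into an array of shape (n_x, m, T_x). A minimal sketch of that step follows. The helper name pad_sequences, the choice of zero padding at the end, and the toy data are all assumptions made for illustration; they are not part of the assignment.

```python
import numpy as np

def pad_sequences(seqs, T_x):
    """Stack variable-length sequences into an array of shape (n_x, m, T_x).

    seqs -- list of m arrays, each of shape (n_x, T_i)
    Sequences longer than T_x are truncated; shorter ones are
    zero-padded at the end (hypothetical helper).
    """
    n_x = seqs[0].shape[0]
    x = np.zeros((n_x, len(seqs), T_x))
    for i, s in enumerate(seqs):
        T_i = min(s.shape[1], T_x)      # number of steps actually copied
        x[:, i, :T_i] = s[:, :T_i]
    return x

# Three toy sequences with 3, 5 and 7 time steps, n_x = 2
seqs = [np.ones((2, 3)), np.ones((2, 5)), np.ones((2, 7))]
x = pad_sequences(seqs, T_x=5)
print(x.shape)   # (2, 3, 5)
```

The resulting x has the layout rnn_forward expects, with the time axis last.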

4. LSTM (Long Short-Term Memory) Network

About the gates in an LSTM

LSTM cell

# GRADED FUNCTION: lstm_cell_forward

def lstm_cell_forward(xt, a_prev, c_prev, parameters):
    """
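    Implements a single time step of the forward pass of the LSTM cell described in Figure (4).

    Arguments:
    xt -- input data at time step t, vectorized over m examples/sequences, shape (n_x, m)
    a_prev -- hidden state at time step t-1, shape (n_a, m)
    c_prev -- cell (memory) state at time step t-1, shape (n_a, m)
    parameters -- Python dictionary containing:
                        Wf -- weight matrix of the forget gate, shape (n_a, n_a + n_x)
                        bf -- bias of the forget gate, shape (n_a, 1)
                        Wi -- weight matrix of the update gate, shape (n_a, n_a + n_x)
                        bi -- bias of the update gate, shape (n_a, 1)
                        Wc -- weight matrix of the first "tanh", shape (n_a, n_a + n_x)
                        bc -- bias of the first "tanh", shape (n_a, 1)
                        Wo -- weight matrix of the output gate, shape (n_a, n_a + n_x)
                        bo -- bias of the output gate, shape (n_a, 1)
                        Wy -- weight matrix relating the hidden state to the output, shape (n_y, n_a)
                        by -- bias relating the hidden state to the output, shape (n_y, 1)

    Returns:
    a_next -- hidden state computed by this cell and passed on to the next cell, shape (n_a, m)
    c_next -- cell state computed by this cell and passed on to the next cell, shape (n_a, m)
    yt_pred -- prediction at time step t, shape (n_y, m)
    cache -- tuple of values needed by the backward pass: (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)

    Note: ft/it/ot stand for the forget/update/output gates, cct stands for the candidate value (c tilde),
          c stands for the memory (cell) state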
    """

    # Retrieve the parameters
    Wf = parameters["Wf"]
    bf = parameters["bf"]
    Wi = parameters["Wi"]
    bi = parameters["bi"]
    Wc = parameters["Wc"]
    bc = parameters["bc"]
    Wo = parameters["Wo"]
    bo = parameters["bo"]
    Wy = parameters["Wy"]
    by = parameters["by"]
    
    # Retrieve dimensions from the shapes of xt and Wy
    n_x, m = xt.shape # m: number of examples (sequences); n_x: length of one example's input at a single time step (embedding size)
    n_y, n_a = Wy.shape # n_a: number of hidden units in the LSTM cell; n_y: length of the prediction (number of output units)


    # Concatenate a_prev and xt
    concat = np.zeros((n_a+n_x,m))
    concat[: n_a, :] = a_prev
    concat[n_a :, :] = xt

    # Compute ft, it, cct, c_next, ot, a_next
    ft = sigmoid(Wf.dot(concat) + bf) #(n_a,m)
    it = sigmoid(Wi.dot(concat) + bi) #(n_a,m)
    cct = np.tanh(Wc.dot(concat) + bc) #(n_a,m)
    c_next = ft*c_prev + it*cct
    ot = sigmoid(Wo.dot(concat) + bo) #(n_a,m)
    a_next = ot*np.tanh(c_next) #(n_a,m)
    
    # Compute the prediction of the current LSTM cell
    yt_pred = softmax(Wy.dot(a_next) + by) #(n_y,m)

    # Store the values needed by the backward pass
    cache = (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)

    return a_next, c_next, yt_pred, cache
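
The cell-state update c_next = ft*c_prev + it*cct is easiest to understand at its two extremes. The sketch below saturates the gates by hand (pre-activations of +20 and -20, an arbitrary choice) instead of learning them: with the forget gate near 1 and the update gate near 0 the old memory is carried over, and with the roles swapped the memory is overwritten by the candidate value. All sizes and values are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_a, m = 4, 3                                   # illustrative sizes
c_prev = rng.standard_normal((n_a, m))          # previous cell state
cct = np.tanh(rng.standard_normal((n_a, m)))    # candidate value (c tilde)

# Case 1: forget gate ~ 1, update gate ~ 0 -> keep the old memory
ft = sigmoid(np.full((n_a, m), 20.0))
it = sigmoid(np.full((n_a, m), -20.0))
c_keep = ft * c_prev + it * cct                 # approximately c_prev

# Case 2: forget gate ~ 0, update gate ~ 1 -> overwrite with the candidate
ft2, it2 = it, ft
c_overwrite = ft2 * c_prev + it2 * cct          # approximately cct
```

In a trained network the gates take values between these extremes, separately for every hidden unit and every example.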

LSTM forward propagation

# GRADED FUNCTION: lstm_forward

def lstm_forward(x, a0, parameters):
    """
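    Implements the forward pass of the RNN with LSTM cells described in Figure (3).

    Arguments:
    x -- input data for all time steps, shape (n_x, m, T_x); m examples (sequences) are processed at once, and each has T_x time steps (all sequences have the same length).
         Each example's input at one time step is a vector of shape (n_x, 1).
         In practice, to allow vectorization, the m sequences are brought to one fixed length: long sequences are truncated and short ones are padded.
         Alternatively, the fixed length is set to the length of the longest sequence and every shorter sequence is padded up to it.

    a0 -- initial hidden state for the m examples (sequences), shape (n_a, m)
    parameters -- Python dictionary containing:
                        Wf -- weight matrix of the forget gate, shape (n_a, n_a + n_x)
                        bf -- bias of the forget gate, shape (n_a, 1)
                        Wi -- weight matrix of the update gate, shape (n_a, n_a + n_x)
                        bi -- bias of the update gate, shape (n_a, 1)
                        Wc -- weight matrix of the first "tanh", shape (n_a, n_a + n_x)
                        bc -- bias of the first "tanh", shape (n_a, 1)
                        Wo -- weight matrix of the output gate, shape (n_a, n_a + n_x)
                        bo -- bias of the output gate, shape (n_a, 1)
                        Wy -- weight matrix relating the hidden state to the output, shape (n_y, n_a)
                        by -- bias relating the hidden state to the output, shape (n_y, 1)

    Returns:
    a -- hidden states of the m examples (sequences) for all time steps, shape (n_a, m, T_x)
    y -- predictions of the m examples (sequences) for all time steps, shape (n_y, m, T_x)
    c -- cell states of the m examples (sequences) for all time steps, shape (n_a, m, T_x)
    caches -- tuple (list of caches, x) holding the values needed by the backward pass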
    """

   
    caches = []
    
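    # Retrieve dimensions from the shapes of x and Wy
    n_x, m, T_x = x.shape
    n_y, n_a = parameters["Wy"].shape
    
    # Initialize a, c and y with zeros
    a = np.zeros((n_a,m,T_x))
    c = np.zeros((n_a,m,T_x))
    y = np.zeros((n_y,m,T_x))
    
    # Initialize a_next and c_next
    a_next = a0
    c_next = np.zeros((n_a,m))
    
    # Loop over all time steps
    for t in range(T_x):
        # Run the LSTM cell at this time step: it returns the hidden state and cell state (both passed to the next cell), the prediction, and the cache
        a_next, c_next, yt, cache = lstm_cell_forward(x[:,:,t],a_next,c_next,parameters)
        # Save the new hidden state computed at this time step in a
        a[:,:,t] = a_next
        # Save the prediction computed at this time step in y
        y[:,:,t] = yt
        # Save the new cell state computed at this time step in c
        c[:,:,t]  = c_next
        # Append the cache to caches
        caches.append(cache)

    # Bundle the per-step caches with x for the backward pass
    caches = (caches, x)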

    return a, y, c, caches

5. RNN Backpropagation (Optional)

Backward pass for the basic RNN cell

def rnn_cell_backward(da_next, cache):
    """
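    Implements the backward pass for the basic RNN cell (single time step).
    
    Arguments:
    da_next -- gradient of the loss with respect to a_next
    cache -- tuple of values cached by the forward pass (output of rnn_cell_forward())

    Returns:
    gradients -- Python dictionary containing:
                        dxt -- gradient of the loss w.r.t. the input xt, same shape as xt (n_x, m)
                        da_prev -- gradient of the loss w.r.t. a_prev, same shape as a_prev (n_a, m)
                        dWax -- gradient of the loss w.r.t. the input-to-hidden weight matrix Wax, same shape as Wax (n_a, n_x)
                        dWaa -- gradient of the loss w.r.t. the hidden-to-hidden weight matrix Waa, same shape as Waa (n_a, n_a)
                        dba -- gradient of the loss w.r.t. the bias, same shape as ba (n_a, 1)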
    """
    
    # Retrieve the values cached by the forward pass
    (a_next, a_prev, xt, parameters) = cache
    
    # Retrieve the parameters
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Wya = parameters["Wya"]
    ba = parameters["ba"]
    by = parameters["by"]


    # Gradient of the loss w.r.t. the tanh pre-activation, using tanh'(z) = 1 - a_next**2
    dtanh = (1 - a_next**2)*da_next  #(n_a,m)

    # Gradients of the loss w.r.t. xt and Wax
    dxt =  np.dot(Wax.T,dtanh) #(n_x,m)
    dWax = np.dot(dtanh,xt.T)  #(n_a,n_x)

    # Gradients of the loss w.r.t. a_prev and Waa
    da_prev = np.dot(Waa.T,dtanh) #(n_a,m)
    dWaa = np.dot(dtanh,a_prev.T) #(n_a,n_a)

    # Gradient of the loss w.r.t. ba (summed over the m examples)
    dba = np.sum(dtanh,axis=1,keepdims=True)


    gradients = {"dxt": dxt, "da_prev": da_prev, "dWax": dWax, "dWaa": dWaa, "dba": dba}
    
    return gradients
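
A common way to gain confidence in formulas like dWax = dtanh · xtᵀ is a finite-difference gradient check. The sketch below is self-contained and uses a surrogate loss sum(a_next * da_next), chosen so that its gradient with respect to a_next is exactly da_next. The surrogate loss, the sizes and the step size eps are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_a, m = 3, 4, 5                       # illustrative sizes
xt = rng.standard_normal((n_x, m))
a_prev = rng.standard_normal((n_a, m))
Wax = rng.standard_normal((n_a, n_x))
Waa = rng.standard_normal((n_a, n_a))
ba = rng.standard_normal((n_a, 1))
da_next = rng.standard_normal((n_a, m))     # upstream gradient

def scalar_loss(Wax_):
    # Surrogate loss whose gradient w.r.t. a_next is exactly da_next
    a_next = np.tanh(Wax_ @ xt + Waa @ a_prev + ba)
    return np.sum(a_next * da_next)

# Analytic gradient, same formulas as in rnn_cell_backward
a_next = np.tanh(Wax @ xt + Waa @ a_prev + ba)
dtanh = (1 - a_next ** 2) * da_next
dWax = dtanh @ xt.T

# Numerical gradient by central differences
eps = 1e-6
dWax_num = np.zeros_like(Wax)
for i in range(n_a):
    for j in range(n_x):
        Wp = Wax.copy(); Wp[i, j] += eps
        Wm = Wax.copy(); Wm[i, j] -= eps
        dWax_num[i, j] = (scalar_loss(Wp) - scalar_loss(Wm)) / (2 * eps)

max_err = np.max(np.abs(dWax - dWax_num))
print(max_err)   # should be tiny compared with the gradient entries
```

The same pattern applies to dWaa, dba, dxt and da_prev by perturbing the corresponding array instead of Wax.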

Backward pass for the basic RNN

def rnn_backward(da, caches):
    """
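    Implements the backward pass for an RNN (with basic cells) over the entire input sequence.

    Arguments:
    da -- gradients of all hidden states, shape (n_a, m, T_x)
    caches -- values cached by the forward pass (rnn_forward)
    
    Returns:
    gradients -- Python dictionary containing: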
                        dx -- Gradient w.r.t. the input data, numpy-array of shape (n_x, m, T_x)
                        da0 -- Gradient w.r.t the initial hidden state, numpy-array of shape (n_a, m)
                        dWax -- Gradient w.r.t the input's weight matrix, numpy-array of shape (n_a, n_x)
                        dWaa -- Gradient w.r.t the hidden state's weight matrix, numpy-arrayof shape (n_a, n_a)
                        dba -- Gradient w.r.t the bias, of shape (n_a, 1)
    """
        
    
    # Retrieve values from the first cache (t=1) of caches (≈2 lines)
    (caches, x) = caches
    (a1, a0, x1, parameters) = caches[0]
    
    # Retrieve dimensions from the shapes of da and x1
    n_a, m, T_x = da.shape
    n_x, m = x1.shape
    
    # Initialize the gradients with the right shapes
    dx = np.zeros((n_x,m,T_x))
    dWax = np.zeros((n_a,n_x))
    dWaa = np.zeros((n_a,n_a))
    dba = np.zeros((n_a,1))
    da0 = np.zeros((n_a,m))
    da_prevt = np.zeros((n_a,m))
    
    # Loop over all time steps, from the last to the first
    for t in reversed(range(T_x)):
        # Compute gradients at time step t. Choose wisely the "da_next" and the "cache" to use in the backward propagation step. (≈1 line)
        gradients = rnn_cell_backward(da[:, :, t] + da_prevt, caches[t])
        # Retrieve derivatives from gradients (≈ 1 line)
        dxt, da_prevt, dWaxt, dWaat, dbat = gradients["dxt"], gradients["da_prev"], gradients["dWax"], gradients["dWaa"], gradients["dba"]
        # Increment global derivatives w.r.t parameters by adding their derivative at time-step t (≈4 lines)
        dx[:, :, t] = dxt
        dWax += dWaxt
        dWaa += dWaat
        dba += dbat
        
    # Set da0 to the gradient of a which has been backpropagated through all time-steps (≈1 line) 
    da0 = da_prevt

    gradients = {"dx": dx, "da0": da0, "dWax": dWax, "dWaa": dWaa,"dba": dba}
    
    return gradients
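
The key line in the loop is the sum da[:, :, t] + da_prevt: the hidden state at step t receives gradient both directly from the loss at step t and indirectly from step t+1. The self-contained sketch below checks the resulting da0 against central differences on a tiny RNN. The surrogate loss (sum over t of a_t * da[:, :, t]), the sizes and eps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_a, m, T_x = 2, 3, 2, 4               # illustrative sizes
x = rng.standard_normal((n_x, m, T_x))
a0 = rng.standard_normal((n_a, m))
Wax = rng.standard_normal((n_a, n_x))
Waa = rng.standard_normal((n_a, n_a))
ba = rng.standard_normal((n_a, 1))
da = rng.standard_normal((n_a, m, T_x))     # direct gradient on each hidden state

def forward(a_init):
    # Returns the list of hidden states a_1 ... a_T
    states, a_t = [], a_init
    for t in range(T_x):
        a_t = np.tanh(Wax @ x[:, :, t] + Waa @ a_t + ba)
        states.append(a_t)
    return states

def loss(a_init):
    # Surrogate loss whose direct gradient w.r.t. a_t is da[:, :, t]
    return sum(np.sum(a_t * da[:, :, t]) for t, a_t in enumerate(forward(a_init)))

# Backpropagation through time for da0
states = forward(a0)
da_prevt = np.zeros((n_a, m))
for t in reversed(range(T_x)):
    dtanh = (1 - states[t] ** 2) * (da[:, :, t] + da_prevt)
    da_prevt = Waa.T @ dtanh
da0 = da_prevt

# Numerical gradient by central differences
eps = 1e-6
da0_num = np.zeros_like(a0)
for i in range(n_a):
    for j in range(m):
        ap = a0.copy(); ap[i, j] += eps
        am = a0.copy(); am[i, j] -= eps
        da0_num[i, j] = (loss(ap) - loss(am)) / (2 * eps)

max_err = np.max(np.abs(da0 - da0_num))
```

Dropping da_prevt from the sum would make the analytic and numerical values disagree for every T_x greater than 1.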

6. LSTM Backpropagation (Optional)

Single-step backward pass

def lstm_cell_backward(da_next, dc_next, cache):
    """
    Implements the backward pass for the LSTM cell (single time step).

    Arguments:
    da_next -- Gradients of next hidden state, of shape (n_a, m)
    dc_next -- Gradients of next cell state, of shape (n_a, m)
    cache -- cache storing information from the forward pass

    Returns:
    gradients -- python dictionary containing:
                        dxt -- Gradient of input data at time-step t, of shape (n_x, m)
                        da_prev -- Gradient w.r.t. the previous hidden state, numpy array of shape (n_a, m)
                        dc_prev -- Gradient w.r.t. the previous memory state, of shape (n_a, m)
                        dWf -- Gradient w.r.t. the weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                        dWi -- Gradient w.r.t. the weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                        dWc -- Gradient w.r.t. the weight matrix of the candidate value (first tanh), numpy array of shape (n_a, n_a + n_x)
                        dWo -- Gradient w.r.t. the weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                        dbf -- Gradient w.r.t. biases of the forget gate, of shape (n_a, 1)
                        dbi -- Gradient w.r.t. biases of the update gate, of shape (n_a, 1)
                        dbc -- Gradient w.r.t. biases of the candidate value (first tanh), of shape (n_a, 1)
                        dbo -- Gradient w.r.t. biases of the output gate, of shape (n_a, 1)
    """

    # Retrieve information from "cache"
    (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters) = cache
    

    # Retrieve dimensions from the shapes of xt and a_next
    n_x, m = xt.shape
    n_a, m = a_next.shape
    
    # Compute the gate-related derivatives, equations (7)-(10)
    dot = da_next * np.tanh(c_next) * ot * (1 - ot)
    dcct = (dc_next * it + ot * (1 - np.square(np.tanh(c_next))) * it * da_next) * (1 - np.square(cct))
    dit = (dc_next * cct + ot * (1 - np.square(np.tanh(c_next))) * cct * da_next) * it * (1 - it)
    dft = (dc_next * c_prev + ot * (1 - np.square(np.tanh(c_next))) * c_prev * da_next) * ft * (1 - ft)

    # Compute the parameter-related gradients, equations (11)-(14)
    concat = np.concatenate((a_prev, xt), axis=0).T
    dWf = np.dot(dft, concat)
    dWi = np.dot(dit, concat)
    dWc = np.dot(dcct, concat)
    dWo = np.dot(dot, concat)
    dbf = np.sum(dft, axis=1, keepdims=True)  
    dbi = np.sum(dit, axis=1, keepdims=True)  
    dbc = np.sum(dcct, axis=1, keepdims=True)  
    dbo = np.sum(dot, axis=1, keepdims=True) 

    # Compute the derivatives w.r.t. the previous hidden state, the previous memory state and the input, equations (15)-(17)
    da_prev = np.dot(parameters["Wf"][:, :n_a].T, dft) + np.dot(parameters["Wc"][:, :n_a].T, dcct) + np.dot(parameters["Wi"][:, :n_a].T, dit) + np.dot(parameters["Wo"][:, :n_a].T, dot)
    dc_prev = dc_next * ft + ot * (1-np.square(np.tanh(c_next))) * ft * da_next
    dxt = np.dot(parameters["Wf"][:, n_a:].T, dft) + np.dot(parameters["Wc"][:, n_a:].T, dcct) + np.dot(parameters["Wi"][:, n_a:].T, dit) + np.dot(parameters["Wo"][:, n_a:].T, dot)
 
    
    # Save gradients in dictionary
    gradients = {"dxt": dxt, "da_prev": da_prev, "dc_prev": dc_prev, "dWf": dWf,"dbf": dbf, "dWi": dWi,"dbi": dbi,
                "dWc": dWc,"dbc": dbc, "dWo": dWo,"dbo": dbo}

    return gradients
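
The expressions for dc_prev and dWf can be checked in the same finite-difference style. The sketch below is self-contained and uses the surrogate loss sum(a_next * da_next) + sum(c_next * dc_next), so that the upstream gradients are exactly da_next and dc_next. The dictionaries W and b, the helper numeric_grad, the sizes and eps are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_a, m = 2, 3, 4                        # illustrative sizes
xt = rng.standard_normal((n_x, m))
a_prev = rng.standard_normal((n_a, m))
c_prev = rng.standard_normal((n_a, m))
W = {k: rng.standard_normal((n_a, n_a + n_x)) for k in ("f", "i", "c", "o")}
b = {k: rng.standard_normal((n_a, 1)) for k in ("f", "i", "c", "o")}
da_next = rng.standard_normal((n_a, m))
dc_next = rng.standard_normal((n_a, m))
concat = np.concatenate((a_prev, xt), axis=0)

def step(c_in, Wf):
    # One LSTM step with the same equations as lstm_cell_forward
    ft = sigmoid(Wf @ concat + b["f"])
    it = sigmoid(W["i"] @ concat + b["i"])
    cct = np.tanh(W["c"] @ concat + b["c"])
    ot = sigmoid(W["o"] @ concat + b["o"])
    c_next = ft * c_in + it * cct
    a_next = ot * np.tanh(c_next)
    return a_next, c_next, ft, ot

def loss(c_in, Wf):
    a_next, c_next, _, _ = step(c_in, Wf)
    return np.sum(a_next * da_next) + np.sum(c_next * dc_next)

# Analytic gradients
a_next, c_next, ft, ot = step(c_prev, W["f"])
dc_total = dc_next + ot * (1 - np.tanh(c_next) ** 2) * da_next
dc_prev = dc_total * ft
dft = dc_total * c_prev * ft * (1 - ft)
dWf = dft @ concat.T

def numeric_grad(f, arr, eps=1e-6):
    # Central differences, one entry at a time
    grad = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        plus, minus = arr.copy(), arr.copy()
        plus[idx] += eps
        minus[idx] -= eps
        grad[idx] = (f(plus) - f(minus)) / (2 * eps)
    return grad

err_c = np.max(np.abs(dc_prev - numeric_grad(lambda c: loss(c, W["f"]), c_prev)))
err_W = np.max(np.abs(dWf - numeric_grad(lambda Wf: loss(c_prev, Wf), W["f"])))
```

Here dc_total collects both paths into c_next: the direct one (dc_next) and the one through a_next = ot * tanh(c_next). It is the factor shared by dcct, dit, dft and dc_prev in the function above.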

LSTM的反向传播

def lstm_backward(da, caches):
    
    """
    Implement the backward pass for the RNN with LSTM-cell (over a whole sequence).

    Arguments:
    da -- Gradients w.r.t the hidden states, numpy-array of shape (n_a, m, T_x)
    caches -- cache storing information from the forward pass (lstm_forward)

    Returns:
    gradients -- python dictionary containing:
                        dx -- Gradient of inputs, of shape (n_x, m, T_x)
                        da0 -- Gradient w.r.t. the initial hidden state, numpy array of shape (n_a, m)
                        dWf -- Gradient w.r.t. the weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                        dWi -- Gradient w.r.t. the weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                        dWc -- Gradient w.r.t. the weight matrix of the candidate value (first tanh), numpy array of shape (n_a, n_a + n_x)
                        dWo -- Gradient w.r.t. the weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                        dbf -- Gradient w.r.t. biases of the forget gate, of shape (n_a, 1)
                        dbi -- Gradient w.r.t. biases of the update gate, of shape (n_a, 1)
                        dbc -- Gradient w.r.t. biases of the candidate value (first tanh), of shape (n_a, 1)
                        dbo -- Gradient w.r.t. biases of the output gate, of shape (n_a, 1)
    """

    # Retrieve values from the first cache (t=1) of caches.
    (caches, x) = caches
    (a1, c1, a0, c0, f1, i1, cc1, o1, x1, parameters) = caches[0]
    
    ### START CODE HERE ###
    # Retrieve dimensions from da's and x1's shapes (≈2 lines)
    n_a, m, T_x = da.shape
    n_x, m = x1.shape
    
    # initialize the gradients with the right sizes (≈12 lines)
    dx = np.zeros([n_x, m, T_x])
    da0 = np.zeros([n_a, m])
    da_prevt = np.zeros([n_a, m])
    dc_prevt = np.zeros([n_a, m])
    dWf = np.zeros([n_a, n_a + n_x])
    dWi = np.zeros([n_a, n_a + n_x])
    dWc = np.zeros([n_a, n_a + n_x])
    dWo = np.zeros([n_a, n_a + n_x])
    dbf = np.zeros([n_a, 1])
    dbi = np.zeros([n_a, 1])
    dbc = np.zeros([n_a, 1])
    dbo = np.zeros([n_a, 1])
    
    # loop back over the whole sequence
    for t in reversed(range(T_x)):
        # Compute all gradients using lstm_cell_backward. The hidden state at step t receives
        # gradient both directly (da[:,:,t]) and from step t+1 (da_prevt).
        gradients = lstm_cell_backward(da[:,:,t] + da_prevt, dc_prevt, caches[t])
        # Carry the gradients of the previous hidden and memory states back to step t-1
        da_prevt, dc_prevt = gradients['da_prev'], gradients["dc_prev"]
        # Store or add the gradient to the parameters' previous step's gradient
        dx[:,:,t] = gradients['dxt']
        dWf = dWf + gradients['dWf']
        dWi = dWi + gradients['dWi']
        dWc = dWc + gradients['dWc']
        dWo = dWo + gradients['dWo']
        dbf = dbf + gradients['dbf']
        dbi = dbi + gradients['dbi']
        dbc = dbc + gradients['dbc']
        dbo = dbo + gradients['dbo']
    # Set the first activation's gradient to the backpropagated gradient da_prev.
    da0 = da_prevt
    
    ### END CODE HERE ###

    # Store the gradients in a python dictionary
    gradients = {"dx": dx, "da0": da0, "dWf": dWf,"dbf": dbf, "dWi": dWi,"dbi": dbi,
                "dWc": dWc,"dbc": dbc, "dWo": dWo,"dbo": dbo}
    
    return gradients
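
As with the basic RNN, backpropagation through time for the LSTM has to carry two quantities from step t+1 back to step t: the gradient of the hidden state (da_prevt) and the gradient of the memory state (dc_prevt). The self-contained sketch below checks da0 on a tiny LSTM against central differences. The surrogate loss (sum over t of a_t * da[:, :, t]), the dictionaries W and b, the sizes and eps are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_a, m, T_x = 2, 3, 2, 3                # illustrative sizes
x = rng.standard_normal((n_x, m, T_x))
a0 = rng.standard_normal((n_a, m))
W = {k: rng.standard_normal((n_a, n_a + n_x)) for k in "fico"}
b = {k: rng.standard_normal((n_a, 1)) for k in "fico"}
da = rng.standard_normal((n_a, m, T_x))      # direct gradient on each hidden state

def forward(a_init):
    a_t, c_t, cache = a_init, np.zeros((n_a, m)), []
    for t in range(T_x):
        concat = np.concatenate((a_t, x[:, :, t]), axis=0)
        ft = sigmoid(W["f"] @ concat + b["f"])
        it = sigmoid(W["i"] @ concat + b["i"])
        cct = np.tanh(W["c"] @ concat + b["c"])
        ot = sigmoid(W["o"] @ concat + b["o"])
        c_prev = c_t
        c_t = ft * c_prev + it * cct
        a_t = ot * np.tanh(c_t)
        cache.append((a_t, c_t, c_prev, ft, it, cct, ot))
    return cache

def loss(a_init):
    return sum(np.sum(step[0] * da[:, :, t]) for t, step in enumerate(forward(a_init)))

# Backpropagation through time, carrying both da_prevt and dc_prevt
cache = forward(a0)
da_prevt = np.zeros((n_a, m))
dc_prevt = np.zeros((n_a, m))
for t in reversed(range(T_x)):
    a_t, c_t, c_prev, ft, it, cct, ot = cache[t]
    da_t = da[:, :, t] + da_prevt
    dc_total = dc_prevt + ot * (1 - np.tanh(c_t) ** 2) * da_t
    d_ot = da_t * np.tanh(c_t) * ot * (1 - ot)
    d_cct = dc_total * it * (1 - cct ** 2)
    d_it = dc_total * cct * it * (1 - it)
    d_ft = dc_total * c_prev * ft * (1 - ft)
    da_prevt = (W["f"][:, :n_a].T @ d_ft + W["i"][:, :n_a].T @ d_it
                + W["c"][:, :n_a].T @ d_cct + W["o"][:, :n_a].T @ d_ot)
    dc_prevt = dc_total * ft
da0 = da_prevt

# Numerical gradient by central differences
eps = 1e-6
da0_num = np.zeros_like(a0)
for idx in np.ndindex(a0.shape):
    plus, minus = a0.copy(), a0.copy()
    plus[idx] += eps
    minus[idx] -= eps
    da0_num[idx] = (loss(plus) - loss(minus)) / (2 * eps)

max_err = np.max(np.abs(da0 - da0_num))
```

If either carried gradient is left at zero inside the loop, the analytic da0 no longer matches the numerical one for T_x greater than 1.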
