Andrew Ng Deep Learning Course Exercise: Trigger Word Detection


The trigger word detection exercise from the final week of Course 5 of Andrew Ng's Deep Learning specialization is a really nice one, so I'm sharing it here.
Before working through the code, I strongly recommend watching the lectures first; my condensed notes, "Andrew Ng Deep Learning, Course 5, Week 3: Sequence Models and Attention Mechanism", are a quick read.
Without further ado, let's import the packages:

import numpy as np
from pydub import AudioSegment
import random
import sys
import io
import os
import glob
import IPython
from td_utils import *
%matplotlib inline

Here td_utils contains a few helper functions provided by the course; we'll go through them at the end.

1. Creating the dataset

The dataset has three parts: positive clips, negative clips, and background noise clips. Since the trigger word we want to detect is 'activate', the positive clips are recordings of people saying 'activate', the negative clips are recordings of other words, and the backgrounds are ambient noise.

1.1 - Listening to the data

IPython.display.Audio("./raw_data/activates/1.wav")  # listen to a sample

1.2 - From audio recordings to spectrograms

The audio we use is sampled at 44,100 Hz, i.e. each second of audio is represented by 44,100 numbers, so a 10-second clip is represented by 441,000 numbers.
Deciding whether 'activate' was said directly from the raw waveform is hard, so we first compute a spectrogram of the audio. Run the code below to plot one; graph_spectrogram is explained at the end.
x = graph_spectrogram("audio_examples/example_train.wav")

1.3 - Generating a single training example

The size of the output spectrogram depends on the length of the input and on the parameters of the spectrogram software. We use 10-second clips, for which the spectrogram has $T_x = 5511$ time steps; these time steps are the input to the model.

_, data = wavfile.read("audio_examples/example_train.wav")
print("Time steps in audio recording before spectrogram", data[:,0].shape)
print("Time steps in input after spectrogram", x.shape)
#output
Time steps in audio recording before spectrogram (441000,)
Time steps in input after spectrogram (101, 5511)

After computing the spectrogram the clip has 5511 time steps, which is why we set the model's input length to 5511, with 101 frequency values at each time step; the model's output will have 1375 time steps.

Tx = 5511 # The number of time steps input to the model from the spectrogram
n_freq = 101 # Number of frequencies input to the model at each time step of the spectrogram
Ty = 1375 # The number of time steps in the output of our model

$T_y = 1375$ means we discretize the 10 seconds of output into 1375 intervals, each lasting $\frac{10}{1375} \approx 0.0072$ seconds, and for each interval the model predicts whether someone has just finished saying 'activate'.
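To make this mapping concrete, here is a minimal sketch (the 9,700 ms end time is just an illustrative value) of how a time on the 10,000 ms grid is converted to an index on the 1375-step output grid; the same formula appears later in insert_ones:

Ty = 1375                        # number of output time steps over the 10-second clip
segment_end_ms = 9700            # illustrative: an 'activate' clip ending at 9,700 ms
segment_end_y = int(segment_end_ms * Ty / 10000.0)
print(segment_end_y)             # 1333
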
Load the three sets of raw clips (the loading function is also covered at the end):

activates, negatives, backgrounds = load_raw_audio()
print("background len: " + str(len(backgrounds[0])))    # Should be 10,000, since it is a 10 sec clip
print("activate[0] len: " + str(len(activates[0])))     # Maybe around 1000, since an "activate" audio clip is usually around 1 sec (but varies a lot)
print("activate[1] len: " + str(len(activates[1])))     # Different "activate" clips can have different lengths 

The figure below illustrates how a training example is synthesized by inserting clips of several different words into a background clip. Two things to note: $y^{\langle t \rangle}$ is set to 1 only after an inserted 'activate' clip ends, and it then stays at 1 for 50 time steps; for example, if an 'activate' clip ends at output step 687, then steps 688 through 737 are all labeled 1.
(Figure: a synthesized 10-second clip and its label vector y.)
Synthesizing the training set relies on the helper functions below, which work on a 1 ms grid, i.e. the 10-second background is treated as 10,000 steps.
Helper function 1: pick a random time segment from the background; segment_ms is the length of the segment in milliseconds (the number of 1 ms steps it spans).

def get_random_time_segment(segment_ms):
    """
    Gets a random time segment of duration segment_ms in a 10,000 ms audio clip.
    Arguments:
    segment_ms -- the duration of the audio clip in ms ("ms" stands for "milliseconds")
    Returns:
    segment_time -- a tuple of (segment_start, segment_end) in ms
    """
    segment_start = np.random.randint(low=0, high=10000-segment_ms)   # Make sure segment doesn't run past the 10sec background 
    segment_end = segment_start + segment_ms - 1
    return (segment_start, segment_end)
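A quick call to check the behaviour (the 600 ms length is arbitrary, and the output varies because the start is random):

segment_time = get_random_time_segment(600)
print(segment_time)  # e.g. (2915, 3514): end = start + 600 - 1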

Helper function 2: check whether a newly drawn time segment overlaps with any segment that has already been placed.

#GRADED FUNCTION: is_overlapping
def is_overlapping(segment_time, previous_segments):# let me pass
    """
    Checks if the time of a segment overlaps with the times of existing segments.
    Arguments:
    segment_time -- a tuple of (segment_start, segment_end) for the new segment
    previous_segments -- a list of tuples of (segment_start, segment_end) for the existing segments
    Returns:
    True if the time segment overlaps with any of the existing segments, False otherwise
    """
    segment_start, segment_end = segment_time
    ### START CODE HERE ### (≈ 4 lines)
    # Step 1: Initialize overlap as a "False" flag. (≈ 1 line)
    overlap = False
    # Step 2: loop over the previous_segments start and end times.
    # Compare start/end times and set the flag to True if there is an overlap (≈ 3 lines)
    for previous_start, previous_end in previous_segments:
        if segment_start <= previous_end and segment_end >= previous_start:
            overlap = True
    ### END CODE HERE ###
    return overlap
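Two quick checks of the boundary behaviour (the segment values are made-up examples): (950, 1430) overlaps neither (2000, 2550) nor (260, 949), while (2305, 2950) starts exactly where (1900, 2305) ends, which counts as overlapping:

print(is_overlapping((950, 1430), [(2000, 2550), (260, 949)]))                  # False
print(is_overlapping((2305, 2950), [(824, 1532), (1900, 2305), (3424, 3656)]))  # True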

Helper function 3: insert an audio clip at a random position:
1. Draw a random time segment of the same length as the clip to insert.
2. Make sure this segment does not overlap any previously placed segment; if it does, draw again.
3. Append the new segment to the list of placed segments.
4. Overlay the new clip onto the background. The overlay call is pydub's AudioSegment.overlay; the other audio helpers provided by the course are covered at the end.

#GRADED FUNCTION: insert_audio_clip
def insert_audio_clip(background, audio_clip, previous_segments): # let me pass
    """
    Insert a new audio segment over the background noise at a random time step, ensuring that the 
    audio segment does not overlap with existing segments.
    Arguments:
    background -- a 10 second background audio recording.  
    audio_clip -- the audio clip to be inserted/overlaid. 
    previous_segments -- times where audio segments have already been placed 
    Returns:
    new_background -- the updated background audio
    """
    # Get the duration of the audio clip in ms
    segment_ms = len(audio_clip)
    ### START CODE HERE ### 
    # Step 1: Use one of the helper functions to pick a random time segment onto which to insert 
    # the new audio clip. (≈ 1 line)
    segment_time = get_random_time_segment(segment_ms)
    # Step 2: Check if the new segment_time overlaps with one of the previous_segments. If so, keep 
    # picking new segment_time at random until it doesn't overlap. (≈ 2 lines)
    while is_overlapping(segment_time,previous_segments):
        segment_time = get_random_time_segment(segment_ms)
    # Step 3: Add the new segment_time to the list of previous_segments (≈ 1 line)
    previous_segments.append(segment_time)
    ### END CODE HERE ###
    # Step 4: Superpose audio segment and background
    new_background = background.overlay(audio_clip, position = segment_time[0])
    return new_background, segment_time
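To try it out, you can overlay one 'activate' clip onto a background and listen to the result (the seed and the pre-existing segment (3790, 4400) are arbitrary choices for illustration):

np.random.seed(5)
audio_clip, segment_time = insert_audio_clip(backgrounds[0], activates[0], [(3790, 4400)])
audio_clip.export("insert_test.wav", format="wav")
print("Segment Time:", segment_time)
IPython.display.Audio("insert_test.wav")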

Helper function 4: write 1s into the label vector y for every inserted 'activate' clip. If an 'activate' clip ends at a given time, the 50 label values strictly after the corresponding output step are set to 1, taking care not to run past the last output step ($T_y = 1375$).

#GRADED FUNCTION: insert_ones
def insert_ones(y, segment_end_ms):
    """
    Update the label vector y. The labels of the 50 output steps strictly after the end of the segment
    should be set to 1. By strictly we mean that the label of segment_end_y itself should be 0, while the
    50 following labels should be ones.
    Arguments:
    y -- numpy array of shape (1, Ty), the labels of the training example
    segment_end_ms -- the end time of the segment in ms
    Returns:
    y -- updated labels
    """
    # Convert the end of the segment from ms to an index on the Ty output time-step grid
    segment_end_y = int(segment_end_ms * Ty / 10000.0)
    # Add 1 to the correct index in the background label (y)
    ### START CODE HERE ### (≈ 3 lines)
    for i in range(segment_end_y+1, segment_end_y+51):
        if i < Ty :
            y[0, i] = 1.0
    ### END CODE HERE ###
    return y
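A quick sanity check (the end times 9,700 ms and 4,251 ms are arbitrary illustrations): the step at the converted end index itself stays 0, and the 50 steps after it become 1.

arr1 = insert_ones(np.zeros((1, Ty)), 9700)
plt.plot(insert_ones(arr1, 4251)[0, :])
print("sanity checks:", arr1[0][1333], arr1[0][634], arr1[0][635])  # expect 0.0 1.0 0.0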

Now we can define a function that creates a training example:
1. Initialize the label vector y to zeros.
2. Initialize the list of placed time segments to an empty list.
3. Randomly pick 0-4 'activate' clips and insert them into the background.
4. Randomly pick 0-2 negative clips and insert them into the background.
Note: previous_segments keeps track of the time segments of both the positive and the negative clips.

#GRADED FUNCTION: create_training_example
def create_training_example(background, activates, negatives):
    """
    Creates a training example with a given background, activates, and negatives.
    Arguments:
    background -- a 10 second background audio recording
    activates -- a list of audio segments of the word "activate"
    negatives -- a list of audio segments of random words that are not "activate"
    Returns:
    x -- the spectrogram of the training example
    y -- the label at each time step of the spectrogram
    """
    # Set the random seed
    np.random.seed(18)
    # Make background quieter
    background = background - 20
    ### START CODE HERE ###
    # Step 1: Initialize y (label vector) of zeros (≈ 1 line)
    y = np.zeros((1,Ty))
    # Step 2: Initialize segment times as empty list (≈ 1 line)
    previous_segments = []
    ### END CODE HERE ###
    # Select 0-4 random "activate" audio clips from the entire list of "activates" recordings
    number_of_activates = np.random.randint(0, 5)
    random_indices = np.random.randint(len(activates), size=number_of_activates)
    random_activates = [activates[i] for i in random_indices]
    ### START CODE HERE ### (≈ 3 lines)
    # Step 3: Loop over randomly selected "activate" clips and insert in background
    for random_activate in random_activates:
        # Insert the audio clip on the background
        background, segment_time = insert_audio_clip(background,random_activate,previous_segments)
        # Retrieve segment_start and segment_end from segment_time
        segment_start, segment_end = segment_time
        # Insert labels in "y"
        y = insert_ones(y,segment_end)
    ### END CODE HERE ###
    # Select 0-2 random negatives audio recordings from the entire list of "negatives" recordings
    number_of_negatives = np.random.randint(0, 3)
    random_indices = np.random.randint(len(negatives), size=number_of_negatives)
    random_negatives = [negatives[i] for i in random_indices]
    ### START CODE HERE ### (≈ 2 lines)
    # Step 4: Loop over randomly selected negative clips and insert in background
    for random_negative in random_negatives:
        # Insert the audio clip on the background 
        background, _ = insert_audio_clip(background, random_negative, previous_segments)
    ### END CODE HERE ###
    # Standardize the volume of the audio clip 
    background = match_target_amplitude(background, -20.0)
    # Export new training example 
    file_handle = background.export("train" + ".wav", format="wav")
    print("File (train.wav) was saved in your directory.")
    # Get and plot spectrogram of the new recording (background with superposition of positive and negatives)
    x = graph_spectrogram("train.wav")
    return x, y

x, y = create_training_example(backgrounds[0], activates, negatives)
plt.plot(y[0])

The spectrogram of the newly generated training example:

The updated label vector y:

1.4 - Full training set and development set

The code above is all we need to generate a training set. To save time, the course instead provides a pre-generated training set that we simply load. The dev set, however, is not synthesized: it consists of real recordings collected and hand-labeled with y, so that, as discussed earlier in the specialization, the dev set comes from the same distribution as the data the system will actually be tested on.

#Load preprocessed training examples
X = np.load("./XY_train/X.npy")
Y = np.load("./XY_train/Y.npy")
#Load preprocessed dev set examples
X_dev = np.load("./XY_dev/X_dev.npy")
Y_dev = np.load("./XY_dev/Y_dev.npy")
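It is worth confirming the shapes before training; they should be (m, Tx, n_freq) for X and (m, Ty, 1) for Y, where m (the number of examples) depends on the files provided:

print("X.shape:", X.shape)          # expected (m, 5511, 101)
print("Y.shape:", Y.shape)          # expected (m, 1375, 1)
print("X_dev.shape:", X_dev.shape)
print("Y_dev.shape:", Y_dev.shape)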

2 - Model

Import the required packages. We use the high-level Keras API; the key layer is the one-dimensional convolution Conv1D.

from keras.callbacks import ModelCheckpoint
from keras.models import Model, load_model, Sequential
from keras.layers import Dense, Activation, Dropout, Input, Masking, TimeDistributed, LSTM, Conv1D
from keras.layers import GRU, Bidirectional, BatchNormalization, Reshape
from keras.optimizers import Adam

2.1 - Build the model

The model structure is shown below. The key step is the one-dimensional convolution (Conv1D), which turns the 5511 input time steps into 1375 time steps; it both extracts features and greatly reduces the amount of computation in the GRU layers (a one-line check of the 5511-to-1375 mapping follows this note).
Note: the RNN here is unidirectional, because we want to detect the trigger word as soon as it has been said; with a bidirectional RNN we would have to wait for the full 10 seconds of audio before we could tell whether 'activate' was said in the first second.
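Why 5511 becomes exactly 1375: with kernel size 15, stride 4 and no padding (Keras' default 'valid' padding for Conv1D), the output length is ⌊(5511 − 15)/4⌋ + 1:

Tx, kernel_size, stride = 5511, 15, 4
print((Tx - kernel_size) // stride + 1)  # 1375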

# GRADED FUNCTION: model

def model(input_shape):
    """
    Function creating the model's graph in Keras.
    Argument:
    input_shape -- shape of the model's input data (using Keras conventions)
    Returns:
    model -- Keras model instance
    """
    X_input = Input(shape = input_shape)
    ### START CODE HERE ###
    # Step 1: CONV layer (≈4 lines)
    X = Conv1D(196,15,strides=4)(X_input)                                 # CONV1D
    X = BatchNormalization()(X)                                 # Batch normalization
    X = Activation('relu')(X)                                 # ReLu activation
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)
    # Step 2: First GRU Layer (≈4 lines)
    X = GRU(units=128,return_sequences=True)(X)                                 # GRU (use 128 units and return the sequences)
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)
    X = BatchNormalization()(X)                                 # Batch normalization
    # Step 3: Second GRU Layer (≈4 lines)
    X = GRU(units=128,return_sequences=True)(X)                                 # GRU (use 128 units and return the sequences)
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)
    X = BatchNormalization()(X)                                 # Batch normalization
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)    
    # Step 4: Time-distributed dense layer (≈1 line)
    X = TimeDistributed(Dense(1, activation = "sigmoid"))(X) # time distributed  (sigmoid)# 将全连接层应用于输入的每个时间步
    ### END CODE HERE ###
    model = Model(inputs = X_input, outputs = X)
    return model  

Create the model:
model = model(input_shape = (Tx, n_freq))
Print a summary of the model:
model.summary()

2.2 - Fit the model

To save time, the course provides a model that has already been trained for about three hours; we load it and then run one more quick epoch of training on our data.

model = load_model('./models/tr_model.h5')
opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])
model.fit(X, Y, batch_size = 5, epochs=1)

2.3 - Test the model

loss, acc = model.evaluate(X_dev, Y_dev)
print("Dev set accuracy = ", acc)

3 - Making predictions

def detect_triggerword(filename):
    plt.subplot(2, 1, 1)
    x = graph_spectrogram(filename)
    # the spectrogram output has shape (freqs, Tx); the model expects (Tx, freqs)
    x = x.swapaxes(0, 1)  # swap the two axes
    x = np.expand_dims(x, axis=0)
    predictions = model.predict(x)
    plt.subplot(2, 1, 2)
    plt.plot(predictions[0,:,0])
    plt.ylabel('probability')
    plt.show()
    return predictions
chime_file = "audio_examples/chime.wav"
def chime_on_activate(filename, predictions, threshold):
    audio_clip = AudioSegment.from_wav(filename)
    chime = AudioSegment.from_wav(chime_file)
    Ty = predictions.shape[1]
    # Step 1: Initialize the number of consecutive output steps to 0
    consecutive_timesteps = 0
    # Step 2: Loop over the output steps in the y
    for i in range(Ty):
        # Step 3: Increment consecutive output steps
        consecutive_timesteps += 1
        # Step 4: If prediction is higher than the threshold and more than 75 consecutive output steps have passed
        if predictions[0,i,0] > threshold and consecutive_timesteps > 75:
            # Step 5: Superpose audio and background using pydub
            audio_clip = audio_clip.overlay(chime, position = ((i / Ty) * audio_clip.duration_seconds)*1000)
            # Step 6: Reset consecutive output steps to 0
            consecutive_timesteps = 0 
    audio_clip.export("chime_output.wav", format='wav')

Prediction: run the model on a dev-set clip and overlay a chime sound at every point where 'activate' is detected. The 75-step spacing in chime_on_activate corresponds to about 75 × 10/1375 ≈ 0.55 s, so a new chime can be triggered at most roughly every half second.

filename = "./raw_data/dev/1.wav"
prediction = detect_triggerword(filename)
chime_on_activate(filename, prediction, 0.5)
IPython.display.Audio("./chime_output.wav")

4. Helper functions (td_utils)

import matplotlib.pyplot as plt
from scipy.io import wavfile
import os
from pydub import AudioSegment
#Calculate and plot spectrogram for a wav audio file
def graph_spectrogram(wav_file):
    rate, data = get_wav_info(wav_file)
    nfft = 200 # Length of each window segment
    fs = 8000 # Sampling frequencies
    noverlap = 120 # Overlap between windows
    nchannels = data.ndim
    if nchannels == 1:
        pxx, freqs, bins, im = plt.specgram(data, nfft, fs, noverlap = noverlap)
    elif nchannels == 2:
        pxx, freqs, bins, im = plt.specgram(data[:,0], nfft, fs, noverlap = noverlap)
    return pxx
#Load a wav file
def get_wav_info(wav_file):
    rate, data = wavfile.read(wav_file)
    return rate, data
#Used to standardize volume of audio clip
def match_target_amplitude(sound, target_dBFS):
    change_in_dBFS = target_dBFS - sound.dBFS
    return sound.apply_gain(change_in_dBFS)
#Load raw audio files for speech synthesis
#Load the three kinds of raw clips (activates, negatives, backgrounds) into three lists
def load_raw_audio():
    activates = []
    backgrounds = []
    negatives = []
    for filename in os.listdir("./raw_data/activates"):
        if filename.endswith("wav"):
            activate = AudioSegment.from_wav("./raw_data/activates/"+filename)
            activates.append(activate)
    for filename in os.listdir("./raw_data/backgrounds"):
        if filename.endswith("wav"):
            background = AudioSegment.from_wav("./raw_data/backgrounds/"+filename)
            backgrounds.append(background)
    for filename in os.listdir("./raw_data/negatives"):
        if filename.endswith("wav"):
            negative = AudioSegment.from_wav("./raw_data/negatives/"+filename)
            negatives.append(negative)
    return activates, negatives, backgrounds


Source: blog.csdn.net/weixin_40548136/article/details/86544971