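This is the first programming assignment of week 5 of Introduction to Advanced Machine Learning, the first course in the series from the Higher School of Economics (Russia). The goal is to train a language model that generates names.
The assignment has only one task. Difficulty: easy.
0. Read the file and build the dictionary. For ways to build the dictionary, see another blog post,
"Differences between string, tuple, list, and dictionary in Python (part 2: advanced usage and type conversion)".
1. Build an RNN from keras layers, then train it.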
Generating names with recurrent neural networks
This time you’ll find yourself delving into the heart (and other intestines) of recurrent neural networks on a class of toy problems.
Struggle to find a name for the variable? Let’s see how you’ll come up with a name for your son/daughter. Surely no human has expertise over what is a good child name, so let us train an RNN instead.
It’s dangerous to go alone, take these:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Our data
The dataset contains ~8k earthling names from different cultures, all in Latin transcription.
This notebook has been designed so as to allow you to quickly swap names for something similar: deep learning article titles, IKEA furniture, pokemon names, etc.
import os
start_token = " "
with open("names") as f:
    names = f.read()[:-1].split('\n')
    names = [start_token+name for name in names]
print('n samples = ',len(names))
for x in names[::1000]:  # print every 1000th name
    print(x)
n samples = 7944
Abagael
Claresta
Glory
Liliane
Prissie
Geeta
Giovanne
Piggy
MAX_LENGTH = max(map(len,names))
print("max length =", MAX_LENGTH)
plt.title('Sequence length distribution')
plt.hist(list(map(len,names)),bins=25);
max length = 16
Text processing
First we need to collect a “vocabulary” of all unique tokens, i.e. unique characters. We can then encode inputs as a sequence of character ids.
#all unique characters go here
tokens = set(''.join(names[:]))
print(tokens)
tokens = list(tokens)
n_tokens = len(tokens)
print ('n_tokens = ',n_tokens)
#assert 50 < n_tokens < 60
{'s', 'q', 't', 'w', 'j', 'P', 'U', 'e', 'V', 'i', 'X', 'F', 'v', 'T', 'H', 'g', 'C', 'Q', 'z', '-', 'n', 'Z', 'B', 'A', 'K', 'O', 'L', 'R', 'y', 'p', 'x', 'm', 'd', 'W', ' ', 'G', 'D', 'S', 'Y', 'o', 'u', 'N', 'f', 'a', 'l', 'J', "'", 'I', 'b', 'c', 'k', 'E', 'r', 'h', 'M'}
n_tokens = 55
Cast everything from symbols into identifiers
Tensorflow string manipulation is a bit tricky, so we’ll work around it.
We’ll feed our recurrent neural network with ids of characters from our dictionary.
To create such a dictionary, let’s assign each symbol its index in tokens:
token_to_id = { ch:i for i,ch in enumerate((tokens)) }
###YOUR CODE HERE: create a dictionary of {symbol -> its index in tokens }
token_to_id
{' ': 34,
"'": 46,
'-': 19,
'A': 23,
'B': 22,
'C': 16,
'D': 36,
'E': 51,
'F': 11,
'G': 35,
'H': 14,
'I': 47,
'J': 45,
'K': 24,
'L': 26,
'M': 54,
'N': 41,
'O': 25,
'P': 5,
'Q': 17,
'R': 27,
'S': 37,
'T': 13,
'U': 6,
'V': 8,
'W': 33,
'X': 10,
'Y': 38,
'Z': 21,
'a': 43,
'b': 48,
'c': 49,
'd': 32,
'e': 7,
'f': 42,
'g': 15,
'h': 53,
'i': 9,
'j': 4,
'k': 50,
'l': 44,
'm': 31,
'n': 20,
'o': 39,
'p': 29,
'q': 1,
'r': 52,
's': 0,
't': 2,
'u': 40,
'v': 12,
'w': 3,
'x': 30,
'y': 28,
'z': 18}
assert len(tokens) == len(token_to_id), "dictionaries must have same size"
for i in range(n_tokens):
    assert token_to_id[tokens[i]] == i, "token identifier must be its position in tokens list"
print("Seems alright!")
Seems alright!
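Because tokens is built from a Python set, the id assigned to each character can differ between runs, so the ids printed above need not match yours. A minimal sketch, assuming a tiny hypothetical list sample_names instead of the real file, of building the same kind of mapping in a stable order with sorted, together with the inverse lookup (encode and decode are illustrative helpers, not part of the assignment):

```python
# Hypothetical toy data; the real notebook reads ~8k names from the "names" file.
sample_names = [" Abagael", " Glory", " Prissie"]

# Sorting the character set makes the ids reproducible across runs.
sample_tokens = sorted(set("".join(sample_names)))
sample_token_to_id = {ch: i for i, ch in enumerate(sample_tokens)}

def encode(name):
    """Map a string to a list of character ids."""
    return [sample_token_to_id[ch] for ch in name]

def decode(ids):
    """Map a list of character ids back to a string."""
    return "".join(sample_tokens[i] for i in ids)

ids = encode(" Glory")
print(ids)
print(decode(ids))
```

With a sorted vocabulary the start token " " always gets id 0, since a space sorts before letters.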
def to_matrix(names,max_len=None,pad=0,dtype='int32'):
    """Casts a list of names into rnn-digestable matrix"""
    max_len = max_len or max(map(len,names))
    names_ix = np.zeros([len(names),max_len],dtype) + pad
    for i in range(len(names)):
        name_ix = list(map(token_to_id.get,names[i]))
        names_ix[i,:len(name_ix)] = name_ix
    return names_ix.T
#Example: cast 4 random names to matrices, pad with zeros
print('\n'.join(names[::2000]))
print(to_matrix(names[::2000]).T)
Abagael
Glory
Prissie
Giovanne
[[34 23 48 43 15 43 7 44 0]
[34 35 44 39 52 28 0 0 0]
[34 5 52 9 0 0 9 7 0]
[34 35 9 39 12 43 20 20 7]]
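In the run shown above, 's' happens to have id 0, which is also the default pad value, so padding and a real 's' are indistinguishable in the matrix (compare the two zeros inside the third row with the zero at its end). One possible workaround, sketched here with a hypothetical toy vocabulary and a helper named to_matrix_padded that is not part of the assignment, is to pad with the id of the start token, assuming that token never occurs inside a name:

```python
import numpy as np

# Hypothetical toy vocabulary in which 's' gets id 0, as in the run shown above.
toy_tokens = ["s", " ", "P", "r", "i", "e", "G", "l", "o", "y"]
toy_token_to_id = {ch: i for i, ch in enumerate(toy_tokens)}

def to_matrix_padded(names, pad, max_len=None, dtype="int32"):
    """Encode names as a [batch, max_len] id matrix, filling unused cells with `pad`."""
    max_len = max_len or max(map(len, names))
    matrix = np.full([len(names), max_len], pad, dtype=dtype)
    for i, name in enumerate(names):
        ids = [toy_token_to_id[ch] for ch in name]
        matrix[i, :len(ids)] = ids
    return matrix

toy_names = [" Prissie", " Glory"]
# Assumption: the start token " " never occurs inside a name, so it is a safe pad id.
pad_id = toy_token_to_id[" "]
print(to_matrix_padded(toy_names, pad=pad_id))
```

Unlike to_matrix, this sketch returns the matrix batch-major (one name per row) and leaves any transposition to the caller.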
Recurrent neural network
We can rewrite a recurrent neural network as a consecutive application of a dense layer to the input x_t and the previous rnn state h_t. This is exactly what we’re gonna do now.
Since we’re training a language model, there should also be:
* An embedding layer that converts character id x_t to a vector.
* An output layer that predicts probabilities of the next character.
import keras
from keras.layers import Concatenate,Dense,Embedding
rnn_num_units = 64
embedding_size = 16
#Let's create layers for our recurrent network
#Note: we create layers but we don't "apply" them yet
embed_x = Embedding(n_tokens,embedding_size) # an embedding layer that converts character ids into embeddings
#a dense layer that maps input and previous state to new hidden state, [x_t,h_t]->h_t+1
get_h_next = Dense(rnn_num_units, activation='relu') ###YOUR CODE HERE
#a dense layer that maps current hidden state to probabilities of characters [h_t+1]->P(x_t+1|h_t+1)
get_probas = Dense(n_tokens, activation = 'softmax')###YOUR CODE HERE
#Note: please either set the correct activation to Dense or write it manually in rnn_one_step
def rnn_one_step(x_t, h_t):
    """
    Recurrent neural network step that produces next state and output
    given prev input and previous state.
    We'll call this method repeatedly to produce the whole sequence.
    Follow inline instructions to complete the function.
    """
    #convert character id into embedding
    x_t_emb = embed_x(tf.reshape(x_t,[-1,1]))[:,0]
    #concatenate x embedding and previous h state
    x_and_h = tf.concat([x_t_emb,h_t],1)###YOUR CODE HERE
    #compute next state given x_and_h
    h_next = get_h_next(x_and_h)###YOUR CODE HERE
    #get probabilities for language model P(x_next|h_next)
    output_probas = get_probas(h_next)###YOUR CODE HERE
    return output_probas,h_next
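For intuition, the same step can be written in plain NumPy. This is only a sketch with randomly initialized, hypothetical parameters (emb, W_h, b_h, W_out, b_out) standing in for the Keras layers; it is not the code the assignment expects:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tok, emb_size, n_units = 55, 16, 64   # sizes mirror the notebook's settings

# Hypothetical parameters standing in for embed_x, get_h_next and get_probas.
emb = rng.normal(scale=0.1, size=(n_tok, emb_size))
W_h = rng.normal(scale=0.1, size=(emb_size + n_units, n_units))
b_h = np.zeros(n_units)
W_out = rng.normal(scale=0.1, size=(n_units, n_tok))
b_out = np.zeros(n_tok)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def rnn_step_np(x_ids, h_prev):
    """One step: ids [batch], state [batch, n_units] -> (probabilities, next state)."""
    x_emb = emb[x_ids]                                  # embedding lookup
    x_and_h = np.concatenate([x_emb, h_prev], axis=1)   # [x_t, h_t]
    h_next = np.maximum(0.0, x_and_h @ W_h + b_h)       # dense layer + ReLU
    probs = softmax(h_next @ W_out + b_out)             # dense layer + softmax
    return probs, h_next

probs, h = rnn_step_np(np.array([3, 7]), np.zeros((2, n_units)))
print(probs.shape, h.shape)
```

Each row of probs sums to 1, so it can be read directly as a distribution over the next character.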
RNN loop
Once rnn_one_step is ready, let’s apply it in a loop over name characters to get predictions.
Let’s assume that all names are at most length-16 for now, so we can simply iterate over them in a for loop.
input_sequence = tf.placeholder('int32',(MAX_LENGTH,None))
batch_size = tf.shape(input_sequence)[1]
predicted_probas = []
h_prev = tf.zeros([batch_size,rnn_num_units]) #initial hidden state
for t in range(MAX_LENGTH):
    x_t = input_sequence[t]
    probas_next,h_next = rnn_one_step(x_t,h_prev)
    h_prev = h_next
    predicted_probas.append(probas_next)
predicted_probas = tf.stack(predicted_probas)
RNN: loss and gradients
Let’s gather a matrix of predictions for the next character and the corresponding correct answers.
Our network can then be trained by minimizing crossentropy between predicted probabilities and those answers.
predictions_matrix = tf.reshape(predicted_probas[:-1],[-1,len(tokens)])
answers_matrix = tf.one_hot(tf.reshape(input_sequence[1:],[-1]), n_tokens)
loss = tf.reduce_mean(tf.reduce_sum(-answers_matrix*tf.log(tf.clip_by_value(predictions_matrix,1e-10,1.0)), reduction_indices=[1]))
#loss = <define loss as categorical crossentropy. Mind that predictions are probabilities and NOT logits!>
optimize = tf.train.AdamOptimizer().minimize(loss)
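The shift by one step is the important detail in the loss: the prediction made at step t is scored against the character at step t+1. A small NumPy sketch of the same computation, using made-up uniform probabilities and a hypothetical helper sequence_crossentropy:

```python
import numpy as np

def sequence_crossentropy(probas, ids, eps=1e-10):
    """
    probas: [time, batch, n_tokens] predicted next-character probabilities.
    ids:    [time, batch] integer character ids (time-major, like to_matrix output).
    Returns the mean of -log p(correct next character) over all positions.
    """
    preds = probas[:-1]      # the last step has no "next character" to score
    targets = ids[1:]        # the first character is never predicted
    t_idx, b_idx = np.indices(targets.shape)
    p_correct = preds[t_idx, b_idx, targets]
    return -np.mean(np.log(np.clip(p_correct, eps, 1.0)))

# Made-up example: 4 time steps, batch of 2, 5 tokens, uniform predictions.
toy_probas = np.full((4, 2, 5), 1.0 / 5)
toy_ids = np.array([[0, 0], [1, 2], [3, 4], [2, 2]])
toy_loss = sequence_crossentropy(toy_probas, toy_ids)
print(toy_loss)   # equals log(5) for uniform predictions over 5 tokens
```

Picking out the probability of the correct token gives the same value as multiplying by a one-hot matrix and summing, which is what the TensorFlow expression does.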
The training loop
from IPython.display import clear_output
from random import sample
s = keras.backend.get_session()
s.run(tf.global_variables_initializer())
history = []
for i in range(1000):
    batch = to_matrix(sample(names,32),max_len=MAX_LENGTH)
    loss_i,_ = s.run([loss,optimize],{input_sequence:batch})
    history.append(loss_i)
    if (i+1)%100==0:
        clear_output(True)
        plt.plot(history,label='loss')
        plt.legend()
        plt.show()
assert np.mean(history[:10]) > np.mean(history[-10:]), "RNN didn't converge."
RNN: sampling
Once we’ve trained our network a bit, let’s get to actually generating stuff. All we need is the rnn_one_step function you have written above.
x_t = tf.placeholder('int32',(None,))
h_t = tf.Variable(np.zeros([1,rnn_num_units],'float32'))
next_probs,next_h = rnn_one_step(x_t,h_t)
def generate_sample(seed_phrase=' ',max_length=MAX_LENGTH):
    '''
    Generates a name that starts with seed_phrase.
    parameters:
        seed_phrase: the starting text; every character must be in tokens.
        max_length: total length of the generated sequence, seed included.
    '''
    x_sequence = [token_to_id[token] for token in seed_phrase]
    s.run(tf.assign(h_t,h_t.initial_value))
    #feed the seed phrase, if any
    for ix in x_sequence[:-1]:
        s.run(tf.assign(h_t,next_h),{x_t:[ix]})
    #start generating
    for _ in range(max_length-len(seed_phrase)):
        x_probs,_ = s.run([next_probs,tf.assign(h_t,next_h)],{x_t:[x_sequence[-1]]})
        x_sequence.append(np.random.choice(n_tokens,p=x_probs[0]))
    return ''.join([tokens[ix] for ix in x_sequence])
for _ in range(10):
    print(generate_sample())
Marderdesssssss
Rinssssssssssss
Meresssssssssss
Neunyssssssssss
Ollelaveassssss
Hrercnossssssss
Waltiesssssssss
Calriesssssssss
Zalinhsssssssss
Halonilasssssss
for _ in range(50):
    print(generate_sample(' Trump'))
Trumpasssssssss
Trumphealesssss
Trumphllassssss
Trumpesssssssss
Trumpadssssssss
Trumpanssssssss
Trumpheanssssss
Trumphasassssss
Trumpiresssssss
Trumpinvessssss
Trumpisssssssss
Trumpeeolysssss
Trumpigesssssss
Trumpnassssssss
Trumpyrasssssss
Trumprnesssssss
Trumposdassssss
Trumpedonssssss
Trumpleilesssss
Trumpadssssssss
Trumpesssssssss
Trumpssssssssss
Trumpiessssssss
Trumpoodsssssss
Trumpurasssssss
Trumpystsssssss
Trumpilysssssss
Trumpenssssssss
Trumpysssssssss
Trumpssssssssss
Trumpasnsssssss
Trumpiessssssss
Trumpasssssssss
Trumplhssssssss
Trumpoinsssssss
Trumpiessssssss
Trumpornassssss
Trumpanasssssss
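Stripped of the TensorFlow session calls, the sampling procedure is a short loop: feed the last character, get a distribution, draw the next character from it. A minimal sketch, in which toy_step_fn is a hypothetical stand-in for the trained network and the three-character vocabulary is made up:

```python
import numpy as np

toy_vocab = [" ", "a", "b"]

def toy_step_fn(char_id, state):
    """Hypothetical stand-in for a trained RNN step: returns (probabilities, new state)."""
    # Always predict the character following the current one, cyclically, with certainty.
    probs = np.zeros(len(toy_vocab))
    probs[(char_id + 1) % len(toy_vocab)] = 1.0
    return probs, state

def sample_sequence(step_fn, seed_ids, max_length, rng):
    """Feed the seed, then repeatedly draw the next id from the predicted distribution."""
    ids = list(seed_ids)
    state = None
    for ix in ids[:-1]:                 # warm up the state on the seed phrase
        _, state = step_fn(ix, state)
    while len(ids) < max_length:
        probs, state = step_fn(ids[-1], state)
        ids.append(int(rng.choice(len(probs), p=probs)))
    return ids

sampled_ids = sample_sequence(toy_step_fn, seed_ids=[0], max_length=6,
                              rng=np.random.default_rng(0))
print("".join(toy_vocab[i] for i in sampled_ids))
```

Passing the step function as an argument keeps the loop independent of how the network is implemented, so the same loop would work with a NumPy or a TensorFlow step.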
Try it out!
Disclaimer: This assignment is entirely optional. You won’t receive bonus points for it. However, it’s a fun thing to do. Please share your results on course forums.
You’ve just implemented a recurrent language model that can be tasked with generating any kind of sequence, so there’s plenty of data you can try it on:
- Novels/poems/songs of your favorite author
- News titles/clickbait titles
- Source code of Linux or Tensorflow
- Molecules in smiles format
- Melody in notes/chords format
- Ikea catalog titles
- Pokemon names
- Cards from Magic, the Gathering / Hearthstone
If you’re willing to give it a try, here’s what you wanna look at:
* Current data format is a sequence of lines, so a novel can be formatted as a list of sentences. Alternatively, you can change data preprocessing altogether.
* While some datasets are readily available, others can only be scraped from the web. Try Selenium or Scrapy for that.
* Make sure MAX_LENGTH is adjusted for longer datasets. There’s also a bonus section about dynamic RNNs at the bottom.
* More complex tasks require larger RNN architecture, try more neurons or several layers. It would also require more training iterations.
* Long-term dependencies in music, novels or molecules are better handled with LSTM or GRU.
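As a starting point for swapping in another dataset, here is a hedged sketch of the preprocessing the list above describes. The helper load_lines and its max_len cap are hypothetical, and the in-memory sample text stands in for a real file with one item per line:

```python
def load_lines(text, start_token=" ", max_len=None):
    """Split raw text into non-empty lines, prepend the start token, optionally truncate."""
    lines = [line.strip() for line in text.split("\n")]
    lines = [start_token + line for line in lines if line]
    if max_len is not None:
        lines = [line[:max_len] for line in lines]
    return lines

# Made-up sample standing in for e.g. a file of titles, one per line.
sample_text = "Deep learning for cats\n\nAttention is all you knead\nRNNs\n"
lines = load_lines(sample_text, max_len=20)
new_max_length = max(map(len, lines))   # recompute instead of reusing the value from names
print(lines)
print("max length =", new_max_length)
```

Truncating long lines is only one option; dropping them, or splitting them into several samples, may suit some datasets better.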
Good hunting!
Bonus level: dynamic RNNs
Apart from keras, there’s also a friendly tensorflow API for recurrent neural nets. It’s based around the symbolic loop function (aka scan).
This interface allows for dynamic sequence length and comes with some pre-implemented architectures.
class CustomRNN(tf.nn.rnn_cell.BasicRNNCell):
    def call(self,input,state):
        return rnn_one_step(input[:,0],state)
    @property
    def output_size(self):
        return n_tokens
cell = CustomRNN(rnn_num_units)
input_sequence = tf.placeholder('int32',(None,None))
predicted_probas, last_state = tf.nn.dynamic_rnn(cell,input_sequence[:,:,None],
                                                 time_major=True,dtype='float32')
#pass the session explicitly so this works without a default session
print(predicted_probas.eval({input_sequence:to_matrix(names[:10],max_len=50)},session=s).shape)
Note that we never used MAX_LENGTH in the code above: TF will iterate over however many time-steps you gave it.
You can also use all the pre-implemented RNN cells:
for obj in dir(tf.nn.rnn_cell)+dir(tf.contrib.rnn):
    if obj.endswith('Cell'):
        print(obj)
input_sequence = tf.placeholder('int32',(None,None))
inputs_embedded = embed_x(input_sequence)
cell = tf.nn.rnn_cell.LSTMCell(rnn_num_units)
state_sequence,last_state = tf.nn.dynamic_rnn(cell,inputs_embedded,dtype='float32')
#dynamic_rnn defaults to time_major=False, so the output is batch-major
print('LSTM visible states[batch,time,unit]:', state_sequence)