Using the Counter class from the collections module
import collections

sentences = [
    ['the', 'cat', 'is', 'running', 'in', 'the', 'room'],
    ['I', 'love', 'you', 'very', 'much'],
    ['my', 'kids', 'are', 'smart']]
mini_frq = 1  # minimum count a word needs to be kept in the vocabulary
word_counts = collections.Counter()
for sent in sentences:
    word_counts.update(sent)  # add each sentence's word counts to the running totals
print(word_counts)
print('=======================================')
print(word_counts.most_common())  # sort by count (descending) and return a list of (word, count) tuples
print('=========================================')
vocabulary_inv = ['<START>', '<UNK>', '<END>'] + \
    [x[0] for x in word_counts.most_common() if x[1] >= mini_frq]
print(vocabulary_inv)  # index -> word
print('==========================================')
vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
print(vocabulary)  # word -> index

Output:
Counter({'the': 2, 'cat': 1, 'is': 1, 'running': 1, 'in': 1, 'room': 1, 'I': 1, 'love': 1, 'you': 1, 'very': 1, 'much': 1, 'my': 1, 'kids': 1, 'are': 1, 'smart': 1})
=======================================
[('the', 2), ('cat', 1), ('is', 1), ('running', 1), ('in', 1), ('room', 1), ('I', 1), ('love', 1), ('you', 1), ('very', 1), ('much', 1), ('my', 1), ('kids', 1), ('are', 1), ('smart', 1)]
=========================================
['<START>', '<UNK>', '<END>', 'the', 'cat', 'is', 'running', 'in', 'room', 'I', 'love', 'you', 'very', 'much', 'my', 'kids', 'are', 'smart']
==========================================
{'<START>': 0, '<UNK>': 1, '<END>': 2, 'the': 3, 'cat': 4, 'is': 5, 'running': 6, 'in': 7, 'room': 8, 'I': 9, 'love': 10, 'you': 11, 'very': 12, 'much': 13, 'my': 14, 'kids': 15, 'are': 16, 'smart': 17}
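The steps above can be wrapped into a small helper, which also makes it easier to see what the frequency threshold and the `<UNK>` entry are for. This is only a minimal sketch: the function names `build_vocab` and `encode`, the parameter `min_freq`, and the choice to wrap every encoded sentence in `<START>`/`<END>` are assumptions made for illustration and are not part of the original code.

```python
import collections


def build_vocab(sentences, min_freq=1):
    """Hypothetical helper: build (vocabulary_inv, vocabulary) as in the code above."""
    word_counts = collections.Counter()
    for sent in sentences:
        word_counts.update(sent)
    # Special tokens come first, then words ordered by descending count.
    vocabulary_inv = ['<START>', '<UNK>', '<END>'] + [
        word for word, count in word_counts.most_common() if count >= min_freq
    ]
    vocabulary = {word: i for i, word in enumerate(vocabulary_inv)}
    return vocabulary_inv, vocabulary


def encode(sentence, vocabulary):
    """Hypothetical helper: map tokens to ids; out-of-vocabulary tokens map to <UNK>."""
    unk = vocabulary['<UNK>']
    ids = [vocabulary.get(tok, unk) for tok in sentence]
    # Assumption: sentences are wrapped in <START> ... <END>.
    return [vocabulary['<START>']] + ids + [vocabulary['<END>']]


sentences = [
    ['the', 'cat', 'is', 'running', 'in', 'the', 'room'],
    ['I', 'love', 'you', 'very', 'much'],
    ['my', 'kids', 'are', 'smart']]

# With a threshold of 2, only 'the' is frequent enough to be kept.
vocabulary_inv, vocabulary = build_vocab(sentences, min_freq=2)
print(vocabulary_inv)                       # ['<START>', '<UNK>', '<END>', 'the']
print(encode(['the', 'cat'], vocabulary))   # [0, 3, 1, 2]
```

With the threshold raised to 2, 'cat' is dropped from the vocabulary, so it is encoded as the id of `<UNK>` (1) instead of raising a KeyError.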
The code above uses the collections module. For a detailed explanation of the module, see:
http://www.pythoner.com/205.html
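As a quick illustration of the Counter behavior the code above relies on, here is a small sketch. The variable name `counts` and the sample words are chosen only for this example.

```python
import collections

counts = collections.Counter()
counts.update(['the', 'cat'])
counts.update(['the', 'room'])   # update() adds to the existing counts instead of replacing them

print(counts['the'])             # 2
print(counts['dog'])             # 0: a missing key returns 0 instead of raising KeyError
print(counts.most_common(1))     # [('the', 2)]
```

Because `update()` accumulates counts, calling it once per sentence in a loop yields totals over the whole corpus.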