Minimal usage | Gensim-FastText: training and using word vectors

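glove, word2vec and fasttext are currently the three most widely used ways of obtaining word vectors. Training any of them with the original tools used to be fairly tedious, so here I list the quick training approaches I have used myself.
For word2vec, see: python | Training word2vec with gensim and understanding the related functions
For glove, see: Minimal usage | Glove-python: training and using word vectors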

The examples here use the FastText implementation that ships inside gensim. To install fasttext itself, see:
https://github.com/facebookresearch/fastText/tree/master/python

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip install .

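Contents

2 fasttext training
2.1 Main training function
2.2 Saving and loading the model
2.3 Updating the model online with new text
2.4 Training with the C++ version of fasttext
3 Using fasttext
3.1 Getting word vectors
3.2 The word-vector vocabulary
3.3 Similarity queries, the same as in word2vec
4 fasttext compared with word2vec
References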

2 fasttext training

2.1 Main training function

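from gensim.models import FastText
sentences = [["你", "是", "谁"], ["我", "是", "中国人"]]

# Parameter names follow gensim 3.x (gensim 4.x renames size -> vector_size and iter -> epochs)
model = FastText(sentences, size=4, window=3, min_count=1, iter=10, min_n=3, max_n=6, word_ngrams=0)
model['你']     # get a word vector by indexing the model (deprecated in gensim 3.x)
model.wv['you' if False else '你']  # get a word vector through model.wv (preferred)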

The signature of the FastText class is:


class gensim.models.fasttext.FastText(sentences=None, corpus_file=None, sg=0, hs=0, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, word_ngrams=1, sample=0.001, seed=1, workers=3, min_alpha=0.0001, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, min_n=3, max_n=6, sorted_vocab=1, bucket=2000000, trim_rule=None, batch_words=10000, callbacks=())

The main parameters mean the following:

General parameters:
- model: Training architecture. Allowed values: cbow, skipgram (Default cbow). In the class signature above this corresponds to the sg flag (sg=0 for CBOW, sg=1 for skip-gram).
- size: Size of embeddings to be learnt (Default 100)
- alpha: Initial learning rate (Default 0.025)
- window: Context window size (Default 5)
- min_count: Ignore words with number of occurrences below this (Default 5)
- loss: Training objective. Allowed values: ns, hs, softmax (Default ns). In the class signature above this is controlled by hs and negative.
- sample: Threshold for downsampling higher-frequency words (Default 0.001)
- negative: Number of negative words to sample, for ns (Default 5)
- iter: Number of epochs (Default 5)
- sorted_vocab: Sort vocab by descending frequency (Default 1)
- threads: Number of threads to use (Default 12). In the class signature above this is called workers (Default 3).
fasttext-specific parameters:
- min_n: min length of char ngrams (Default 3)
- max_n: max length of char ngrams (Default 6)
- bucket: number of buckets used for hashing ngrams (Default 2000000)
Extra parameter:
- word_ngrams ({1,0}, optional)
  - If 1, enriches word vectors with subword (n-gram) information. If 0, this is equivalent to Word2Vec.
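
To make min_n, max_n and bucket more concrete, here is a minimal sketch of how a word can be split into character n-grams and how each n-gram can be mapped to a bucket. The function names are hypothetical, and the 32-bit FNV-1a hash is used as a stand-in: this illustrates the idea, not gensim's actual code.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` after wrapping it in '<' and '>'."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


def fnv1a_32(text):
    """32-bit FNV-1a hash over UTF-8 bytes (stand-in for the library's hash)."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def ngram_bucket(ngram, bucket=2000000):
    """Map an n-gram to one of `bucket` rows of the n-gram matrix."""
    return fnv1a_32(ngram) % bucket


print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
print(char_ngrams("你", 3, 6))      # ['<你>']
print([ngram_bucket(g) for g in char_ngrams("where", 3, 3)])
```

A single-character word such as '你' yields only one n-gram with min_n=3, so subword information adds little for very short words. Different n-grams can land in the same bucket; a larger bucket value reduces such collisions at the cost of memory.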

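2.2 Saving and loading the model

# Save and load the model
fname = 'fasttext.model'  # hypothetical path, replace with your own
model.save(fname)
model = FastText.load(fname)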

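2.3 Updating the model online with new text

# Online (incremental) training of fasttext
from gensim.models import FastText
sentences_1 = [["cat", "say", "meow"], ["dog", "say", "woof"]]
sentences_2 = [["dude", "say", "wazzup!"]]

model = FastText(min_count=1)
model.build_vocab(sentences_1)
# model.iter is the gensim 3.x attribute name; gensim 4.x calls it model.epochs
model.train(sentences_1, total_examples=model.corpus_count, epochs=model.iter)

# update=True adds the new words to the existing vocabulary
model.build_vocab(sentences_2, update=True)
model.train(sentences_2, total_examples=model.corpus_count, epochs=model.iter)

Online updating works by calling build_vocab with update=True on the new sentences and then calling train again.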

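2.4 Training with the C++ version of fasttext

# Use the C++ version of fasttext through gensim's wrapper
# (gensim.models.wrappers exists in gensim 3.x; it was removed in gensim 4.0)
from gensim.models.wrappers.fasttext import FastText as FT_wrapper

# Set FastText home to the path to the FastText executable
ft_home = '/home/chinmaya/GSOC/Gensim/fastText/fasttext'

# train the model; lee_train_file must hold the path to a plain-text training corpus
model_wrapper = FT_wrapper.train(ft_home, lee_train_file)

print(model_wrapper)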

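3 Using fasttext

3.1 Getting word vectors

model['你']     # direct indexing on the model (deprecated in gensim 3.x)
model.wv['你']  # indexing through model.wv (preferred)

Both forms return the word vector.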

3.2 The word-vector vocabulary

existent_word = '你'
existent_word in model.wv.vocab  # gensim 3.x; gensim 4.x uses model.wv.key_to_index
# True

3.3 Similarity queries, the same as in word2vec

These include:

model.wv.most_similar(positive=['你', '是'], negative=['中国人'])
model.wv.most_similar_cosmul(positive=['你', '是'], negative=['中国人'])

These are analogy queries. most_similar_cosmul combines the terms multiplicatively to find the closest words (reference URL).
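
To illustrate the difference between the additive and the multiplicative combination, here is a minimal sketch of the two scoring rules, following the 3CosAdd and 3CosMul formulation by Levy and Goldberg. The function names and the tiny three-dimensional vectors are hypothetical toy data, and this is not gensim's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cos_add_score(candidate, positive, negative):
    """Additive rule: cosines to positive terms minus cosines to negative terms."""
    return (sum(cosine(candidate, p) for p in positive)
            - sum(cosine(candidate, n) for n in negative))


def cos_mul_score(candidate, positive, negative, eps=1e-6):
    """Multiplicative rule: cosines shifted into [0, 1], then multiplied and divided."""
    numerator = math.prod((1 + cosine(candidate, p)) / 2 for p in positive)
    denominator = math.prod((1 + cosine(candidate, n)) / 2 for n in negative)
    return numerator / (denominator + eps)


# Hypothetical toy vectors; dimensions loosely mean (royalty, male, female)
vectors = {
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "prince": [1.0, 0.8, 0.0],
    "apple": [0.5, 0.5, 0.5],
}
positive = [vectors["king"], vectors["woman"]]
negative = [vectors["man"]]
candidates = ["queen", "prince", "apple"]

best_add = max(candidates, key=lambda w: cos_add_score(vectors[w], positive, negative))
best_mul = max(candidates, key=lambda w: cos_mul_score(vectors[w], positive, negative))
print(best_add, best_mul)
```

With the multiplicative rule a single very large or very small cosine cannot dominate the score as easily as in the additive rule, which is the motivation usually given for it.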

model.wv.doesnt_match("你 真的 是".split())  # find the word that does not match

Finds the word that does not fit with the others.
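
The idea behind this kind of query can be sketched as follows: normalize each word vector, average them, and return the word whose vector is least similar to that average. The function names and vectors below are hypothetical toy data, and the sketch only approximates what the library does.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the mean direction."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dim)])
    sims = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[sims.index(min(sims))]


# Hypothetical toy vectors
toy_vectors = {
    "breakfast": [1.0, 0.1],
    "lunch": [0.9, 0.2],
    "dinner": [1.0, 0.0],
    "car": [0.0, 1.0],
}
result = odd_one_out(["breakfast", "lunch", "dinner", "car"], toy_vectors)
print(result)  # car
```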

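model.wv.similarity('你', '是')  # similarity between two words
model.wv.n_similarity(['cat', 'say'], ['dog', 'say'])  # similarity between two lists of words

similarity computes the similarity between two words; n_similarity computes the similarity between two sets of words.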
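
As a minimal sketch of what these two queries compute: the similarity between two words is the cosine of their vectors, and the similarity between two word lists can be taken as the cosine between the mean vectors of each list. The function names and vectors are hypothetical toy data, used only to show the arithmetic.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vecs):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vecs) for column in zip(*vecs)]


def n_similarity_sketch(vecs_1, vecs_2):
    """Cosine similarity between the mean vectors of two lists of vectors."""
    return cosine(mean_vector(vecs_1), mean_vector(vecs_2))


# Hypothetical toy vectors
toy = {
    "cat": [1.0, 0.2, 0.0],
    "dog": [0.9, 0.3, 0.1],
    "say": [0.0, 1.0, 0.5],
}
pair_sim = cosine(toy["cat"], toy["dog"])
list_sim = n_similarity_sketch([toy["cat"], toy["say"]], [toy["dog"], toy["say"]])
print(round(pair_sim, 3), round(list_sim, 3))
```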

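# !pip3 install pyemd
model.wv.wmdistance(['cat', 'say'], ['dog', 'say'])  # WMD distance between two lists of words

Computes the Word Mover's Distance (WMD) between two lists of words from their word vectors.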

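4 fasttext compared with word2vec

The example notebook "Comparison of FastText and Word2Vec" gives an official comparison, within gensim, of fasttext and word2vec in terms of performance and semantic relations.
Reference blog post: https://rare-technologies.com/fasttext-and-gensim-word-embeddings/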
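Conclusions:

- FastText models with n-grams perform markedly better on syntactic tasks, because syntactic questions are tied to the morphology of the words.
- Gensim word2vec and fastText without n-grams do slightly better on semantic tasks, possibly because the words in semantic questions are standalone words that are unrelated to their char n-grams.
- In general, the models' performance seems to converge as the corpus grows. This may, however, be because the embedding dimension was held constant at 100; a larger dimension on larger corpora might yield bigger gains.
- The semantic accuracy of all models rises significantly as the corpus grows.
- For the n-gram FastText model, however, the gain in syntactic accuracy from a larger corpus is smaller (in both relative and absolute terms). This may indicate that on larger corpora the advantage of incorporating morphological information is less pronounced (the corpora used in the original paper seem to suggest the same).
- The original fastText is written in C++, whereas gensim's implementation is written in Python, so the C++ version still runs somewhat faster.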

References

1. facebookresearch/fastText
2. Example: Using FastText via Gensim
3. Example: Comparison of FastText and Word2Vec
4. Official tutorial: models.fasttext – FastText model
5. FastText and Gensim word embeddings