Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/API1_7/article/details/83628624
- Find a Word2Vec tool, run it, and look at the results
- Word2Vec(Google):
- Capture many linguistic regularities
For example, the vector operation vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) results in a vector that is very close to vector(‘Rome’)
- From words to phrases and beyond
Example vector for representing ‘san francisco’
- Word cosine distance
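To make the Paris/France/Italy example concrete, here is a minimal sketch of the vector arithmetic plus a cosine-similarity lookup. The 3-dimensional vectors are invented for illustration (real word2vec vectors are learned and have far more dimensions), and `cosine_similarity` and `analogy` are hypothetical helper names, not part of the word2vec tool.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(vectors, a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c).

    The three input words are excluded from the candidates.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

# Toy vectors, made up for this example only
toy_vectors = {
    "Paris":  [1.0, 0.0, 1.0],
    "France": [1.0, 0.0, 0.0],
    "Italy":  [0.0, 1.0, 0.0],
    "Rome":   [0.0, 1.0, 1.0],
    "pizza":  [0.0, 0.9, -0.5],
}

print(analogy(toy_vectors, "Paris", "France", "Italy"))  # Rome
```

Cosine distance is usually taken as 1 minus this similarity, so "very close" means a similarity near 1.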
- Word clustering
Deriving word classes from huge data sets. This is achieved by performing K-means clustering on top of the word vectors. The output is a vocabulary file with words and their corresponding class IDs
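The clustering step described above can be sketched with a plain K-means loop. This is only an illustration under simplifying assumptions: the 2-dimensional vectors are invented, the initial centroids are simply the first k words (a real run would usually pick them at random), and `kmeans` is a hypothetical helper rather than the code the word2vec tool ships.

```python
def kmeans(vectors, k, iterations=10):
    """Plain Lloyd's K-means; returns a {word: class_id} mapping."""
    words = list(vectors)
    # Deterministic start: the first k word vectors act as initial centroids
    centroids = [list(vectors[w]) for w in words[:k]]
    assignment = {}
    for _ in range(iterations):
        # Assignment step: each word joins its nearest centroid
        for w in words:
            assignment[w] = min(
                range(k),
                key=lambda c: sum(
                    (x - y) ** 2 for x, y in zip(vectors[w], centroids[c])
                ),
            )
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [vectors[w] for w in words if assignment[w] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Toy vectors, made up for this example only
toy_vectors = {
    "cat": [0.9, 0.1], "car": [0.1, 0.9],
    "dog": [1.0, 0.0], "truck": [0.0, 1.0],
    "kitten": [0.8, 0.2], "bus": [0.2, 0.8],
}

classes = kmeans(toy_vectors, k=2)
# Same shape as the vocabulary file: one word and its class ID per line
for word, class_id in sorted(classes.items()):
    print(word, class_id)
```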
- Performance
- Architecture:
- Skip-Gram: slower, better for infrequent words
- CBOW: fast
- The training algorithm:
- hierarchical softmax: better for infrequent words
- negative sampling: better for frequent words, better with low dimensional vectors
- Sub-sampling of frequent words: can improve both accuracy and speed for large data sets (useful values are in range 1e-3 to 1e-5)
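As a rough illustration of what the sub-sampling threshold does, the sketch below uses the discard formula from the word2vec paper, P(discard) = 1 - sqrt(t / f), where f is the word's share of all tokens and t is the threshold. As far as I know the released C code uses a slightly different variant, so treat this as the textbook form rather than the exact implementation. The function name and the default of 1e-4 are my own choices; the default just sits inside the range quoted above.

```python
import math

def discard_probability(word_frequency, threshold=1e-4):
    """Probability of dropping one occurrence of a word.

    word_frequency is the word's share of all tokens (count / total tokens).
    Words at or below the threshold are never dropped.
    """
    if word_frequency <= threshold:
        return 0.0
    return 1.0 - math.sqrt(threshold / word_frequency)

# A very frequent word (5% of all tokens) is dropped most of the time
print(round(discard_probability(0.05), 4))  # 0.9553
# A rare word is always kept
print(discard_probability(1e-6))  # 0.0
```

A smaller threshold drops frequent words more aggressively, which shortens training and leaves relatively more updates for rarer words.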
- Dimensionality of the word vectors: usually more is better, but not always
- Context(window) size:
- skip-gram: around 10
- CBOW: around 5
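To see how the two architectures use the context window differently, here is a small sketch that builds training examples from a token list. It is a simplification under my own assumptions: the window is fixed (the actual tool, as far as I know, shrinks it randomly per position), and the function names are hypothetical.

```python
def context_window(tokens, position, window):
    """Tokens within `window` positions of tokens[position], centre excluded."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right

def skipgram_pairs(tokens, window):
    """Skip-gram: one (centre, context) example per context word."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]

def cbow_examples(tokens, window):
    """CBOW: one (context words, centre) example per position."""
    return [
        (context_window(tokens, i, window), tokens[i])
        for i in range(len(tokens))
    ]

tokens = ["the", "quick", "brown", "fox"]
print(len(skipgram_pairs(tokens, window=1)))  # 6
print(len(cbow_examples(tokens, window=1)))   # 4
```

Skip-gram produces one example per (centre, context) pair while CBOW produces one per position, which is one plausible reading of why skip-gram trains more slowly, and the gap widens as the window grows.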
- Getting training data (the reference site has links for the datasets shown in bold)
- First billion characters from Wikipedia (use the pre-processing perl script from the bottom of Matt Mahoney’s page)
- Latest Wikipedia dump: use the same script as above to obtain clean text. Should be more than 3 billion words.
- WMT11 site: text data for several languages (duplicate sentences should be removed before training the models)
- Dataset from the "One Billion Word Language Modeling Benchmark": almost 1B words, already pre-processed text.
- UMBC webbase corpus: around 3 billion words. Needs further processing (mainly tokenization).
- Text data from more languages can be obtained at statmt.org and in the Polyglot project (I tried it myself and recommend it).
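Two of the notes above mention cleanup before training: removing duplicate sentences (WMT11) and tokenization (UMBC). A minimal sketch of both follows. It is not the perl script referenced above; the function names and the regex-based tokenizer are my own assumptions and would need adjusting for real corpora.

```python
import re

def remove_duplicate_sentences(lines):
    """Keep the first occurrence of each non-empty sentence, preserving order."""
    seen = set()
    unique = []
    for line in lines:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

def tokenize(text):
    """Lowercase, then split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

sentences = ["Hello, world!\n", "Another line.\n", "Hello, world!\n", "\n"]
cleaned = remove_duplicate_sentences(sentences)
print(cleaned)               # ['Hello, world!', 'Another line.']
print(tokenize(cleaned[0]))  # ['hello', ',', 'world', '!']
```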
- In short, Google's word2vec site has a lot worth exploring
- Factors that affect the quality of word vectors
- Amount and quality of the training data
- Size of the word vectors
- Training algorithm