I came across this blog post today and decided to try it out myself:
https://blog.csdn.net/HOMEGREAT/article/details/79720314
1. Fixing the encoding ~~~~~
I downloaded a txt copy of Zhu Xian (诛仙) from the web, but the script errored out right away.
I figured it was an encoding problem, but I couldn't tell which encoding it was, and nothing I tried worked. The text displayed fine in some editors, yet reading it as UTF-8 still failed. Then, at a loss, I noticed that EmEditor shows the encoding in its lower-right corner~~~ heh heh
Sure enough, the file was UTF-16LE. What a trap~~
After updating the code, it ran fine:
```python
# -*- coding: utf-8 -*-
# Re-encode the novel from UTF-16LE to UTF-8.
file_out = open('诛仙_utf8.txt', 'w', encoding="utf-8")
with open('诛仙.txt', 'r', encoding="utf-16LE") as file_object:
    for line in file_object:
        line = line.strip()
        file_out.write(line + "\n")
file_out.close()
print("end")
```
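Instead of eyeballing an editor's status bar, the byte-order mark (BOM) at the start of a file can often reveal the encoding. Here is a minimal sketch using only the standard library's `codecs` BOM constants; a file saved without a BOM will still need guessing:

```python
import codecs

# Map leading byte-order marks to encoding names.
# Order matters: the UTF-32 BOMs begin with the same bytes as UTF-16's.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(path):
    """Return the encoding implied by the file's BOM, or None."""
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM is 4 bytes
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None
```

A UTF-16LE file saved by a Windows editor usually starts with `FF FE`, which is what this catches.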
2. Word segmentation ~~~~
With the re-encoded txt in hand (EmEditor's lower-right corner now does say UTF-8), I grabbed a stop-word list from here: https://www.cnblogs.com/zhangtianyuan/p/6922677.html — just copy-paste it into a text file named stop_words.txt.
```python
import jieba

# Load the stop-word list, one word per line.
stop_words_file = "stop_words.txt"
stop_words = list()
with open(stop_words_file, 'r', encoding="utf-8") as stop_words_file_object:
    contents = stop_words_file_object.readlines()
    for line in contents:
        stop_words.append(line.strip())

# Segment the novel line by line with jieba, dropping stop words.
origin_file = "诛仙_utf8.txt"
target_file = open("诛仙_cut_word.txt", 'w', encoding="utf-8")
with open(origin_file, 'r', encoding="utf-8") as origin_file_object:
    contents = origin_file_object.readlines()
    for line in contents:
        line = line.strip()
        out_str = ''
        word_list = jieba.cut(line, cut_all=False)
        for word in word_list:
            if word not in stop_words and word != "\t":
                out_str += word + ' '
        target_file.write(out_str.rstrip() + "\n")
target_file.close()
print("end")
```
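One small tweak worth noting: `word not in stop_words` scans the whole list for every token, which gets slow over a full novel. Storing the stop words in a `set` makes each lookup O(1). A minimal sketch of just the filtering step, using a toy pre-segmented line (the tokens and stop words here are made up for illustration):

```python
def filter_tokens(tokens, stop_words):
    """Drop stop words and stray tabs from a segmented line."""
    stop = set(stop_words)  # O(1) membership tests instead of a list scan
    return [w for w in tokens if w not in stop and w != "\t"]

# Toy example: pretend jieba already segmented this line.
tokens = ["张小凡", "和", "陆雪琪", "\t", "在", "青云山"]
print(" ".join(filter_tokens(tokens, ["和", "在"])))
# 张小凡 陆雪琪 青云山
```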
3. Training ~~~~~~~
I simplified the training. The original blog had a multi-threading part which I, being a beginner too, couldn't follow, so I rewrote it into something I could understand. The trained model is saved as 诛仙.model for later use. My simplified version is also based on this post: https://blog.csdn.net/weixin_37567451/article/details/81131608
```python
import logging
from gensim.models import word2vec

logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)

sentences = word2vec.Text8Corpus(r'C:\诛仙_cut_word.txt')
# Note: in gensim >= 4 the dimension parameter is vector_size; older versions call it size.
model = word2vec.Word2Vec(sentences, vector_size=100, hs=1, min_count=1, window=3)

print('----------------- divider ----------------------------')
try:
    # Similarity queries live on model.wv.
    sim1 = model.wv.similarity('张小凡', '陆雪琪')
    sim2 = model.wv.similarity('陆雪琪', '碧瑶')
except KeyError:
    sim1 = 0
    sim2 = 0
print('Similarity between 张小凡 and 陆雪琪:', sim1)
print('Similarity between 陆雪琪 and 碧瑶:', sim2)

print('----------------- divider ----------------------------')
print('The 3-character words closest to 张小凡:')
req_count = 5
for key in model.wv.similar_by_word('张小凡', topn=100):
    if len(key[0]) == 3:
        req_count -= 1
        print(key[0], key[1])
        if req_count == 0:
            break

print('----------------- divider ----------------------------')
try:
    sim3 = model.wv.most_similar('张小凡', topn=20)
    print('Words most related to 张小凡:\n')
    for key in sim3:
        print(key[0], key[1])
except KeyError:
    print(' error')

print('----------------- divider ----------------------------')
sim4 = model.wv.doesnt_match('张小凡 陆雪琪 碧瑶'.split())
print('The odd one out among these characters:', sim4)

print('----------------- divider ----------------------------')
model.save('诛仙.model')
print('end')
```
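Under the hood, `similarity` is just the cosine of the angle between the two word vectors. A minimal sketch of that computation with hand-made vectors (the numbers are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two toy "word vectors": parallel vectors score 1.0 (up to float error),
# orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

This is also why word2vec similarities land in [-1, 1]: they are cosines, not distances.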
4. Testing ~~~
```python
import warnings
import gensim

warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

# Load the saved model and query it.
model = gensim.models.Word2Vec.load("诛仙.model")
word = '张小凡'
result = model.wv.similar_by_word(word)
print("Words closest to " + word + ":")
for i in result:
    print(i)
```
~~~ Next time I want to try The Three-Body Problem (三体) ~~