I came across this blog post today and decided to try it out myself:
https://blog.csdn.net/HOMEGREAT/article/details/79720314
1. Fixing the encoding ~~~~~
I downloaded a txt copy of Zhu Xian (诛仙) from the web, but the script errored out right away.
I figured it was an encoding problem, but I couldn't tell which encoding it was, and nothing I tried worked. The text displayed fine in some editors, yet reading it as UTF-8 still failed. Then, at a loss, I noticed that EmEditor shows the encoding in its lower-right corner~~~ heh heh
Sure enough, the file was UTF-16LE. What a trap~~
After updating the code, it ran fine:
```python
# -*- coding: utf-8 -*-
# Re-encode the novel from UTF-16LE to UTF-8.
file_out = open('诛仙_utf8.txt', 'w', encoding="utf-8")
with open('诛仙.txt', 'r', encoding="utf-16LE") as file_object:
    for line in file_object:
        line = line.strip()
        file_out.write(line + "\n")
file_out.close()
print("end")
```
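Instead of eyeballing an editor's status bar, the byte-order mark (BOM) at the start of a file can often reveal the encoding. Here is a minimal sketch using only the standard library's `codecs` BOM constants; a file saved without a BOM will still need guessing:

```python
import codecs

# Map leading byte-order marks to encoding names.
# Order matters: the UTF-32 BOMs begin with the same bytes as UTF-16's.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(path):
    """Return the encoding implied by the file's BOM, or None."""
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM is 4 bytes
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None
```

A UTF-16LE file saved by a Windows editor usually starts with `FF FE`, which is what this catches.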
2. Word segmentation ~~~~
With the re-encoded txt in hand (EmEditor's lower-right corner now does say UTF-8), I grabbed a stop-word list from here: https://www.cnblogs.com/zhangtianyuan/p/6922677.html — just copy-paste it into a text file named stop_words.txt.
```python
import jieba

# Load the stop-word list, one word per line.
stop_words_file = "stop_words.txt"
stop_words = list()
with open(stop_words_file, 'r', encoding="utf-8") as stop_words_file_object:
    contents = stop_words_file_object.readlines()
    for line in contents:
        stop_words.append(line.strip())

# Segment the novel line by line with jieba, dropping stop words.
origin_file = "诛仙_utf8.txt"
target_file = open("诛仙_cut_word.txt", 'w', encoding="utf-8")
with open(origin_file, 'r', encoding="utf-8") as origin_file_object:
    contents = origin_file_object.readlines()
    for line in contents:
        line = line.strip()
        out_str = ''
        word_list = jieba.cut(line, cut_all=False)
        for word in word_list:
            if word not in stop_words and word != "\t":
                out_str += word + ' '
        target_file.write(out_str.rstrip() + "\n")
target_file.close()
print("end")
```
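One small tweak worth noting: `word not in stop_words` scans the whole list for every token, which gets slow over a full novel. Storing the stop words in a `set` makes each lookup O(1). A minimal sketch of just the filtering step, using a toy pre-segmented line (the tokens and stop words here are made up for illustration):

```python
def filter_tokens(tokens, stop_words):
    """Drop stop words and stray tabs from a segmented line."""
    stop = set(stop_words)  # O(1) membership tests instead of a list scan
    return [w for w in tokens if w not in stop and w != "\t"]

# Toy example: pretend jieba already segmented this line.
tokens = ["张小凡", "和", "陆雪琪", "\t", "在", "青云山"]
print(" ".join(filter_tokens(tokens, ["和", "在"])))
# 张小凡 陆雪琪 青云山
```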
3. Training ~~~~~~~
I simplified the training. The original blog had a multi-threading part which I, being a beginner too, couldn't follow, so I rewrote it into something I could understand. The trained model is saved as 诛仙.model for later use. My simplified version is also based on this post: https://blog.csdn.net/weixin_37567451/article/details/81131608
```python
import logging
from gensim.models import word2vec

logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)

sentences = word2vec.Text8Corpus(r'C:\诛仙_cut_word.txt')
# Note: in gensim >= 4 the dimension parameter is vector_size; older versions call it size.
model = word2vec.Word2Vec(sentences, vector_size=100, hs=1, min_count=1, window=3)

print('----------------- divider ----------------------------')
try:
    # Similarity queries live on model.wv.
    sim1 = model.wv.similarity('张小凡', '陆雪琪')
    sim2 = model.wv.similarity('陆雪琪', '碧瑶')
except KeyError:
    sim1 = 0
    sim2 = 0
print('Similarity between 张小凡 and 陆雪琪:', sim1)
print('Similarity between 陆雪琪 and 碧瑶:', sim2)

print('----------------- divider ----------------------------')
print('The 3-character words closest to 张小凡:')
req_count = 5
for key in model.wv.similar_by_word('张小凡', topn=100):
    if len(key[0]) == 3:
        req_count -= 1
        print(key[0], key[1])
        if req_count == 0:
            break

print('----------------- divider ----------------------------')
try:
    sim3 = model.wv.most_similar('张小凡', topn=20)
    print('Words most related to 张小凡:\n')
    for key in sim3:
        print(key[0], key[1])
except KeyError:
    print(' error')

print('----------------- divider ----------------------------')
sim4 = model.wv.doesnt_match('张小凡 陆雪琪 碧瑶'.split())
print('The odd one out among these characters:', sim4)

print('----------------- divider ----------------------------')
model.save('诛仙.model')
print('end')
```
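Under the hood, `similarity` is just the cosine of the angle between the two word vectors. A minimal sketch of that computation with hand-made vectors (the numbers are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two toy "word vectors": parallel vectors score 1.0 (up to float error),
# orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

This is also why word2vec similarities land in [-1, 1]: they are cosines, not distances.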
4. Testing ~~~
```python
import warnings
import gensim

warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

# Load the saved model and query it.
model = gensim.models.Word2Vec.load("诛仙.model")
word = '张小凡'
result = model.wv.similar_by_word(word)
print("Words closest to " + word + ":")
for i in result:
    print(i)
```
~~~ Next time I want to try The Three-Body Problem (三体) ~~