Copyright notice: when reposting, please credit Juanlyjack https://blog.csdn.net/m0_38088359/article/details/82660603
from sklearn.feature_extraction.text import TfidfVectorizer
import string
import jieba

corpus = []
punc = "。”;,:—“”()《》"
# In Python 3 there is no need to decode punc to unicode; str is already Unicode.
trantab1 = str.maketrans({key: None for key in punc})                # custom Chinese punctuation
trantab2 = str.maketrans({key: None for key in string.punctuation})  # ASCII punctuation

with open('corpus.txt', encoding='utf8') as file:
    for line in file:
        text1 = " ".join(jieba.cut(line))  # segment the line into space-separated words
        text1 = text1.translate(trantab1)
        text1 = text1.translate(trantab2)
        text1 = text1.strip('\n')
        corpus.append(text1)

tfidf_model = TfidfVectorizer(smooth_idf=True, use_idf=True, max_features=800)
corpus_tfidf = tfidf_model.fit_transform(corpus)
corpus_tfidf.toarray()
The corpus here consists of 44 Chinese sentences of varying length. Custom punctuation is stripped first, using str.maketrans() together with the .translate() method. Note that besides removing the self-defined Chinese symbols, string.punctuation is applied in a second pass; it contains the ASCII symbols, and in Python 3 these are plain str values rather than separately decoded unicode.
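The two-pass stripping can be seen on a single toy sentence (the sentence below is made up for illustration; the translation tables are built exactly as in the script above):

```python
import string

# Table 1: the custom Chinese punctuation; table 2: ASCII punctuation.
punc = "。”;,:—“”()《》"
table_cn = str.maketrans({ch: None for ch in punc})
table_en = str.maketrans({ch: None for ch in string.punctuation})

# A segmented sentence mixing Chinese and ASCII punctuation.
sentence = "今天 天气 很 好 , 我们 去 公园 。 (Really!)"
cleaned = sentence.translate(table_cn).translate(table_en)
print(cleaned)  # all punctuation removed, words and spaces kept
```

Mapping a character to None in str.maketrans deletes it, so chaining the two .translate() calls removes both symbol sets without touching the words themselves.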
Then TfidfVectorizer from sklearn.feature_extraction.text turns every sentence into a TF-IDF vector. Setting max_features=800 keeps only the 800 most frequent terms, which bounds the dimensionality and keeps the vectors from becoming overly sparse; when it comes time to train a model, corpus_tfidf.todense() can be used to densify the matrix.
print(corpus_tfidf.toarray())
print(corpus_tfidf.toarray().shape)
Output:
[[0. 0.12044279 0. ... 0. 0.12044279 0. ]
[0. 0.26567307 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]
...
[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]]
(44, 800)
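The densification step mentioned above can be sketched on a tiny stand-in corpus (the three sentences below are invented; the real script reads 44 sentences from corpus.txt):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of pre-segmented (space-separated) sentences standing in
# for the jieba-segmented lines of corpus.txt.
toy_corpus = [
    "机器 学习 很 有趣",
    "深度 学习 是 机器 学习 的 分支",
    "今天 天气 很 好",
]

model = TfidfVectorizer(smooth_idf=True, use_idf=True)
tfidf_sparse = model.fit_transform(toy_corpus)

# fit_transform returns a scipy sparse matrix; .todense() converts it to a
# dense matrix that estimators unable to consume sparse input can train on.
dense = tfidf_sparse.todense()
print(dense.shape)  # (number of sentences, vocabulary size)
```

One caveat worth knowing: the default token_pattern of TfidfVectorizer only keeps tokens of two or more characters, so single-character Chinese words such as 很 or 好 are silently dropped unless a custom token_pattern or tokenizer is supplied.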