Python 3: segmenting text, stripping punctuation (custom-defined), and building TF-IDF word vectors

Copyright notice: when reposting, please credit Juanlyjack, https://blog.csdn.net/m0_38088359/article/details/82660603
import string

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

# Custom Chinese punctuation to strip. In Python 3 strings are already Unicode,
# so there is no need for punc.encode("utf8") as in Python 2.
punc = "。”;,:—“”()《》"

# Build the translation tables once: one for the custom Chinese punctuation,
# one for the ASCII punctuation in string.punctuation.
trantab1 = str.maketrans({key: None for key in punc})
trantab2 = str.maketrans({key: None for key in string.punctuation})

corpus = []
with open('corpus.txt', encoding='utf8') as file:
    for line in file:
        # Segment with jieba and join the tokens with spaces so the
        # vectorizer sees each token as a separate word.
        text1 = " ".join(jieba.cut(line))
        text1 = text1.translate(trantab1)
        text1 = text1.translate(trantab2)
        text1 = text1.strip('\n')
        corpus.append(text1)

# Keep only the 800 most frequent terms as features.
tfidf_model = TfidfVectorizer(smooth_idf=True, use_idf=True, max_features=800)
corpus_tfidf = tfidf_model.fit_transform(corpus)
corpus_tfidf.toarray()

The corpus used here consists of 44 Chinese sentences of varying length. The custom-defined punctuation is stripped out first, using str.maketrans() together with .translate(). Note that in addition to the custom symbols I also apply string.punctuation once, which covers the English (ASCII) punctuation; in Python 3 these are plain Unicode strings, so no extra decoding step is needed.
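As a rough illustration of this punctuation-stripping step, the sketch below runs it on a single made-up sentence (the sentence is hypothetical, not taken from corpus.txt, and the exact token split depends on jieba's dictionary and version):

import string

import jieba

punc = "。”;,:—“”()《》"
trantab1 = str.maketrans({key: None for key in punc})
trantab2 = str.maketrans({key: None for key in string.punctuation})

# Hypothetical sample sentence, not taken from corpus.txt.
sample = "今天天气很好,我们去公园散步。(周末)"
segmented = " ".join(jieba.cut(sample))
cleaned = segmented.translate(trantab1).translate(trantab2)

print(segmented)  # space-separated tokens, punctuation still present
print(cleaned)    # same tokens with custom and ASCII punctuation removed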
Then TfidfVectorizer from sklearn.feature_extraction.text turns all the sentences into TF-IDF vectors. Here I set max_features=800, which keeps only the 800 most frequent terms, so the vectors do not become too large and sparse; of course, when the vectors are eventually used to train a model, corpus_tfidf.todense() is still needed to densify the sparse matrix.

print(corpus_tfidf.toarray())
print(corpus_tfidf.toarray().shape)

The output is as follows:

[[0.         0.12044279 0.         ... 0.         0.12044279 0.        ]
 [0.         0.26567307 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
(44, 800)
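To see which word each of the 800 columns corresponds to, and to get the dense matrix mentioned above for model training, something along these lines can be used (a minimal sketch that continues the tfidf_model and corpus_tfidf variables from the script above; get_feature_names_out() is the spelling in newer scikit-learn releases, older versions expose get_feature_names() instead):

# Map column indices back to vocabulary terms. Newer scikit-learn releases use
# get_feature_names_out(); older ones use get_feature_names().
feature_names = tfidf_model.get_feature_names_out()
print(feature_names[:10])  # first 10 of the 800 retained terms

# Densify the sparse TF-IDF matrix before passing it to a model that
# requires dense input.
dense_matrix = corpus_tfidf.toarray()  # corpus_tfidf.todense() also works
print(dense_matrix.shape)  # (44, 800)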
