[Code Template] Loading a Custom Dictionary, Segmenting with Stop-Word Removal, POS Tagging, and Word-Frequency Counting

# Load a custom dictionary (it stays in effect until the program exits)
import jieba
jieba.load_userdict('dict_path(txt)')  # placeholder: path to a UTF-8 .txt dictionary file
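
For reference, a jieba user dictionary is a plain-text file with one entry per line: the word, then an optional frequency and an optional part-of-speech tag, separated by spaces. The sketch below only builds such a file; the helper name `format_userdict_line` and the sample entries are made up for illustration.

```python
import os
import tempfile

def format_userdict_line(word, freq=None, tag=None):
    """Build one dictionary line: word, then optional frequency and POS tag."""
    parts = [word]
    if freq is not None:
        parts.append(str(freq))
    if tag is not None:
        parts.append(tag)
    return " ".join(parts)

# Hypothetical entries, for illustration only: (word, frequency, POS tag)
entries = [("云计算", 5, "n"), ("李小福", 2, "nr"), ("创新办", None, None)]
lines = [format_userdict_line(w, f, t) for w, f, t in entries]

# Write the dictionary as UTF-8 text, one entry per line
fd, dict_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
```

The resulting `dict_path` can then be passed to `jieba.load_userdict`. To register a single word at runtime without a file, `jieba.add_word(word, freq, tag)` is an alternative.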

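# Stop-word removal, method 1: load a stop-word list with jieba.analyse, then extract keywords
# A single call removes stop words, segments the text, computes TF-IDF values, and returns the words ranked by importance (by default the 20 words with the highest TF-IDF scores)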
from jieba import analyse as ana
ana.set_stop_words('stopwords_path(txt)')
ana.extract_tags(txt)
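
To make the ranking step less of a black box, here is a simplified illustration of TF-IDF keyword ranking on text that is already segmented. It is not jieba's actual implementation: the function `rank_by_tfidf`, the IDF values, and the fallback `default_idf` are all assumptions chosen for the example.

```python
from collections import Counter

def rank_by_tfidf(tokens, idf, stop_words, default_idf, top_k=20):
    """Drop stop words, score each remaining word by tf * idf, return the top_k words."""
    kept = [t for t in tokens if t not in stop_words]
    if not kept:
        return []
    counts = Counter(kept)
    total = len(kept)
    # Term frequency is the share of kept tokens; unknown words fall back to default_idf
    scores = {w: (c / total) * idf.get(w, default_idf) for w, c in counts.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [w for w, _ in ranked[:top_k]]

# Hypothetical pre-segmented text and made-up IDF values
tokens = ["数据", "的", "分析", "数据", "的", "模型"]
idf = {"数据": 2.0, "分析": 5.0, "模型": 3.0}
keywords = rank_by_tfidf(tokens, idf, stop_words={"的"}, default_idf=4.0, top_k=2)
```

With jieba itself, the number of keywords and whether scores are returned are controlled by the `topK` and `withWeight` parameters of `extract_tags`.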

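# Stop-word removal, method 2: a function-based approach
# Build the stop-word list
with open('stopwords_path(txt)', 'r', encoding='utf-8') as f:
    data = f.read()
stop_list = data.split('\n')  # If the loaded file does not cover your needs, append extra stop words here, or add them to the stop-word file itself.
# Define a segmentation function that drops stop words
def cut_word_without_stopword(txt):
    return [w for w in jieba.lcut(txt) if w not in stop_list]
# Produce the final result in the required format (here a list of lists); origin_data is an iterable of text strings
final_data = [cut_word_without_stopword(x) for x in origin_data]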
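
A possible variation on method 2, sketched under a few assumptions: the stop words are kept in a set so membership tests stay fast on large lists, blank lines in the file are skipped, and whitespace-only tokens are dropped as well. The helper names `load_stopwords` and `remove_stopwords` are hypothetical, and the token list stands in for the output of `jieba.lcut`.

```python
import os
import tempfile

def load_stopwords(path):
    """Read one stop word per line into a set, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def remove_stopwords(tokens, stop_set):
    """Keep tokens that are neither whitespace-only nor in the stop-word set."""
    return [t for t in tokens if t.strip() and t not in stop_set]

# Demo with a temporary stop-word file (contents are made up)
fd, stop_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write("的\n了\n\n是\n")

stop_set = load_stopwords(stop_path)
# Hand-written tokens standing in for a jieba.lcut result
cleaned = remove_stopwords(["我", "的", " ", "模型", "是", "新", "的"], stop_set)
```

Extra stop words can still be added afterwards with `stop_set.add(...)`, which mirrors appending to `stop_list`.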

# Part-of-speech tagging
import jieba.posseg as psg
psg.lcut(txt)  # Segment txt and tag each word with its part of speech; the tag symbols can be looked up in a reference table.
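
Each item returned by `psg.lcut` carries a word and its tag (attributes `word` and `flag`) and can be unpacked as a pair. A common next step is keeping only certain parts of speech. The sketch below assumes that use case; `filter_by_pos` is a hypothetical helper, and the tagged list is written by hand as plain tuples rather than produced by jieba.

```python
def filter_by_pos(pairs, prefixes=("n",)):
    """Return the words whose POS tag starts with any of the given prefixes."""
    wanted = tuple(prefixes)
    return [word for word, flag in pairs if flag.startswith(wanted)]

# Hand-written (word, tag) tuples standing in for a psg.lcut result
tagged = [("我", "r"), ("爱", "v"), ("北京", "ns"), ("天安门", "ns")]

nouns = filter_by_pos(tagged)                 # tags beginning with "n"
verbs = filter_by_pos(tagged, prefixes=("v",))
```

Matching on a prefix rather than an exact tag means `"n"` also covers subtypes such as `ns` (place names) and `nr` (person names).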

# Word-frequency counting
import pandas as pd
# Segment the text first
wordlist = jieba.lcut(txt)
# Remove stop words from the segmentation result
with open('stopwords_path(txt)', 'r', encoding='utf-8') as f:
    data = f.read()
stop_list = data.split('\n')
final_data = [w for w in wordlist if w not in stop_list]  # flat list of words
# Count word frequencies with pandas: build a DataFrame with a single column named word, one row per word
word_frame = pd.DataFrame(final_data, columns=['word'])
word_count = word_frame.groupby('word').size()
word_freq = word_count.sort_values(ascending=False)
word_freq
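
If pandas is not needed elsewhere, the same count can be obtained from the standard library. This is an alternative to the `groupby` approach above, not a replacement for it; the token list is made up and stands in for the filtered segmentation result.

```python
from collections import Counter

# Hand-written tokens standing in for the stop-word-filtered result
tokens = ["数据", "分析", "数据", "模型", "分析", "数据"]

# most_common() returns (word, count) pairs sorted by count, highest first
word_freq_pairs = Counter(tokens).most_common()
top_word, top_count = word_freq_pairs[0]
```

`Counter(tokens).most_common(n)` limits the output to the `n` most frequent words, which is handy for a quick look at a long text.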

不停下脚步的乌龟
