Segmenting Chinese documents with jieba | Deduplicating stop words

Programming languages 2018-10-31 13:41:24

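Copyright notice: this blog is for reference only. If you have any comments, please leave them below. When reposting, please include the link. Thanks! https://blog.csdn.net/u010105243/article/details/53363416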

1. Segmenting Chinese documents with jieba

# -*- coding: utf-8 -*-
# @Time    : 17-8-4 9:26 AM
# @Author  : 未来战士biubiu！！
# @FileName: test.py
# @Software: PyCharm Community Edition
# @Blog    : http://blog.csdn.net/u010105243/article/
# Python3
import jieba


# jieba.load_userdict('userdict.txt')
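# Build the stop word collection; the file holds one word per line
def stopwordslist(filepath):
    # A set gives fast membership tests; the with block closes the file
    with open(filepath, 'r', encoding='utf-8') as f:
        stopwords = {line.strip() for line in f}
    return stopwords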


# Segment one sentence and drop the stop words
def seg_sentence(sentence, stopwords):
    sentence_seged = jieba.cut(sentence.strip())
    outstr = ''
    for word in sentence_seged:
        if word not in stopwords:
            if word != '\t':
                outstr += word
                outstr += " "
    return outstr


# Load the stop words once instead of re-reading the file for every line
stopwords = stopwordslist('./test/stopwords.txt')  # path to the stop word file
with open('./test/input.txt', 'r', encoding='utf-8') as inputs, \
        open('./test/output.txt', 'w', encoding='utf-8') as outputs:
    for line in inputs:
        line_seg = seg_sentence(line, stopwords)  # the return value is a string
        outputs.write(line_seg + '\n')
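
The stop word filter in seg_sentence does not depend on how the tokens were produced, so it can be tried out on its own. The following is a minimal sketch, assuming the tokens are already available as an iterable of strings (for example the generator returned by jieba.cut). The helper name filter_tokens and the sample tokens are made up for illustration and are not part of the original script.

```python
def filter_tokens(tokens, stopwords):
    """Join the tokens that are not stop words with single spaces."""
    kept = []
    for word in tokens:
        # Drop stop words and pure-whitespace tokens (tabs, spaces, newlines)
        if word in stopwords or not word.strip():
            continue
        kept.append(word)
    return ' '.join(kept)


# Hypothetical pre-segmented sentence and stop word set
tokens = ['我', '的', '猫', '\t', '喜欢', '吃', '鱼']
stopwords = {'的', '吃'}
result = filter_tokens(tokens, stopwords)
print(result)
```

Unlike the loop in seg_sentence, this version builds the line with join, so it leaves no trailing space at the end.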

2. Deduplicating the stop word list

Stop word lists collected from the web may contain duplicates. The code below removes them.

# The stop word file stores one word per line
# python3
def stopwd_reduction(infilepath, outfilepath):
    seen = set()
    # The with block closes both files, so the output is always flushed
    with open(infilepath, 'r', encoding='utf-8') as infile, \
            open(outfilepath, 'w', encoding='utf-8') as outfile:
        for line in infile:
            word = line.strip()
            # Skip blank lines and words that were already written
            if word and word not in seen:
                seen.add(word)
                outfile.write(word + '\n')


stopwd_reduction('./test/stopwords.txt', './test/stopword.txt')
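
When the list is small enough to hold in memory, the same deduplication can be written more compactly. This is only a sketch under that assumption: it relies on dict keys being unique and keeping insertion order (guaranteed since Python 3.7). The function name dedupe_words and the sample data are hypothetical.

```python
def dedupe_words(words):
    """Return the distinct non-empty words in first-seen order."""
    stripped = (w.strip() for w in words)
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(w for w in stripped if w))


# Hypothetical sample with a repeat, a blank entry, and stray spaces
sample = ['的', '了', '的', '', ' 了 ', '和']
unique = dedupe_words(sample)
print(unique)
```

Keeping the first-seen order means the output file stays easy to compare against the original list.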

3. Stop word list

I merged several Chinese stop word lists to suit my own needs. Anyone who needs the result can download it (download link).
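
How the merged list was actually produced is not described here. One plausible way to combine several one-word-per-line files into a single deduplicated list is sketched below; the function name merge_stopword_files, the file names, and the sample contents are all assumptions for illustration.

```python
import os
import tempfile


def merge_stopword_files(paths, outpath):
    """Merge one-word-per-line files into outpath, dropping duplicates."""
    seen = set()
    merged = []
    for path in paths:
        with open(path, 'r', encoding='utf-8') as f:
            for line in f:
                word = line.strip()
                # Keep only the first occurrence of each non-empty word
                if word and word not in seen:
                    seen.add(word)
                    merged.append(word)
    with open(outpath, 'w', encoding='utf-8') as out:
        out.write('\n'.join(merged) + '\n')
    return merged


# Demo with two small throwaway files (hypothetical contents)
tmpdir = tempfile.mkdtemp()
path_a = os.path.join(tmpdir, 'a.txt')
path_b = os.path.join(tmpdir, 'b.txt')
with open(path_a, 'w', encoding='utf-8') as f:
    f.write('的\n了\n')
with open(path_b, 'w', encoding='utf-8') as f:
    f.write('了\n和\n')
merged_path = os.path.join(tmpdir, 'merged.txt')
merged = merge_stopword_files([path_a, path_b], merged_path)
print(merged)
```

Words from earlier files come first in the output, so the order of paths decides the order of the merged list.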

Reposted from blog.csdn.net/u010105243/article/details/53363416
