Python Natural Language Processing (1): Chinese Word Segmentation Preprocessing and Word Frequency Counting

A small experiment.

Data source

The dataset consists of 200 Chinese reviews of mobile phones, stored in XML format.

Segmentation tool

python-jieba

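Preprocessing

Preprocessing removes stop words, punctuation, and digits.

Removing stop words: this uses a stop word list compiled by someone else.

Removing punctuation and digits: this is done with a regular expression. The original plan was to copy the Chinese punctuation marks from the web and use string.punctuation for the English ones, but some symbols need escaping inside a regular expression, which is a bit of a hassle, so the characters are simply written out by hand as a string.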

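    import re

    # Load the stop word list once instead of on every call
    # (assumes stopwords.txt is UTF-8 encoded, one word per line).
    with open('stopwords.txt', encoding='utf-8') as f:
        stopwords = {line.rstrip() for line in f}

    # Characters to strip: whitespace, ASCII and Chinese punctuation, digits.
    punctuation = r"""\s.!/_,$%^*(+"'{}=\-【】《》、；—！，。：？“~@#￥…&（）"""
    number = "0-9"
    pattern = re.compile("[%s%s]+" % (punctuation, number))

    def filterWord(word):
        # Stop words become "", other words lose punctuation and digits.
        if word in stopwords:
            return ""
        return pattern.sub("", word)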
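The original plan mentioned above, string.punctuation plus a copied list of Chinese punctuation, can be made to work with re.escape, which takes care of the escaping. The following is only a minimal sketch: CHINESE_PUNCT is a hypothetical, non-exhaustive sample, and strip_punct_and_digits is an illustrative name, not part of the original code.

```python
import re
import string

# Hypothetical, non-exhaustive sample of Chinese punctuation marks.
CHINESE_PUNCT = "，。！？；：、“”‘’（）《》【】…￥"

# re.escape backslash-escapes the characters that are special in a regex,
# so the combined string is safe to place inside a character class.
_PATTERN = re.compile("[%s\\s0-9]+" % re.escape(string.punctuation + CHINESE_PUNCT))

def strip_punct_and_digits(word):
    """Remove punctuation, whitespace, and ASCII digits from a token."""
    return _PATTERN.sub("", word)

print(strip_punct_and_digits("手机，3000元！"))
```

This avoids escaping each symbol by hand, at the cost of still having to maintain the Chinese punctuation list yourself.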

Parsing the XML file with xml.etree

The data format is as follows (the sample sentence reads "This is an example"):

    <Reviews>
        <Review>
            <Sentences>
                <Sentence>
                    <text> 这是一个例子 </text>
                </Sentence>
                ...
            </Sentences>
        </Review>
        ...
    </Reviews>

Traversal:

    import xml.etree.ElementTree as ET   # cElementTree was removed in Python 3.9
    import jieba

    def loadFile(source_file):
        print("loading file:" + source_file)
        tree = ET.parse(source_file)             # parse the file
        root = tree.getroot()                    # root node of the ElementTree
        reviews = root.findall('Review')         # all children of the root tagged 'Review'
        for review in reviews:
            sentences = review[0]                # first child of the Review node (getchildren() was removed in Python 3.9)
            for sentence in sentences:
                text = sentence[0].text          # text content of the first child of the Sentence node
                word_list = jieba.lcut(text)     # segment the text with jieba
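As an alternative to indexing children by position, the same traversal can look elements up by tag name, which keeps working if other child nodes are added later. This is a sketch only: the inline SAMPLE string is made-up data in the same shape as the file described above, and extract_texts is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Inline sample in the same shape as the data file (contents are made up).
SAMPLE = """
<Reviews>
    <Review>
        <Sentences>
            <Sentence><text> 这是一个例子 </text></Sentence>
            <Sentence><text> 屏幕很好 </text></Sentence>
        </Sentences>
    </Review>
</Reviews>
"""

def extract_texts(root):
    """Collect the stripped text of every <text> element under root."""
    texts = []
    for review in root.findall("Review"):
        # The path is relative to each Review node.
        for text_node in review.findall("Sentences/Sentence/text"):
            if text_node.text:
                texts.append(text_node.text.strip())
    return texts

root = ET.fromstring(SAMPLE)
print(extract_texts(root))
```

Each returned string could then be passed to jieba.lcut in the same way as in loadFile.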

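Each word in the segmentation result is preprocessed with filterWord, and the preprocessed word is then added to the word frequency dictionary (keys are words, values are their frequencies).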
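The update step described above might look like the following sketch. The names update_word_dict and demo_filter are hypothetical; demo_filter is only a stand-in for filterWord so that the example runs on its own. Skipping words that filter down to an empty string is an assumption, since the original does not say how they are handled.

```python
def update_word_dict(word_dict, words, filter_func):
    """Filter each word and increment its count in word_dict."""
    for word in words:
        cleaned = filter_func(word)
        if not cleaned:
            # Assumption: drop stop words and tokens that were
            # nothing but punctuation or digits.
            continue
        word_dict[cleaned] = word_dict.get(cleaned, 0) + 1
    return word_dict

# Hypothetical stand-in for filterWord so the sketch is self-contained.
def demo_filter(word):
    return "" if word in {"的", "，"} else word

word_dict = {}
update_word_dict(word_dict, ["手机", "的", "屏幕", "，", "手机"], demo_filter)
print(word_dict)
```

In the real script, the word list would come from jieba.lcut and the filter would be filterWord.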

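Outputting the word frequency list

The list is output in descending order of frequency. The frequency is the value part of the dict, so the items can be sorted with the sorted function.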

    result = sorted(word_dict.items(), key=lambda d: d[1], reverse=True)

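The ordering is controlled by the key parameter of sorted. The lambda expression here can be read as an anonymous function: the part to the left of the colon is the parameter list, and the part to the right is the return value. sorted sorts in ascending order by default, so reverse=True is needed to get descending order.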
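To make the role of key and reverse concrete, here is a small sketch with a made-up three-word dictionary. The named function by_count is hypothetical and is shown only to illustrate that it behaves the same as the lambda.

```python
# Made-up counts for illustration only.
word_dict = {"手机": 5, "屏幕": 3, "电池": 8}

# Default behaviour: ascending by count.
ascending = sorted(word_dict.items(), key=lambda d: d[1])

# reverse=True flips the order to descending.
descending = sorted(word_dict.items(), key=lambda d: d[1], reverse=True)

# A named function equivalent to the lambda above.
def by_count(pair):
    return pair[1]

same_as_lambda = sorted(word_dict.items(), key=by_count, reverse=True)

for word, count in descending:
    print(word, count)
```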
