A Keyword-Based Text Ranking and Retrieval System

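Contents

I. Problem Description
II. Requirements Analysis
III. Implementing the TF-IDF Model

(1) Approach
(2) Code Implementation

(2.1) Computing TF
(2.2) Computing IDF
(2.3) Computing TF-IDF

IV. The Main Program
V. Other Functions

(1) Corpus Loading Function
(2) Corpus Processing Function

(2.1) Tokenization and Stop-Word Handling
(2.2) Main Data-Processing Logic
(2.3) Complete Code of dealDataSet()

(3) Result Export Function

VI. Source Code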

I. Problem Description

[Figure: problem statement]

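II. Requirements Analysis

The first step is to understand what the TF-IDF model is. TF-IDF is a weighting algorithm: the weight measures how important (how relevant) a keyword is to a particular document, so for a given keyword the documents can be ranked by their tf-idf values. At first, for convenience, I planned to use functions from Python's nltk library for tokenization and weight computation. However, downloading one of nltk's resources failed, so I implemented TF-IDF myself. Stop words still had to be handled, so in the end I did use the stopwords corpus from nltk.corpus.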

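III. Implementing the TF-IDF Model

(1) Approach

My first implementation computed the tf-idf value of every word in the corpus for every document. That is harmless for a small corpus, but real corpora are large, and the values for words the user never searches for are wasted work that greatly increases both running time and memory use. After some thought I changed the approach. In the final design the user enters a search word, and the program computes that word's tf value in each document and its idf value over the corpus, then multiplies them to get the tf-idf values. Only the queried word is ever scored, which keeps retrieval efficient.

(2) Code Implementation

The implementation consists of three functions: one computes tf, one computes idf, and a third calls the first two to compute tf-idf. All three take the same parameters: in_word is the word to search for, and words_num_dic is a dictionary that maps every document in the corpus to its word counts, with the structure {txt1: {word1: num1, word2: num2}, txt2: {word1: num3, word3: num4}, …}.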
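Written out as formulas, the three quantities computed by these functions look roughly as follows. The notation is introduced here for illustration and does not come from the original assignment: n_{t,d} is the number of times term t occurs in document d, N is the number of documents in the corpus, and df(t) is the number of documents that contain t.

```latex
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{w} n_{w,d}}, \qquad
\mathrm{idf}(t) = \log_{10}\frac{N}{\mathrm{df}(t) + 1}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```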

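(2.1) Computing TF

To compute tf, first compute the total number of words in each document, then look up how many times the search word occurs in each document, and divide the latter by the former. The quotient is the word's TF in that document. The function returns a dictionary of the search word's tf values, with one entry for each document that contains the word.

def computeTF(in_word, words_num_dic):
    """
    Compute the TF of in_word in each document.

    :param in_word: the word
    :param words_num_dic: per-document word counts {txt1: {word1: num1, word2: num2}, ...}
    :return: tfDict: tf values of in_word, {filename1: tf1, filename2: tf2, ...};
             documents that do not contain in_word are omitted
    """
    allcount_dic = {}   # total word count of each document
    tfDict = {}     # tf dictionary for in_word
    # Compute the total word count of each document
    for filename, num in words_num_dic.items():
        count = 0
        for value in num.values():
            count += value
        allcount_dic[filename] = count
    # Compute tf
    for filename, num in words_num_dic.items():
        if in_word in num.keys():
            tfDict[filename] = num[in_word] / allcount_dic[filename]
    return tfDict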
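To make the arithmetic concrete, here is a small worked example on a made-up two-document corpus. The function compute_tf is a condensed, hypothetical restatement of the TF step (not the listing above verbatim), and the file names and counts are invented for illustration.

```python
def compute_tf(word, counts_by_doc):
    """Return {doc: count(word) / total words in doc} for docs containing word."""
    tf = {}
    for doc, counts in counts_by_doc.items():
        if word in counts:
            tf[doc] = counts[word] / sum(counts.values())
    return tf


# Hypothetical corpus: a.txt has 4 words in total, b.txt has 4 words in total.
toy_counts = {
    "a.txt": {"cat": 2, "dog": 2},
    "b.txt": {"dog": 1, "fish": 3},
}

tf_cat = compute_tf("cat", toy_counts)
tf_dog = compute_tf("dog", toy_counts)
print(tf_cat)  # {'a.txt': 0.5}
print(tf_dog)  # {'a.txt': 0.5, 'b.txt': 0.25}
```

Note that b.txt has no entry for "cat": documents that do not contain the word are left out rather than stored with a tf of 0.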

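(2.2) Computing IDF

First count the total number of documents, then count the documents that contain the search word. Divide the former by the latter plus 1 (the 1 is added so that the denominator can never be 0), take the base-10 logarithm, and return the result. A word's IDF depends only on the corpus as a whole; in other words, for a fixed corpus a word's IDF is fixed, so the function returns a single number. Note that with this smoothing the IDF is 0 when the word occurs in all documents but one, and negative when it occurs in every document.

import math

def computeIDF(in_word, words_num_dic):
    """
    Compute the idf value of in_word.

    :param in_word: the word
    :param words_num_dic: per-document word counts {txt1: {word1: num1, word2: num2}, ...}
    :return: idf value of in_word over the whole corpus
    """
    docu_count = len(words_num_dic)     # total number of documents
    count = 0   # number of documents that contain in_word
    for num in words_num_dic.values():
        if in_word in num.keys():
            count += 1
    return math.log10((docu_count) / (count + 1))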
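The sign behaviour of this smoothed IDF is easy to check numerically. The sketch below uses a hypothetical helper compute_idf with the same formula, log10(N / (df + 1)), on an invented four-document corpus in which "fish" is rare, "cat" appears in three documents, and "dog" appears in all four.

```python
import math


def compute_idf(word, counts_by_doc):
    """Smoothed IDF: log10(number of docs / (docs containing word + 1))."""
    n_docs = len(counts_by_doc)
    df = sum(1 for counts in counts_by_doc.values() if word in counts)
    return math.log10(n_docs / (df + 1))


# Hypothetical corpus of four documents.
toy_counts = {
    "a.txt": {"cat": 2, "dog": 2},
    "b.txt": {"dog": 1, "fish": 3},
    "c.txt": {"dog": 4, "cat": 1},
    "d.txt": {"dog": 2, "cat": 5},
}

idf_fish = compute_idf("fish", toy_counts)  # log10(4 / 2): positive
idf_cat = compute_idf("cat", toy_counts)    # log10(4 / 4): exactly zero
idf_dog = compute_idf("dog", toy_counts)    # log10(4 / 5): negative
print(idf_fish, idf_cat, idf_dog)
```

A zero or negative IDF matters for ranking: with idf = 0 every document ties at score 0, and with idf below 0 a descending sort puts the documents with the lowest tf first. Whether that is acceptable depends on the corpus; a different smoothing would avoid it, but that is a design decision outside what the original code does.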

(2.3) Computing TF-IDF

This function calls the previous two, computes the word's tf-idf value in each document that contains it, and returns the values as a dictionary.

def computeTFIDF(in_word, words_num_dic):
    """
    Compute the tf-idf value of in_word in each document.

    :param in_word: the word
    :param words_num_dic: per-document word counts {txt1: {word1: num1, word2: num2}, ...}
    :return: tfidf_dic: tf-idf values of in_word, {filename1: tfidf1, filename2: tfidf2, ...}
    """
    tfidf_dic = {}
    idf = computeIDF(in_word, words_num_dic)
    tf_dic = computeTF(in_word, words_num_dic)
    for filename, tf in tf_dic.items():
        tfidf_dic[filename] = tf * idf
    return tfidf_dic
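As a rough end-to-end illustration, the sketch below folds the three steps into one hypothetical function, tfidf_ranking, and sorts the result in descending order. The corpus is invented; it is chosen so that "cat" appears in two of four documents and therefore has a positive IDF.

```python
import math


def tfidf_ranking(word, counts_by_doc):
    """Return [(doc, tf-idf), ...] for docs containing word, best match first."""
    n_docs = len(counts_by_doc)
    df = sum(1 for counts in counts_by_doc.values() if word in counts)
    idf = math.log10(n_docs / (df + 1))
    scores = {
        doc: counts[word] / sum(counts.values()) * idf
        for doc, counts in counts_by_doc.items()
        if word in counts
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical corpus: "cat" makes up 75% of a.txt and 25% of b.txt.
toy_counts = {
    "a.txt": {"cat": 3, "dog": 1},
    "b.txt": {"cat": 1, "dog": 3},
    "c.txt": {"fish": 2},
    "d.txt": {"bird": 5},
}

ranking = tfidf_ranking("cat", toy_counts)
for doc, score in ranking:
    print(doc, round(score, 4))
```

Because the IDF factor is the same for every document, the order for a single keyword is decided entirely by tf as long as the IDF is positive.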

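IV. The Main Program

The program first loads the contents of every text in the corpus and processes the loaded data (tokenization, stop-word removal, and so on), producing a dictionary of per-document word counts and the overall vocabulary of the corpus. Once the corpus is processed, the user enters one or more keywords, and the input is split into a list of keywords. For each keyword found in the vocabulary, the program computes its tf-idf value in each document, prints the five highest-scoring documents in descending tf-idf order, and appends them to result1.txt. If none of the entered keywords is in the vocabulary, it prints "No search results". After each round the program asks whether to continue. If the user continues, the steps repeat and the output goes to result2.txt, and so on; otherwise the program exits.

if __name__ == '__main__':
    # Load the corpus
    print("\tDefault corpus path: D:/study/B4/data")
    print("\tSearch results path: D:/study/B4/result")
    path = "D:/study/B4/data"   # corpus path
    all_docu_dic = loadDataSet(path)  # load the corpus into the program
    words_set, words_num_dic = dealDataSet(all_docu_dic)    # returns 1. the vocabulary (stop words removed), 2. per-document word counts
    n = 0   # number of searches so far
    a = ''  # loop-control variable; entering '0' ends the program
    while a != '0':     # input() returns a str, so compare with '0' rather than the int 0
        in_words = input("Search: ")
        input_list = re.split(r"[!?\s'.),(+\-=。:]", in_words)   # "\-" keeps the hyphen literal
        k = 0  # number of valid keywords in this input
        n += 1
        for word in input_list:
            if word in words_set:
                k += 1
                tfidf_dic = computeTFIDF(word, words_num_dic)  # unsorted tf-idf dictionary for the word
                top5 = sortOut(tfidf_dic)[0:5]  # the five most relevant documents
                # Console output
                print("Keyword: " + word)
                print(top5)
                # File output
                text_save("result" + str(n) + ".txt", top5, word)  # append the ranked results to the file
        if k == 0:
            print("No search results")
        a = input("Enter anything to search again, or '0' to quit: ")
        print("-------------------------------------")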
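The query-handling part of this loop can be pulled out into small functions that are easy to test without typing at a prompt. The sketch below is one possible arrangement: parse_query and should_continue are hypothetical names, and the delimiter pattern is an assumption modelled on the one used for the corpus.

```python
import re

# Assumed delimiter set: punctuation, whitespace, and a literal hyphen.
DELIMITERS = r"[!?\s'.),(+\-=。:]"


def parse_query(raw, vocabulary):
    """Split the raw input and keep the tokens found in the vocabulary, in order."""
    return [token for token in re.split(DELIMITERS, raw) if token in vocabulary]


def should_continue(answer):
    """input() returns a string, so the exit check compares against the text '0'."""
    return answer.strip() != "0"


vocab = {"cat", "dog"}  # hypothetical vocabulary
hits = parse_query("cat, bird dog!", vocab)
misses = parse_query("bird", vocab)
print(hits)    # ['cat', 'dog']
print(misses)  # []
```

An empty list from parse_query corresponds to the "No search results" branch, and should_continue makes the string comparison explicit.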

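V. Other Functions

(1) Corpus Loading Function

This part reads the corpus location, loads the texts into the program dynamically, and returns a corpus dictionary all_docu_dic (structure: {name1: content1, name2: content2, …}).

def loadDataSet(path):
    """
    Read the texts in the corpus and return them as a dictionary.

    :param path: corpus directory
    :return: corpus dictionary {name1: content1, name2: content2, ...}
    """
    # Load every text in the directory into the program
    files = os.listdir(path)  # names of all entries in the directory
    all_docu_dic = {}  # maps document name to document content
    for file in files:  # iterate over the directory
        full_path = os.path.join(path, file)  # isdir() needs the full path, not the bare name
        if not os.path.isdir(full_path):  # open only entries that are not directories
            with open(full_path, encoding='UTF-8-sig') as f:  # the file is closed automatically
                strr = f.read()  # read the whole text
            all_docu_dic[file] = strr.strip('.')   # strip '.' characters from both ends
    print("Corpus:")
    print(all_docu_dic)
    return all_docu_dic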
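For comparison, the same loading step can be sketched with pathlib. This is an alternative arrangement under stated assumptions, not the author's code: load_corpus is a hypothetical name, and the demonstration writes to a temporary directory instead of the D:/study/B4/data path used in the article.

```python
import tempfile
from pathlib import Path


def load_corpus(folder):
    """Return {file name: text} for every regular file directly inside folder."""
    corpus = {}
    for entry in sorted(Path(folder).iterdir()):
        if entry.is_file():  # subdirectories are skipped
            # utf-8-sig drops a leading byte-order mark if the file has one
            corpus[entry.name] = entry.read_text(encoding="utf-8-sig")
    return corpus


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("cats chase mice.", encoding="utf-8")
    Path(tmp, "sub").mkdir()  # a directory that must be ignored
    demo = load_corpus(tmp)

print(demo)  # {'a.txt': 'cats chase mice.'}
```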

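(2) Corpus Processing Function

The loading function loadDataSet() from the previous step gives us the corpus as a dictionary. The next step is to process that data with dealDataSet().

(2.1) Tokenization and Stop-Word Handling

(2.1.1) Tokenization
Sentences are tokenized with a custom delimiter set, using the split function from the re module. The hyphen is escaped so that it is a literal delimiter rather than the range from + to =, which would also split on digits, and \s covers both spaces and line breaks.

cut = re.split(r"[!?\s'.),(+\-=。:]", content)  # tokenize

(2.1.2) Stop words
Stop-word handling uses the stopwords corpus from the nltk.corpus module. The commented-out lines show how the stop-word list can be extended.

stop_words = stopwords.words('english')    # original stop-word list
# # Extending the stop-word list
# print(len(stop_words))
# extra_words = [' ']  # additional stop words
# stop_words.extend(extra_words)  # final stop-word list
# print(len(stop_words))
new_cut = [w for w in cut if w not in stop_words if w]  # remove stop words and the empty strings produced by split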
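The two steps can be tried out without downloading any NLTK data by substituting a tiny hand-written stop list. In the sketch below, STOP_WORDS is a hypothetical stand-in for the NLTK list, tokenize is a hypothetical helper, and the lowercase option is an addition for illustration that the original code does not have.

```python
import re

STOP_WORDS = {"the", "is", "on", "a"}   # hypothetical stand-in for NLTK's list
DELIMITERS = r"[!?\s'.),(+\-=。:]"       # assumed delimiter set with a literal hyphen


def tokenize(text, lowercase=False):
    """Split on the delimiters, then drop empty strings and stop words."""
    if lowercase:
        text = text.lower()
    return [w for w in re.split(DELIMITERS, text) if w and w not in STOP_WORDS]


tokens = tokenize("The cat is on the mat.")
tokens_lower = tokenize("The cat is on the mat.", lowercase=True)
print(tokens)        # ['The', 'cat', 'mat']
print(tokens_lower)  # ['cat', 'mat']
```

The first result shows a side effect of a case-sensitive membership test: a capitalised "The" is not matched by a lowercase stop list. Lowercasing both the corpus and the query is one way to handle that, at the cost of merging words that differ only in case.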

(2.2) Main Data-Processing Logic

This part takes the tokenized corpus and produces the vocabulary of the corpus, all_words_set (structure: {word1, word2, …}), and the per-document word-count dictionary, words_num_dic (structure: {txt1: {word1: num1, word2: num2}, …}).

    # Build the tokenized documents and the overall vocabulary
    for filename, content in all_docu_dic.items():
        cut = re.split(r"[!?\s'.),(+\-=。:]", content)  # tokenize ("\-" is a literal hyphen)
        new_cut = [w for w in cut if w not in stop_words if w]  # remove stop words and the empty strings produced by split
        all_docu_cut[filename] = new_cut  # key: document name, value: list of tokens
        all_words.extend(new_cut)
    all_words_set = set(all_words)  # convert to a set

    # Count the words in each document
    words_num_dic = {}
    for filename, cut in all_docu_cut.items():
        words_num_dic[filename] = dict.fromkeys(all_docu_cut[filename], 0)
        for word in cut:
            words_num_dic[filename][word] += 1
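The counting step can also be expressed with collections.Counter from the standard library. The sketch below is an equivalent formulation under the assumption that the documents are already tokenized; count_words and the sample tokens are hypothetical.

```python
from collections import Counter


def count_words(tokens_by_doc):
    """Return (vocabulary set, {doc: {word: count}}) from tokenized documents."""
    vocabulary = set()
    counts_by_doc = {}
    for doc, tokens in tokens_by_doc.items():
        counts_by_doc[doc] = dict(Counter(tokens))  # word -> number of occurrences
        vocabulary.update(tokens)
    return vocabulary, counts_by_doc


# Hypothetical tokenized corpus.
vocab, counts = count_words({
    "a.txt": ["cat", "dog", "cat"],
    "b.txt": ["dog"],
})
print(sorted(vocab))  # ['cat', 'dog']
print(counts)         # {'a.txt': {'cat': 2, 'dog': 1}, 'b.txt': {'dog': 1}}
```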

(2.3) Complete Code of dealDataSet()

def dealDataSet(all_docu_dic):
    """
    Process the corpus dictionary.

    :param all_docu_dic: corpus dictionary {name1: content1, name2: content2, ...}
    :return: 1. all_words_set: vocabulary of the corpus {word1, word2, ...}
             2. words_num_dic: per-document word counts {txt1: {word1: num1, word2: num2}, ...}
    """
    all_words = []
    all_docu_cut = {}  # tokenized documents (a dict of lists)

    stop_words = stopwords.words('english')    # original stop-word list
    # # Extending the stop-word list
    # print(len(stop_words))
    # extra_words = [' ']  # additional stop words
    # stop_words.extend(extra_words)  # final stop-word list
    # print(len(stop_words))

    # Build the tokenized documents and the overall vocabulary
    for filename, content in all_docu_dic.items():
        cut = re.split(r"[!?\s'.),(+\-=。:]", content)  # tokenize ("\-" is a literal hyphen)
        new_cut = [w for w in cut if w not in stop_words if w]  # remove stop words and the empty strings produced by split
        all_docu_cut[filename] = new_cut  # key: document name, value: list of tokens
        all_words.extend(new_cut)
    all_words_set = set(all_words)  # convert to a set

    # Count the words in each document
    words_num_dic = {}
    for filename, cut in all_docu_cut.items():
        words_num_dic[filename] = dict.fromkeys(all_docu_cut[filename], 0)
        for word in cut:
            words_num_dic[filename][word] += 1
    # print("Vocabulary:")
    # print(all_words_set)
    print("Tokenized corpus:")
    print(all_docu_cut)
    return all_words_set, words_num_dic     # return the vocabulary and the per-document word counts

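(3) Result Export Function

A dictionary cannot be written to a text file directly, so it needs some extra processing first: any value that is not a string must be converted with str() before it can be written to the file.

def text_save(filename, data, word):
    """
    Append the ranked results for the search word to the file filename.

    :param filename: name of the output file
    :param data: list of (document name, tf-idf) tuples, as produced by sortOut()
    :param word: the keyword
    """
    with open("D:/study/B4/" + filename, 'a') as fp:  # append mode; the file is closed automatically
        fp.write("Keyword: " + str(word) + '\n')
        for line in data:
            for a in line:
                s = str(a)  # convert non-string values before writing
                fp.write('\t' + s)
                fp.write('\t')
            fp.write('\n')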
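A quick way to see what such an export produces is to write to a temporary file and read it back. The sketch below is a variant for illustration only: save_ranking is a hypothetical name, the six-decimal formatting is an assumption rather than the author's output format, and the scores are invented.

```python
import os
import tempfile


def save_ranking(path, keyword, ranking):
    """Append a keyword header and one tab-separated line per (doc, score) pair."""
    with open(path, "a", encoding="utf-8") as fp:
        fp.write(f"Keyword: {keyword}\n")
        for doc, score in ranking:
            fp.write(f"\t{doc}\t{score:.6f}\n")


with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "result1.txt")
    save_ranking(out, "cat", [("a.txt", 0.09371), ("b.txt", 0.03124)])
    with open(out, encoding="utf-8") as fp:
        written = fp.read()

print(written)
```

Opening the file in append mode means that several keywords from the same search round end up in the same result file, one block after another.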

VI. Source Code

import math
import os
import re
from nltk.corpus import stopwords

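def loadDataSet(path):
    """
    Read the texts in the corpus and return them as a dictionary.

    :param path: corpus directory
    :return: corpus dictionary {name1: content1, name2: content2, ...}
    """
    # Load every text in the directory into the program
    files = os.listdir(path)  # names of all entries in the directory
    all_docu_dic = {}  # maps document name to document content
    for file in files:  # iterate over the directory
        full_path = os.path.join(path, file)  # isdir() needs the full path, not the bare name
        if not os.path.isdir(full_path):  # open only entries that are not directories
            with open(full_path, encoding='UTF-8-sig') as f:  # the file is closed automatically
                strr = f.read()  # read the whole text
            all_docu_dic[file] = strr.strip('.')   # strip '.' characters from both ends
    print("Corpus:")
    print(all_docu_dic)
    return all_docu_dic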

def dealDataSet(all_docu_dic):
    """
    Process the corpus dictionary.

    :param all_docu_dic: corpus dictionary {name1: content1, name2: content2, ...}
    :return: 1. all_words_set: vocabulary of the corpus {word1, word2, ...}
             2. words_num_dic: per-document word counts {txt1: {word1: num1, word2: num2}, ...}
    """
    all_words = []
    all_docu_cut = {}  # tokenized documents (a dict of lists)

    stop_words = stopwords.words('english')    # original stop-word list
    # # Extending the stop-word list
    # print(len(stop_words))
    # extra_words = [' ']  # additional stop words
    # stop_words.extend(extra_words)  # final stop-word list
    # print(len(stop_words))

    # Build the tokenized documents and the overall vocabulary
    for filename, content in all_docu_dic.items():
        cut = re.split(r"[!?\s'.),(+\-=。:]", content)  # tokenize ("\-" is a literal hyphen)
        new_cut = [w for w in cut if w not in stop_words if w]  # remove stop words and the empty strings produced by split
        all_docu_cut[filename] = new_cut  # key: document name, value: list of tokens
        all_words.extend(new_cut)
    all_words_set = set(all_words)  # convert to a set

    # Count the words in each document
    words_num_dic = {}
    for filename, cut in all_docu_cut.items():
        words_num_dic[filename] = dict.fromkeys(all_docu_cut[filename], 0)
        for word in cut:
            words_num_dic[filename][word] += 1
    # print("Vocabulary:")
    # print(all_words_set)
    print("Tokenized corpus:")
    print(all_docu_cut)
    return all_words_set, words_num_dic     # return the vocabulary and the per-document word counts

def computeTF(in_word, words_num_dic):
    """
    Compute the TF of in_word in each document.

    :param in_word: the word
    :param words_num_dic: per-document word counts {txt1: {word1: num1, word2: num2}, ...}
    :return: tfDict: tf values of in_word, {filename1: tf1, filename2: tf2, ...};
             documents that do not contain in_word are omitted
    """
    allcount_dic = {}   # total word count of each document
    tfDict = {}     # tf dictionary for in_word
    # Compute the total word count of each document
    for filename, num in words_num_dic.items():
        count = 0
        for value in num.values():
            count += value
        allcount_dic[filename] = count
    # Compute tf
    for filename, num in words_num_dic.items():
        if in_word in num.keys():
            tfDict[filename] = num[in_word] / allcount_dic[filename]
    return tfDict

def computeIDF(in_word, words_num_dic):
    """
    Compute the idf value of in_word.

    :param in_word: the word
    :param words_num_dic: per-document word counts {txt1: {word1: num1, word2: num2}, ...}
    :return: idf value of in_word over the whole corpus
    """
    docu_count = len(words_num_dic)     # total number of documents
    count = 0   # number of documents that contain in_word
    for num in words_num_dic.values():
        if in_word in num.keys():
            count += 1
    return math.log10((docu_count) / (count + 1))

def computeTFIDF(in_word, words_num_dic):
    """
    Compute the tf-idf value of in_word in each document.

    :param in_word: the word
    :param words_num_dic: per-document word counts {txt1: {word1: num1, word2: num2}, ...}
    :return: tfidf_dic: tf-idf values of in_word, {filename1: tfidf1, filename2: tfidf2, ...}
    """
    tfidf_dic = {}
    idf = computeIDF(in_word, words_num_dic)
    tf_dic = computeTF(in_word, words_num_dic)
    for filename, tf in tf_dic.items():
        tfidf_dic[filename] = tf * idf
    return tfidf_dic

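def text_save(filename, data, word):
    """
    Append the ranked results for the search word to the file filename.

    :param filename: name of the output file
    :param data: list of (document name, tf-idf) tuples, as produced by sortOut()
    :param word: the keyword
    """
    with open("D:/study/B4/" + filename, 'a') as fp:  # append mode; the file is closed automatically
        fp.write("Keyword: " + str(word) + '\n')
        for line in data:
            for a in line:
                s = str(a)  # convert non-string values before writing
                fp.write('\t' + s)
                fp.write('\t')
            fp.write('\n')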

def sortOut(dic):
    """
    Sort the dictionary items by value in descending order, keeping the values.

    :param dic: dictionary
    :return: list of (key, value) tuples
    """
    return sorted(dic.items(), key=lambda item: item[1], reverse=True)

if __name__ == '__main__':
    # Load the corpus
    print("\tDefault corpus path: D:/study/B4/data")
    print("\tSearch results path: D:/study/B4/result")
    path = "D:/study/B4/data"   # corpus path
    all_docu_dic = loadDataSet(path)  # load the corpus into the program
    words_set, words_num_dic = dealDataSet(all_docu_dic)    # returns 1. the vocabulary (stop words removed), 2. per-document word counts
    n = 0   # number of searches so far
    a = ''  # loop-control variable; entering '0' ends the program
    while a != '0':     # input() returns a str, so compare with '0' rather than the int 0
        in_words = input("Search: ")
        input_list = re.split(r"[!?\s'.),(+\-=。:]", in_words)   # "\-" keeps the hyphen literal
        k = 0  # number of valid keywords in this input
        n += 1
        for word in input_list:
            if word in words_set:
                k += 1
                tfidf_dic = computeTFIDF(word, words_num_dic)  # unsorted tf-idf dictionary for the word
                top5 = sortOut(tfidf_dic)[0:5]  # the five most relevant documents
                # Console output
                print("Keyword: " + word)
                print(top5)
                # File output
                text_save("result" + str(n) + ".txt", top5, word)  # append the ranked results to the file
        if k == 0:
            print("No search results")
        a = input("Enter anything to search again, or '0' to quit: ")
        print("-------------------------------------")