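Contents

Project Gitee repository
Basic project demo
Outline
1. Project background
2. How a search engine works at a high level
3. Technology stack and project environment
4. Forward index vs. inverted index: how the search engine works in detail
5. The Parser module: stripping tags and cleaning data
6. The Index module: building the indexes
7. The Searcher module
- 7.1 Writing the Search code
8. The http_server module
9. The front-end code
10. Utility classes
11. Adding logging and deploying the service to Linux
Directions for extending the project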

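Project Gitee repository

Project address (copy it into your browser):
https://gitee.com/xiao-jiheng/boost_search_engine

Basic project demo

In short: once the server side starts the service, a client can open the page in a browser and search. The searchable content is the Boost library documentation.

[Figure: basic directory layout of the project]

Commands for starting the project:

[xjh@VM-12-10-centos boost_searcher]$ make          # build the whole project
[xjh@VM-12-10-centos boost_searcher]$ ./parser      # after a successful build, strip the tags from the web pages
[xjh@VM-12-10-centos boost_searcher]$ ./http_server # start the server

Why strip tags: this cleans the page content. HTML tags are not something we want to search, so they are removed.

Starting the service: the server provides the page resources that users search, and it has to build an index over those resources first.
How do we verify that the server started successfully?

[xjh@VM-12-10-centos boost_searcher]$ netstat -nltp # list the listening TCP sockets

The user supplies a search keyword. The service listens on port 8081 by default, and the search page looks like this:
[Figure: search page]

The search results look roughly like this:
[Figure: search results]

Opening any of the result links leads to a page that contains the search keyword.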

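Outline

Project background
How a search engine works at a high level
Technology stack and project environment
Forward index vs. inverted index: how the search engine works in detail
Writing the Parser module (tag stripping and data cleaning)
Writing the Index module
Writing the Searcher module
Writing the http_server module
Writing the front-end module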

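1. Project background

Companies such as Baidu, Sogou, 360 Search, and the Toutiao news client run full web search. Building something like that ourselves is not realistic:
the technical barrier is high, and merely storing the enormous volume of web resources is already a problem,
let alone taking the user's keywords, ranking results by them, and presenting page content.
Site search is different: the data is more vertical (focused on one domain) and the volume is much smaller.
The Boost website does not offer site search, so we build one ourselves.

What we write is a site search whose searchable resource is the content of the Boost library documentation.
Each result shows three key pieces of information: the title, a summary of the page content, and the URL.
Clicking a result jumps to the corresponding page.
Unlike Baidu, there are no images, videos, or advertisements, and keywords are not highlighted in red.
Our site search only applies the basic principles of a search engine.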

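2. How a search engine works at a high level

On the server side:
The resources to be searched are prepared in advance. A crawler fetches pages from the network and stores them on the server's disk.
The crawled pages are then cleaned: tags are removed and only the key information is kept.
An index is built over the crawled content so that lookups on the server's resources are fast.

On the client side, that is, the browser: the keyword is sent with a GET request. The server processes the request, looks up the keyword, collects the matching resources, and returns them to the user as html.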

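3. Technology stack and project environment

Technology stack: C/C++, C++11, STL, the Boost libraries, Jsoncpp, cppjieba, cpp-httplib;
html5, css, js, jQuery, Ajax (front-end technology is used only lightly here; the project is mainly back-end work)
Project environment: CentOS 7 cloud server, vim / gcc (g++) / Makefile, VS Code

cppjieba: a word-segmentation tool. It splits the user's search query into keywords, which are then looked up one by one. The server also uses it to segment document text when building the index.
cpp-httplib: an open-source library for building an HTTP server directly.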

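4. Forward index vs. inverted index: how the search engine works in detail

Article on forward and inverted indexes (a hyperlink in the original post)
That article, found through a web search, explains the two concepts. I give only a brief explanation here, not a detailed treatment of the definitions.
What I cover is the characteristics of each index and the role it plays in the search engine.

Forward index: a mapping from document ID to document content. Given a document ID, it finds the document content (some describe it as finding the keywords inside the document).
Inverted index: the reverse mapping, from a keyword to the IDs of the documents that contain it.

Searches are always made by keyword,
so the server must segment the document content into words. The purpose of segmentation is to make building the inverted index possible.

Segmentation:
Document 1 [雷军买了四斤小米, "Lei Jun bought four jin of millet"]: 雷军 / 买 / 四斤 / 小米 / 四斤小米
Document 2 [雷军发布了小米手机, "Lei Jun released the Xiaomi phone"]: 雷军 / 发布 / 小米 / 小米手机

Here document 1 is split into [雷军] [买] [四斤] [小米] [四斤小米]. This is only an example; there are many segmentation strategies.
The inverted index is built from these segmentation results so that users can find content by keyword.

A simulated lookup:
user enters 小米 -> look it up in the inverted index -> extract the document IDs (1, 2) -> use the forward index -> find the document content ->
summarize each result as title + content (desc) + url -> build the response.

Note: when writing the code, we build the inverted index by segmenting the document content and indexing the resulting words.
When a user searches, we segment the query in the same way, use each resulting keyword to find the matching document IDs in the inverted index, and then use those IDs to fetch the document content from the forward index.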
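
The relationship between the two indexes can be sketched with standard containers only. This is a toy model, not the project's Index class: the name ToyIndex is made up for illustration, documents are assumed to be already segmented into space-separated words (standing in for cppjieba), and document IDs start at 0 because the vector subscript is used as the ID.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Toy model (hypothetical name): documents arrive pre-segmented, words separated by spaces.
struct ToyIndex {
  // Forward index: document ID (the vector subscript) -> document text.
  std::vector<std::string> forward;
  // Inverted index: word -> IDs of the documents that contain it.
  std::unordered_map<std::string, std::vector<std::uint64_t>> inverted;

  void Add(const std::string &segmented_doc) {
    const std::uint64_t id = forward.size();
    forward.push_back(segmented_doc);
    std::istringstream iss(segmented_doc);
    std::string word;
    while (iss >> word) {
      std::vector<std::uint64_t> &ids = inverted[word];
      // Record each document at most once per word.
      if (ids.empty() || ids.back() != id) ids.push_back(id);
    }
  }

  // Inverted index first (word -> IDs), then forward index (ID -> text).
  std::vector<std::string> Lookup(const std::string &word) const {
    std::vector<std::string> hits;
    auto it = inverted.find(word);
    if (it == inverted.end()) return hits;
    for (std::uint64_t id : it->second) hits.push_back(forward[id]);
    return hits;
  }
};
```

After adding "leijun buy xiaomi" and "leijun release xiaomi phone", Lookup("xiaomi") returns both documents, which mirrors the simulated lookup described above.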

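5. The Parser module: stripping tags and cleaning data

First download the Boost library to Linux so that it can serve as the server's search data.

Boost website: https://www.boost.org/
// For now only the html files under boost_1_78_0/doc/html are needed; the index is built from them

On the website, find the download entry and download that version to your desktop (any version works; 1_78_0 is simply the one I used).
Then upload it with:

[xjh@VM-12-10-centos boost_search]$ rz -E # upload the Boost archive from the desktop to Linux

Once the upload succeeds, extract it:

tar -zxvf boost_1_78_0.tar.gz # extract the archive

This gives the full content of the Boost release.

For site search, however, we only use the resources under this path:

boost_1_78_0/doc/html/

It holds the Boost documentation pages, which are the resources this project can search.

Copy the contents of that directory into data/input; this is what our Boost search engine searches.

The remaining work is to take the content of data/input and build the index from it.

Create a file parser.cc whose main job is to strip the tags.

The tag-free content is saved in raw.txt.

Goal: strip the tags from every document under data/input and write all of them into the same raw.txt file.
Each document becomes one line of raw.txt, so the cleaned text of a document must not contain any \n. Within a line, the three fields are separated by \3:
title\3content\3url\n title\3content\3url\n ...

Why \3: it is a non-printable control character, so it will not pollute our data source.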

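5.1 Basic structure of the parser code

This is the basic structure of parser.cc.
The file's main job is data cleaning for all the Boost html documents that will be searched.
Steps:

Read the names of all html documents under const std::string src_path = "data/input"; into a vector, std::vector<std::string> files_list.
Read each html document, that is, iterate over every element of files_list, strip its tags, and store the three main pieces of information (title, content, and URL) in the vector std::vector<DocInfo_t> results;.
Write the tag-free document information from std::vector<DocInfo_t> results into the file const std::string output = "data/raw_html/raw.txt";.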

#include <iostream>
#include <string>
#include <vector>
#include <boost/filesystem.hpp>
#include <fstream>
#include "util.hpp"

const std::string src_path = "data/input";
const std::string output = "data/raw_html/raw.txt";

typedef struct DocInfo
{
  std::string title;   // document title
  std::string content; // document content
  std::string url;     // URL of the document on the official site
} DocInfo_t;

// A small convention for function parameters:
/*
 * const& : input parameter
 * *      : output parameter
 * &      : input/output parameter
 * */

bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list);

bool ParseHtml(const std::vector<std::string> &files_list, std::vector<DocInfo_t> *results);

bool SaveHtml(const std::vector<DocInfo_t> &results, const std::string &output);

int main()
{
  std::vector<std::string> files_list; // names of all html files under src_path

  // 1. Recursively collect every file name (with path) under src_path into files_list for later reading
  if (!EnumFile(src_path, &files_list))
  {
    std::cerr << "enum file name error" << std::endl;
    return 1;
  }

  // 2. Read each html file and parse it into a DocInfo_t
  std::vector<DocInfo_t> results;

  if (!ParseHtml(files_list, &results))
  {
    std::cerr << "parse html error" << std::endl;
    return 2;
  }

  // 3. Write every parsed DocInfo_t to the output file, using \3 as the separator
  if (!SaveHtml(results, output))
  {
    std::cerr << "save html error" << std::endl;
    return 3;
  }

  return 0;
}

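5.2 Enumerating every html file name with Boost

With the basic structure of parser.cc from section 5.1 in place, the next step is to fill in each part.
Section 5.2 implements bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list);.
The function enumerates all files under src_path and stores the names of those ending in .html in files_list.

Put simply, it collects the paths of the files under that directory that end in .html.

The implementation: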

bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list)
{
  namespace fs = boost::filesystem;
  fs::path root_path(src_path); // root directory where the recursive search starts
  // Check that the search path exists
  if (!fs::exists(root_path))
  {
    std::cerr << src_path << " not exists " << std::endl;
    return false;
  }
  // Walk root_path recursively
  fs::recursive_directory_iterator end; // default-constructed iterator marks the end of the traversal
  for (fs::recursive_directory_iterator it(root_path); it != end; it++)
  {
    // Only regular files are wanted; directories and other file types are skipped
    if (!fs::is_regular_file(*it))
      continue;

    // A regular file must also be an html file
    if (it->path().extension() != ".html") // extension() returns the file name suffix
      continue;

    // std::cout << "debug: " << it->path().string() << std::endl;
    // At this point the entry is a regular file ending in .html

    files_list->push_back(it->path().string());
  }
  return true;
}

The function relies on several functions provided by Boost; I used the ones from Boost 1.53.

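5.3 Writing the html parsing code

Once we have the file name of every html document, each document has to be parsed.
Before parsing, the document is read from disk by its file name.
Parsing extracts three things: the title, the content, and the URL.

This part implements bool ParseHtml(const std::vector<std::string> &files_list, std::vector<DocInfo_t> *results);.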

// Helpers used by ParseHtml; declared first because they are defined below it
static bool ParseTitle(const std::string &file, std::string *title);
static bool ParseContent(const std::string &file, std::string *content);
static bool ParseUrl(const std::string &file_path, std::string *url);

bool ParseHtml(const std::vector<std::string> &files_list, std::vector<DocInfo_t> *results)
{
  for (const std::string &file : files_list)
  {
    // 1. Read the content of the file named file
    std::string result;
    if (!ns_util::FileUtil::ReadFile(file, &result))
      continue;

    DocInfo_t doc;
    // 2. Parse the title
    if (!ParseTitle(result, &doc.title))
      continue;
    // 3. Parse the content
    if (!ParseContent(result, &doc.content))
      continue;
    // 4. Build the url
    if (!ParseUrl(file, &doc.url))
      continue;

    // Reaching this point means the file was parsed successfully and the result is in doc
    // results->push_back(doc); // detail: push_back(doc) copies the whole document, which is slow
    results->push_back(std::move(doc)); // doc is large and no longer needed here, so move it instead of copying
  }
  return true;
}
//*****************************************************//
static bool ParseTitle(const std::string &file, std::string *title)
{
  std::size_t begin = file.find("<title>");
  if (begin == std::string::npos)
    return false;

  std::size_t end = file.find("</title>");
  if (end == std::string::npos)
    return false;

  begin += std::string("<title>").size();

  if (begin > end)
    return false;

  *title = file.substr(begin, end - begin);
  return true;
}

// file is the raw (not yet parsed) content of one html file
static bool ParseContent(const std::string &file, std::string *content)
{
  // Strip tags with a simple state machine
  enum status
  {
    LABLE,  // inside a tag
    CONTENT // inside text content
  };

  enum status s = LABLE; // an html page is assumed to start with a tag
  for (char c : file)    // visit every character of the page
  {
    switch (s)
    {
      // In the LABLE state we are inside a tag, so nothing is kept.
      // The state ends when '>' is read.
    case LABLE:
      if (c == '>')
        s = CONTENT;
      break;
      // In the CONTENT state each character is appended to content.
      // The state ends as soon as '<' is read.
    case CONTENT:
      if (c == '<')
        s = LABLE;
      else
      {
        // \n is not kept, because it is used later as the separator between parsed documents
        if (c == '\n')
          c = ' '; // detail: c is a copy, not a reference, so the source string file is not modified
        content->push_back(c);
      }
      break;

    default:
      break;
    }
  }
  return true;
}

// file_path is the path of the html document under ./data/input/ on our Linux machine
static bool ParseUrl(const std::string &file_path, std::string *url)
{
  std::string url_head = "https://www.boost.org/doc/libs/1_78_0/doc/html";
  std::string url_tail = file_path.substr(src_path.size());
  *url = url_head + url_tail;

  return true;
}

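The code has four basic steps:
1. Read each html document.
2. Parse the html title.
3. Parse the html content.
4. Build the html URL.

How is each html document read?
Open it by its file name (with path) and read it line by line.

How is the title parsed?
The title sits inside the <title>...</title> tag: find the positions of the opening and closing tags and take the substring between them.

How is the content parsed?

With a simple state machine.
Walk through the html text from the beginning. A left angle bracket < marks the start of a tag, which is also where the current run of content ends. A right angle bracket > marks the end of the tag, which is also where real content starts.

How is the URL built?

The official Boost documentation and the files we downloaded have corresponding paths:

Official URL example:            https://www.boost.org/doc/libs/1_78_0/doc/html/accumulators.html
Path in the downloaded archive:  boost_1_78_0/doc/html/accumulators.html
Path inside our project:         data/input/accumulators.html   // doc/html/* of the downloaded Boost was copied into data/input/
url_head = "https://www.boost.org/doc/libs/1_78_0/doc/html";
url_tail = the file path with the leading data/input removed, e.g. /accumulators.html
url = url_head + url_tail;   // this forms a link to the official site

How is the result saved?

Write every element of the parsed array std::vector<DocInfo_t> results; into the file const std::string output = "data/raw_html/raw.txt";.
Within each document, the title, content, and URL are separated by \3, and each document ends with \n, which makes the file easy to read back later.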
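
The article declares SaveHtml but does not list its body. Below is a minimal sketch of what it might look like, assuming the record layout title\3content\3url\n that the index module reads back later. The helper SerializeDoc is a name introduced here for illustration, and the struct simply mirrors DocInfo_t from parser.cc.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Mirrors the DocInfo_t struct from parser.cc.
struct DocInfo_t {
  std::string title;
  std::string content;
  std::string url;
};

// Hypothetical helper: one record per line, fields separated by \3.
std::string SerializeDoc(const DocInfo_t &doc) {
  const char kSep = '\3';
  std::string line;
  line += doc.title;
  line += kSep;
  line += doc.content;
  line += kSep;
  line += doc.url;
  line += '\n';
  return line;
}

// Sketch of SaveHtml under the assumption above; not the author's code.
bool SaveHtml(const std::vector<DocInfo_t> &results, const std::string &output) {
  // Binary mode so the bytes are written exactly as serialized.
  std::ofstream out(output, std::ios::out | std::ios::binary);
  if (!out.is_open()) return false;
  for (const DocInfo_t &doc : results) {
    const std::string line = SerializeDoc(doc);
    out.write(line.data(), static_cast<std::streamsize>(line.size()));
  }
  return out.good();
}
```

Keeping the serialization in its own function makes the format easy to check in isolation, and one record per line lets the reader side use std::getline.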

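6. The Index module: building the indexes

In section 5 we cleaned the data to be searched and wrote all the cleaned html documents into
one file, const std::string output = "data/raw_html/raw.txt";.
Next we build the indexes from the content of that file.

This module lives in index.hpp.

Its structure is roughly:

Design the forward-index node struct DocInfo and the inverted-index node InvertedElem.
Design the inverted index std::unordered_map<std::string, InvertedList> inverted_index; and the forward index std::vector<DocInfo> forword_index;.
Provide the forward-index lookup DocInfo* GetForwordIndex(uint64_t doc_id); and the inverted-index lookup InvertedList* GetInvertedList(std::string& word);.
Provide the function that builds the indexes, bool BulidIndex(const std::string& input);.
Provide the function that builds the inverted index, bool BuildInvertedIndex(const DocInfo& doc);, and the one that builds the forward index, DocInfo* BulidForWordIndex(const std::string& line);.
Make the index a singleton.

The functions and the design reasoning are explained below: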

#pragma once
#include <cstdint>
#include <string>
#include <vector>
#include <unordered_map>
#include<iostream>
#include<fstream>
#include<mutex>
#include"util.hpp"
#include"log.hpp"


namespace ns_index
{
    // The forward index maps a document ID to the document content, so the content is described by a struct: DocInfo
    struct DocInfo // document content
    {
        std::string title;
        std::string content;
        std::string url;
        uint64_t doc_id;
    };

    // The inverted index maps a keyword to document IDs, so each entry is described by a struct: InvertedElem
    struct InvertedElem
    {
        uint64_t doc_id; // same type as DocInfo::doc_id
        std::string word;
        int weight;
    };

    // Inverted list (posting list)
    typedef std::vector<InvertedElem> InvertedList;

    class Index
    {
    private:
        // Forward index: the vector subscript is the document ID, so ID -> document content is a direct lookup
        std::vector<DocInfo> forword_index;

        // Inverted index: keyword -> the documents that contain it.
        // One keyword maps to many document IDs: given a keyword we get a vector
        // whose elements are inverted-index nodes, each carrying a document ID.
        std::unordered_map<std::string, InvertedList> inverted_index;

        static Index* instance;
        static std::mutex mtx;
    private:
        Index(){}
        Index(const Index& ) = delete;
        Index& operator=(const Index& )=delete;
    public:
        static Index* GetInstance()
        {
            if(nullptr == instance)
            {
                mtx.lock();
                if(nullptr == instance)
                {
                    instance = new Index();
                }
                mtx.unlock();
            }
            return instance;
        }
        ~Index(){}
    public:
        // Find the document content by ID (that is, find the forward-index node for doc_id)
        DocInfo* GetForwordIndex(uint64_t doc_id);
        // Find the inverted list for a keyword
        InvertedList* GetInvertedList(std::string& word);
        // Build both indexes from data/raw_html/raw.txt, the file produced by parser.cc
        bool BulidIndex(const std::string& input);
    private:
        // Build a DocInfo from one line and append it to the forward index vector<DocInfo>.
        // Afterwards the DocInfo can be found directly by doc_id.
        DocInfo* BulidForWordIndex(const std::string& line); // line is one line of raw.txt
        // Take one DocInfo from the forward index and build its inverted-index entries
        bool BuildInvertedIndex(const DocInfo& doc);
    };
    Index* Index::instance = nullptr;
    std::mutex Index::mtx;
} // namespace ns_index
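
A side note on the singleton: GetInstance above reads instance outside the lock without any atomic operation, which the C++11 memory model strictly treats as a data race. One common alternative is a function-local static, whose initialization is guaranteed to be thread-safe since C++11. The sketch below uses a made-up class name, IndexSketch, and is not the project's code; it returns a reference rather than a pointer, so callers would need adjusting.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the Index class, showing only the singleton part.
class IndexSketch {
 public:
  static IndexSketch &GetInstance() {
    // Constructed on first use; C++11 guarantees this happens exactly once,
    // even when several threads call GetInstance concurrently.
    static IndexSketch instance;
    return instance;
  }

  IndexSketch(const IndexSketch &) = delete;
  IndexSketch &operator=(const IndexSketch &) = delete;

 private:
  IndexSketch() = default;
  std::vector<std::string> forward_index_;  // placeholder member
};
```

With this form no mutex or static pointer is needed, and there is nothing to delete at shutdown.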

6.1 Implementing the forward-index and inverted-list lookups

The implementations:

        // Find the document content by ID (that is, find the forward-index node for doc_id)
        DocInfo* GetForwordIndex(uint64_t doc_id)
        {
            if(doc_id >=forword_index.size())
            {
                std::cerr<<"doc_id out range error!"<<std::endl;
                return nullptr;
            }
            return &forword_index[doc_id];
        }

        // Find the inverted list for a keyword
        InvertedList* GetInvertedList(std::string& word)
        {
            auto iter = inverted_index.find(word);
            if(iter == inverted_index.end())
            {
                std::cerr<<word<<" have no InvertedList"<<std::endl;
                return nullptr;
            }
            return &(iter->second); // reuse the iterator instead of looking the key up a second time
        }

6.2 Implementing the index-building function

Building the index is fairly involved, so the work is split into three parts: 1. read the file, 2. build the forward index, 3. build the inverted index from the forward index.

        // Build both indexes from data/raw_html/raw.txt, the file produced by parser.cc
        bool BulidIndex(const std::string& input)
        {
            // Read input line by line and index each line

            // 1. Open the file to be indexed
            std::ifstream in(input,std::ios::in | std::ios::binary);
            if(!in.is_open())
            {
                std::cerr<<"open "<<input<<" failed!"<<std::endl;
                return false;
            }
            // 2. Index every line (each line is one parsed html document)
            std::string line; // line --> title\3content\3url\n
            int count =0;
            while(std::getline(in,line))
            {
                // 3. Build the forward index
                DocInfo* doc = BulidForWordIndex(line);
                if(doc == nullptr)
                {
                    std::cerr<<"sorry:...\n"<<line<<"\nerror"<<std::endl;//for debug
                    continue;
                }
                // 4. Build the inverted index from the forward-index entry
                BuildInvertedIndex(*doc);
                //for debug
                count++;
                if(count %50==0)
                {
                  LOG(NORMAL,"documents indexed so far: "+std::to_string(count));
                }
            }
            in.close();
            return true;
        }

6.3 Implementing the forward-index builder

Building the forward index is a sub-step of the index-building function.
That function reads raw.txt line by line and creates one forward-index entry per line. More precisely, every line is one parsed html document, and each such document gets one forward-index entry.

Each line has to be split, because data cleaning reduced every html document to three parts:
title, content, and URL, separated by \3. So we split on \3, take the three fields, and append them to the forward-index vector.

        // Build a DocInfo from one line and append it to the forward index vector<DocInfo>.
        // Afterwards the DocInfo can be found directly by doc_id.
        DocInfo* BulidForWordIndex(const std::string& line) // line is one line of raw.txt
        {
            // Parse line: split it into title, content, url
            // Parsing here is just string splitting
            std::vector<std::string> results; // holds the split fields
            const std::string sep = "\3";
            ns_util::StringUtil::Split(line,&results,sep);

            if(results.size() !=3)
                return nullptr;
            // Fill a DocInfo with the parsed fields
            DocInfo doc;
            doc.title = results[0];
            doc.content = results[1];
            doc.url = results[2];

            doc.doc_id = forword_index.size();
            // Append the DocInfo to vector<DocInfo>
            forword_index.push_back(std::move(doc));
            return &forword_index.back();
        }
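
One detail worth knowing: StringUtil::Split (listed in section 10) calls boost::split with token_compress_on, which merges adjacent separators. A record whose content field is empty, such as "t\3\3u", would therefore split into two fields and be rejected by the size check above. Whether that is desirable depends on the data. For comparison, here is a sketch of a standard-library splitter that keeps empty fields; SplitKeepEmpty is a name made up for this example and is not part of the project.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits line on the single character sep and keeps empty fields.
std::vector<std::string> SplitKeepEmpty(const std::string &line, char sep) {
  std::vector<std::string> fields;
  std::size_t start = 0;
  while (true) {
    const std::size_t pos = line.find(sep, start);
    if (pos == std::string::npos) {
      fields.push_back(line.substr(start));  // last field, possibly empty
      break;
    }
    fields.push_back(line.substr(start, pos - start));
    start = pos + 1;
  }
  return fields;
}
```

With this version a line always yields exactly (number of separators + 1) fields, so the check results.size() != 3 only rejects malformed records.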

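6.4 Implementing the inverted-index builder

How exactly is the inverted index built?

1. The forward index gives us the document's title, content, and URL.
2. Segment the title and the content into keywords and count how often each occurs, building a keyword-to-frequency map.
   Segmentation uses cppjieba, an open-source header-only library.
3. From the keywords, build the inverted lists and thus the inverted index.

See the code for the detailed analysis and implementation:

        // Take one DocInfo from the forward index and build its inverted-index entries
        bool BuildInvertedIndex(const DocInfo& doc)
        {
          // doc is [title content url doc_id]; here we link keywords to doc

          // 1. Segment title and content (segmentation yields the keywords) and count word frequencies
          struct word_cnt
          {
            int title_cnt;   // occurrences in the title
            int content_cnt; // occurrences in the content
            word_cnt():title_cnt(0),content_cnt(0){}
          };

          std::unordered_map<std::string,word_cnt> word_map; // keyword -> frequencies for title and content
          // Segment the title
          std::vector<std::string> title_words; // segmentation result for the title
          ns_util::JiebaUtil::CurString(doc.title,&title_words);

          // Count frequencies over the title words
          for(std::string word : title_words) // no & here: the copy is lower-cased, the original stays untouched
          {
            boost::to_lower(word);
            word_map[word].title_cnt++;
          }
          // Segment the content
          std::vector<std::string> content_words;
          ns_util::JiebaUtil::CurString(doc.content,&content_words);

          for(std::string word : content_words)
          {
            boost::to_lower(word);
            word_map[word].content_cnt++;
          }

#define X 10
#define Y 1
          /* Detail: the user may type hello, HELLO, HEllO, ... Should these be treated differently?
           * Real search engines are case-insensitive: whatever case you search in, the same results come back.
           *
           * So the words found in documents must ignore case, both when counting frequencies
           * and when building the inverted index.
           *
           * Conclusion: for the user, search keywords are case-insensitive.
           *       In the code: lower-case the segmentation results when building the index,
           *       and lower-case the user's query words before looking them up.
           *       Whatever case the user types, the lower-cased word is what is looked up in the inverted index.
           * */
          // Build the inverted lists from the keywords of title and content
          for(auto& word_pair : word_map)
          {
              InvertedElem item; // one element of the inverted index
              item.doc_id = doc.doc_id; // we are indexing a single document, so the ID is that document's ID
              item.word = word_pair.first; // keyword from segmentation
              item.weight = X*word_pair.second.title_cnt+Y*word_pair.second.content_cnt; // relevance

              // inverted_index is a map from a keyword to one or more items
              // (one keyword may appear in many documents).
              // operator[] returns the existing list, or inserts an empty one if the key is new
              InvertedList &inverted_list = inverted_index[word_pair.first];
              // Append item to the inverted list
              inverted_list.push_back(std::move(item));
          }
            return true;
        }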
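
To make the weight formula concrete, here is a small self-contained sketch of the same computation, weight = 10 * title_cnt + 1 * content_cnt. It assumes already lower-cased, space-separated input in place of cppjieba and boost::to_lower, and the names ComputeWeights and WordCount are introduced only for this illustration.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>

struct WordCount {
  int title_cnt = 0;
  int content_cnt = 0;
};

// Returns word -> weight for one document, using whitespace as a stand-in tokenizer.
std::unordered_map<std::string, int> ComputeWeights(const std::string &title,
                                                    const std::string &content) {
  const int kTitleWeight = 10;   // same role as X in the listing above
  const int kContentWeight = 1;  // same role as Y in the listing above

  std::unordered_map<std::string, WordCount> counts;
  std::string word;
  std::istringstream title_stream(title);
  while (title_stream >> word) counts[word].title_cnt++;
  std::istringstream content_stream(content);
  while (content_stream >> word) counts[word].content_cnt++;

  std::unordered_map<std::string, int> weights;
  for (const auto &kv : counts) {
    weights[kv.first] = kTitleWeight * kv.second.title_cnt +
                        kContentWeight * kv.second.content_cnt;
  }
  return weights;
}
```

For the title "boost filesystem" and the content "filesystem path filesystem", the word filesystem occurs once in the title and twice in the content, so its weight is 10 * 1 + 1 * 2 = 12, while boost gets 10 and path gets 1.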

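7. The Searcher module

So far we have built the indexes over the back-end data. Building an index is not the goal in itself; the goal is the search service it enables. That requires a new module, searcher.hpp, which takes the keywords a user submits, performs the search, and returns the results.

Basic structure:

#include "index.hpp"
#include <algorithm>
#include <jsoncpp/json/json.h> // jsoncpp header; the path depends on how jsoncpp is installed (it may also be <json/json.h>)

  // Used to de-duplicate search results:
  // after jieba segments the query, several words may map to the same document, and those hits should be merged
  struct InvertedElemPrint
  {
    uint64_t doc_id;                // several words map to one doc_id
    int weight;                     // accumulated weight of those words
    std::vector<std::string> words; // the words that hit this document
    InvertedElemPrint() : doc_id(0), weight(0) {}
  };

namespace ns_searcher{
  class Searcher{
    private:
      ns_index::Index *index; // the index used for lookups
    public:
      Searcher(){}
      ~Searcher(){}
    public:
     void InitSearcher(const std::string &input)
    {
      // 1. Get or create the index object
      index = ns_index::Index::GetInstance();
      LOG(NORMAL, "got the index singleton...");
      // 2. Build the indexes through the index object
      index->BulidIndex(input);
      LOG(NORMAL, "built the forward and inverted indexes...");
    }
      // query: the search keywords
      // json_string: the search result returned to the user's browser
      void Search(const std::string &query, std::string *json_string)
     {
        // 1. [Segment] split query into words the way the searcher requires
        // 2. [Trigger] look up each word in the index
        // 3. [Merge and sort] combine the hits and sort them by relevance (weight), descending
        // 4. [Build] turn the hits into a json string with jsoncpp
     }
 };
}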

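7.1 Writing the Search code

The main job of this code is to de-duplicate the search results.
After jieba segments the user's query, several different keywords may lead to the same document through their inverted lists.
In other words, different keywords can map to the same document ID, and such repeated documents must be merged.
Only one copy of each document is kept, even when it was reached through different keywords.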
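
The merging idea can be sketched in isolation with standard containers. The types Posting and MergedHit and the function MergeHits are names invented for this illustration; in the project the corresponding types are InvertedElem and InvertedElemPrint.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Posting {
  std::uint64_t doc_id;
  int weight;
};

struct MergedHit {
  std::uint64_t doc_id = 0;
  int weight = 0;                  // sum of the weights of all matching words
  std::vector<std::string> words;  // the query words that hit this document
};

// lists holds, for each query word, that word's inverted list.
std::vector<MergedHit> MergeHits(
    const std::vector<std::pair<std::string, std::vector<Posting>>> &lists) {
  // Keyed by doc_id, so a document reached through several words is stored once.
  std::unordered_map<std::uint64_t, MergedHit> by_doc;
  for (const auto &entry : lists) {
    for (const Posting &p : entry.second) {
      MergedHit &hit = by_doc[p.doc_id];
      hit.doc_id = p.doc_id;
      hit.weight += p.weight;
      hit.words.push_back(entry.first);
    }
  }

  std::vector<MergedHit> merged;
  merged.reserve(by_doc.size());
  for (auto &kv : by_doc) merged.push_back(std::move(kv.second));

  // Highest accumulated weight first.
  std::sort(merged.begin(), merged.end(),
            [](const MergedHit &a, const MergedHit &b) { return a.weight > b.weight; });
  return merged;
}
```

The important point is that the map lives outside the loop over query words; a map created per word could only merge duplicates within a single inverted list.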

    /*
     *  Provides the search service to the user.
     *  query is the search keywords
     *  json_string is the search result returned to the user
     * */
    void Search(const std::string &query, std::string *json_string)
    {
      // [Segment] split the user's query into words
      std::vector<std::string> words;
      ns_util::JiebaUtil::CurString(query, &words);

      // [Trigger] look up every word in the index.
      // tokens_map holds one entry per document. It is declared outside the loop
      // so that hits coming from different words are merged into the same entry.
      std::unordered_map<uint64_t, InvertedElemPrint> tokens_map;
      for (std::string word : words) // every keyword obtained from the user's query
      {
        boost::to_lower(word); // lower-case before searching so that the search is case-insensitive

        // Find the inverted list for this keyword
        ns_index::InvertedList *inverted_list = index->GetInvertedList(word);
        if (nullptr == inverted_list) // no inverted list for this keyword, so skip it
          continue;
        // An inverted list gives document IDs, which lead to the content through the forward index

        // Visit every node of the inverted list (InvertedElem: doc_id, weight, word)
        for (const auto &elem : *inverted_list)
        {
          InvertedElemPrint &item = tokens_map[elem.doc_id]; // entry for this doc_id, created on first use

          item.doc_id = elem.doc_id;
          item.weight += elem.weight;
          item.words.push_back(elem.word); // all nodes in one inverted list carry the same keyword
        }
      }

      // Collect the de-duplicated entries
      std::vector<InvertedElemPrint> inverted_list_all;
      for (auto &item : tokens_map)
      {
        inverted_list_all.push_back(std::move(item.second));
      }
      // [Merge and sort] sort the hits by relevance, descending
      std::sort(inverted_list_all.begin(), inverted_list_all.end(),
                [](const InvertedElemPrint &e1, const InvertedElemPrint &e2)
                { return e1.weight > e2.weight; });
      Json::Value root; // array of json objects, one per hit
      for (auto &item : inverted_list_all)
      {
        // Use the doc_id of the hit to fetch the document content
        ns_index::DocInfo *doc = index->GetForwordIndex(item.doc_id);
        if (nullptr == doc)
          continue;
        // doc holds the information of a document that matches the keywords
        Json::Value elem;
        elem["title"] = doc->title;
        elem["desc"] = GetDesc(doc->content, item.words[0]); // GetDesc builds the summary; its body is not listed in this article
        elem["url"] = doc->url;
        root.append(elem);
      }
      // Serialize the search result
      Json::FastWriter writer;
      *json_string = writer.write(root);
    }
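
Search calls GetDesc, whose body is not listed in the article. As an illustration only, one way to build such a summary is to find the first occurrence of the keyword, ignoring case, and keep a window of text around it. The function name MakeSnippet and the window sizes of 50 and 100 bytes are assumptions made for this sketch, not the author's implementation.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Returns a window of content around the first case-insensitive match of word,
// or an empty string when word does not occur.
std::string MakeSnippet(const std::string &content, const std::string &word,
                        std::size_t before = 50, std::size_t after = 100) {
  // Index words are lower-cased but the document text is not,
  // so the comparison has to ignore case.
  auto it = std::search(content.begin(), content.end(), word.begin(), word.end(),
                        [](char a, char b) {
                          return std::tolower(static_cast<unsigned char>(a)) ==
                                 std::tolower(static_cast<unsigned char>(b));
                        });
  if (it == content.end()) return "";

  const std::size_t pos = static_cast<std::size_t>(it - content.begin());
  const std::size_t start = pos > before ? pos - before : 0;
  const std::size_t end = std::min(content.size(), pos + word.size() + after);
  return content.substr(start, end - start) + "...";
}
```

The window is measured in bytes, so with UTF-8 text it can cut a multi-byte character at either edge; a production version would need to align the boundaries to character starts.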

8. The http_server module

This module exposes the search service over HTTP.
It uses the open-source library cpp-httplib.

#include "searcher.hpp"
#include "cpp-httplib/httplib.h"

const std::string input = "data/raw_html/raw.txt";
const std::string src_path = "./wwwroot"; // our web root directory
int main()
{
  ns_searcher::Searcher search;
  search.InitSearcher(input); // get the index singleton and build the indexes

  httplib::Server srv;
  srv.set_base_dir(src_path.c_str()); // static files are served from the web root

  // Handle the search url
  srv.Get("/s", [&search](const httplib::Request &req, httplib::Response &resp){
        if(!req.has_param("word")){
          resp.set_content("the url must carry the parameter word!","text/plain; charset=utf-8");
            return;
        }

      // 1. The url submitted by the user carries the keyword
      std::string word = req.get_param_value("word"); // read the parameter
      LOG(NORMAL,"user is searching for: "+word);

      // 2. Run the search
      std::string json_string;
      search.Search(word,&json_string);

      // 3. Return the search result to the user
      resp.set_content(json_string.c_str(),"application/json"); });

  LOG(NORMAL, "server started...");
  srv.listen("0.0.0.0", 8081);

  return 0;
}

9. The front-end code

The front end only provides a simple search box for the user.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <script src="http://code.jquery.com/jquery-2.1.1.min.js"></script>

    <title>Boost Search Engine</title>
    <style>
        /* Remove all default margins and padding (html box model) */
        * {
            /* outer margin */
            margin: 0;
            /* inner padding */
            padding: 0;
        }
        /* Make body fill 100% of the html element */
        html,
        body {
            height: 100%;
        }
        /* Class selector .container */
        .container {
            /* width of the div */
            width: 800px;
            /* center horizontally through the margins */
            margin: 0px auto;
            /* top margin, keeps a gap between the element and the top of the page */
            margin-top: 15px;
        }
        /* Compound selector: .search inside .container */
        .container .search {
            /* same width as the parent */
            width: 100%;
            /* height of 52px */
            height: 52px;
        }
        /* Tag selector: select the input tag first, then set its properties */
        /* The input height below does not include the border */
        .container .search input {
            /* float left */
            float: left;
            width: 600px;
            height: 50px;
            /* border: width, style, color */
            border: 1px solid black;
            /* remove the right border of the input box */
            border-right: none;
            /* left padding so the text does not touch the left border */
            padding-left: 10px;
            /* color and size of the text inside the input */
            color: #CCC;
            font-size: 14px;
        }
        /* Tag selector: select the button tag first, then set its properties */
        .container .search button {
            /* float left */
            float: left;
            width: 150px;
            height: 52px;
            /* button background color, #4e6ef2 */
            background-color: #4e6ef2;
            /* text color of the button */
            color: #FFF;
            /* font size */
            font-size: 19px;
            font-family:Georgia, 'Times New Roman', Times, serif;
        }
        .container .result {
            width: 100%;
        }
        .container .result .item {
            margin-top: 15px;
        }

        .container .result .item a {
            /* block element, takes a line of its own */
            display: block;
            /* remove the underline of the a tag */
            text-decoration: none;
            /* font size of the link text */
            font-size: 20px;
            /* font color */
            color: #4e6ef2;
        }
        .container .result .item a:hover {
            text-decoration: underline;
        }
        .container .result .item p {
            margin-top: 5px;
            font-size: 16px;
            font-family:'Lucida Sans', 'Lucida Sans Regular', 'Lucida Grande', 'Lucida Sans Unicode', Geneva, Verdana, sans-serif;
        }

        .container .result .item i{
            /* block element, takes a line of its own */
            display: block;
            /* cancel the italic style */
            font-style: normal;
            color: green;
        }
    </style>
</head>
<body>
    <div class="container">
        <div class="search">
            <input type="text" value="Enter a search keyword">
            <button onclick="Search()">Search</button>
        </div>
        <div class="result">
        </div>
    </div>
    <script>
        function Search(){
            // alert opens a browser pop-up
            // alert("hello js!");
            // 1. Read the input; $ is an alias for jQuery
            let query = $(".container .search input").val();
            console.log("query = " + query); // the browser console, useful for inspecting js data

            // 2. Send the http request; ajax is the jQuery function for exchanging data with the back end
            $.ajax({
                type: "GET",
                // encodeURIComponent keeps spaces, &, and non-ASCII characters in the query intact
                url: "/s?word=" + encodeURIComponent(query),
                success: function(data){
                    console.log(data);
                    BuildHtml(data);
                }
            });
        }

        function BuildHtml(data){
            // Get the result element of the page
            let result_lable = $(".container .result");
            // Clear the previous search results
            result_lable.empty();

            for( let elem of data){
                // console.log(elem.title);
                // console.log(elem.url);
                let a_lable = $("<a>", {
                    text: elem.title,
                    href: elem.url,
                    // open in a new page
                    target: "_blank"
                });
                let p_lable = $("<p>", {
                    text: elem.desc
                });
                let i_lable = $("<i>", {
                    text: elem.url
                });
                let div_lable = $("<div>", {
                    class: "item"
                });
                a_lable.appendTo(div_lable);
                p_lable.appendTo(div_lable);
                i_lable.appendTo(div_lable);
                div_lable.appendTo(result_lable);
            }
        }
    </script>
</body>
</html>

10. Utility classes

The utility classes live in util.hpp.

#pragma once
#include <iostream>
#include <string>
#include <fstream>
#include <vector>
#include <unordered_set>
#include <mutex>

#include <boost/algorithm/string.hpp>
#include "cppjieba/Jieba.hpp"
#include"log.hpp"
namespace ns_util
{
  class FileUtil
  {
  public:
    static bool ReadFile(const std::string &file_name, std::string *out)
    {
      // Open the file for reading
      std::ifstream in(file_name, std::ios::in);
      if (!in.is_open())
      {
        std::cerr << "open file " << file_name << " error" << std::endl;
        return false;
      }
      // Opened successfully, so read the content line by line
      std::string line;

      while (std::getline(in, line))
      {
        *out += line;
      }
      in.close();
      return true;
    }
  };
  class StringUtil
  {
  public:
    static void Split(const std::string &target, std::vector<std::string> *out, const std::string &sep)
    {
      // boost split
      boost::split(*out, target, boost::is_any_of(sep), boost::token_compress_on);
    }
  };
  // cppjieba dictionary paths
  const char *const DICT_PATH = "./dict/jieba.dict.utf8";
  const char *const HMM_PATH = "./dict/hmm_model.utf8";
  const char *const USER_DICT_PATH = "./dict/user.dict.utf8";
  const char *const IDF_PATH = "./dict/idf.utf8";
  const char *const STOP_WORD_PATH = "./dict/stop_words.utf8"; // stop-word dictionary

  // Earlier version of the jieba wrapper, which does not remove stop words
  // class JiebaUtil
  //   {
  //     private:
  //       static cppjieba::Jieba jieba;
  //     public:
  //       // Segment src and store the words in out
  //       static void CurString(const std::string& src,std::vector<std::string>* out)
  //       {
  //         jieba.CutForSearch(src,*out);
  //       }
  //   };
  //     cppjieba::Jieba JiebaUtil:: jieba(DICT_PATH, HMM_PATH,USER_DICT_PATH,IDF_PATH,STOP_WORD_PATH);
  // }

  // Current version: removes stop words after segmentation
  class JiebaUtil
  {
  private:
    cppjieba::Jieba jieba;
    std::unordered_set<std::string> stop_words; // stop words; a set gives fast lookup
    static JiebaUtil *instance;

  private:
    JiebaUtil() : jieba(DICT_PATH, HMM_PATH, USER_DICT_PATH, IDF_PATH, STOP_WORD_PATH) {}
    JiebaUtil(const JiebaUtil &) = delete;
    JiebaUtil &operator=(const JiebaUtil &) = delete;

  public:
    static JiebaUtil *GetInstance()
    {
      // static: all callers must share one mutex; a plain local mutex would be
      // created anew on every call and would protect nothing
      static std::mutex mtx;
      if (nullptr == instance)
      {
        mtx.lock();
        if (nullptr == instance)
        {
          instance = new JiebaUtil();
          instance->InitJiebaUtil();
        }
        mtx.unlock();
      }
      return instance;
    }
    void InitJiebaUtil()
    {
      std::ifstream in(STOP_WORD_PATH);
      if (!in.is_open())
      {
        LOG(FATAL, "load stop word failed...");
        return;
      }
      std::string line;
      while (std::getline(in, line))
      {
        stop_words.insert(line);
      }
      in.close();
    }

    void CutStringHelper(const std::string &src, std::vector<std::string> *out)
    {
      jieba.CutForSearch(src, *out);
      // Remove stop words: walk through the vector of segmented words
      for (auto it = out->begin(); it != out->end();)
      {
        auto iter = stop_words.find(*it);
        if (iter !=stop_words.end())
        {
          // The current word is a stop word
          it = out->erase(it);
        }
        else
        {
          ++it;
        }
      }
    }

  public:
    static void CurString(const std::string &src, std::vector<std::string> *out)
    {
      GetInstance()->CutStringHelper(src, out);
    }
  };
   JiebaUtil *JiebaUtil::instance = nullptr;

}
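
CutStringHelper erases stop words one at a time, and every erase on a std::vector shifts all later elements, so a long document with many stop words costs quadratic time in the worst case. A possible alternative is the erase-remove idiom, which makes a single pass. The free function RemoveStopWords below is a name introduced for this sketch, not part of util.hpp.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Removes every word found in stop_words from words, preserving the order of the rest.
void RemoveStopWords(const std::unordered_set<std::string> &stop_words,
                     std::vector<std::string> *words) {
  words->erase(std::remove_if(words->begin(), words->end(),
                              [&stop_words](const std::string &w) {
                                return stop_words.count(w) > 0;
                              }),
               words->end());
}
```

std::remove_if moves the kept words to the front in one pass and the single erase call then trims the tail, so the result is the same as the loop in CutStringHelper.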

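11. Adding logging and deploying the service to Linux

Add a simple logging facility.
The log only prints some information to make debugging and monitoring easier.

Create a new file log.hpp with this code:

#pragma once

#include<iostream>
#include<string>
#include<ctime>

#define NORMAL  1
#define WARNING 2
#define DEBUG   3
#define FATAL   4

// #LEVEL turns the level name itself into a string, e.g. "NORMAL"
#define LOG(LEVEL,MESSAGE) log(#LEVEL,MESSAGE,__FILE__,__LINE__)

// inline so that the header can be included from more than one source file
inline void log(std::string level,std::string message,std::string file,int line)
{
  std::cout<<"level "<<"["<<level<<"]"
    <<"timestamp "<<"["<<time(nullptr)<<"]"
    <<"["<<message<<"]"
    <<"file "<<"["<<file<<"]"
    <<"line "<<"["<<line<<"]"
    <<std::endl;
}

Deploy it on the Linux server; afterwards the search service can be reached directly by IP and port.

[xjh@VM-12-10-centos boost_searcher]$ nohup ./http_server &

The command creates a nohup.out file, which is where the log output goes.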

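Directions for extending the project

Build a whole-site search, although this places higher demands on the server's resources.
Design an online update scheme using signals and a crawler to complete the server design:
signals trigger a periodic rebuild of the forward and inverted indexes, and the crawler fetches the relevant pages.
Replace the third-party components with your own designs,
for example by writing your own http service, or by putting a server such as Nginx in front.
Add paid ranking to the search engine.
Add hot-word statistics and smart keyword suggestions (trie, priority queue).
Add login and registration, backed by MySQL.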

[C++ Project] Boost search engine project
