Win7 x64: building a search-engine web app with Spring Boot + Elasticsearch + Redis + MySQL + MyBatis -- crawling IThome hot comments (Part 2)

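What a pain: I had finished writing this, but after publishing I found that only the first half had gone out and the rest had vanished, presumably because the post was too long. Worse, I had not saved a copy, so I have to write the second half all over again.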

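Picking up from the previous post, "Win7 x64: building a search-engine web app with Spring Boot + Elasticsearch + Redis + MySQL + MyBatis -- crawling IThome hot comments (Part 1)", let's continue.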

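The raw hot-comment response for a single article, shown in the previous post, is enough to make anyone's eyes glaze over; if you could actually read it, hats off to you. I later spent quite some time tidying up those two examples, and the result looks roughly like this: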

##https://dyn.ithome.com/ithome/getajaxdata.aspx?newsID=322635&pid=2&type=hotcomment

{
	"html":"
		<li class=\"entry\" cid=\"27429137\" 	nid=\"322635\">
			<div class=\"adiv\">
				<div><a title=\"软媒通行证数字ID：1341046\" target=\"_blank\" href=\"http://quan.ithome.com/user/1341046\"><img class=\"headerimage\" onerror=\"this.src=\u0027//img.ithome.com/images/v2.1/noavatar.png\u0027\" src=\"//avatar.ithome.com/avatars/001/34/10/46_60.jpg\"/>	</a></div>
				<div class=\"level\"><span>Lv.8</span></div>
			</div>
			<div>
				<div class=\"info rmp\"><strong class=\"p_floor\">14楼</strong>	
					<div class=\"nmp\">	<span class=\"nick\">	<a title=\"软媒通行证数字ID：1341046\" target=\"_blank\" href=\"http://quan.ithome.com/user/1341046\">zuk常成</a></span><span class=\"mobile android\"><a target=\"_blank\" title=\"App版本：v5.60\" href=\"//m.ithome.com/ithome/download/\">三星 Galaxy S7 edge</a></span><span class=\"posandtime\">IT之家江苏无锡网友\u0026nbsp;2017-8-23 14:35:29</span></div>
				</div>
				<p>工信部安卓统一推送呢？现在的安卓就是不如iOS。最起码国内的不行，难道用个手机天天翻墙？</p>
				<div class=\"zhiChi\"><div class=\"l\">	<span class=\"comm_reply\"><a class=\"comment_co\">展开(69)<img src=\"//img.ithome.com/images/v2.3/arrow.png\"></a></span></div> 
				<div class=\"r\">	<span class=\"comm_reply\">	<a class=\"s\" id=\"hotagree27429137\" href=\"javascript:hotCommentVote(27429137,1)\">支持(238)</a><a class=\"a\" id=\"hotagainst27429137\" href=\"javascript:hotCommentVote(27429137,2)\">反对(11)</a>	</span></div>
			</div>
		</li>
		...
		<li class=\"entry\" cid=\"27429137\" nid=\"322635\">
			<div class=\"adiv\">
				<div><a title=\"软媒通行证数字ID：789770\" target=\"_blank\" href=\"http://quan.ithome.com/user/789770\"><img class=\"headerimage\" onerror=\"this.src=\u0027//img.ithome.com/images/v2.1/noavatar.png\u0027\" src=\"//avatar.ithome.com/avatars/000/78/97/70_60.jpg\"/></a></div>
				<div class=\"level\"><span>Lv.30</span></div>
			</div>
			<div>
				<div class=\"info rmp\"><strong class=\"p_floor\">14楼8#</strong>
					<div class=\"nmp\"><span class=\"nick\"><a title=\"软媒通行证数字ID：789770\" target=\"_blank\" href=\"http://quan.ithome.com/user/789770\">wp咸蛋</a></span><span class=\"mobile android\"><a target=\"_blank\" title=\"App版本：v5.60\" href=\"//m.ithome.com/ithome/download/\">小米 6</a></span><span class=\"posandtime\">IT之家河北承德网友\u0026nbsp;2017-8-23 15:05:43</span></div>
			</div>
			<p><span>回复6# belie97：</span>QQ微信不支持mipush。。。。</p>
			<div class=\"zhiChi\">
				<div class=\"l\"><span class=\"comm_reply\"><a class=\"comment_co\">展开(69)<img src=\"//img.ithome.com/images/v2.3/arrow.png\"></a></span></div> 
				<div class=\"r\"><span class=\"comm_reply\"><a class=\"s\" id=\"hotagree27430293\" href=\"javascript:hotCommentVote(27430293,1)\">支持(11)</a><a class=\"a\" id=\"hotagainst27430293\" href=\"javascript:hotCommentVote(27430293,2)\">反对(1)</a></span></div>
			</div>
		</li>
	",
	"db":true
}

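That is much easier on the eyes. For the code that handles the unicode escapes (such as \u0027 and \u0026), see our earlier article. The comments arrive as an HTML fragment inside the html field rather than as structured JSON fields, so we have to parse them by brute force. My guess is that the complete information for each comment sits between <li> and </li>, so we can split on that, then use substring to cut the data out piece by piece and extract the fields we need. The required fields are: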

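    private Long id; // hot-comment record ID
    private String commentId;
    private String user; // user name
    private String comment; // comment text
    private int up; // number of upvotes ("support")
    private int down; // number of downvotes ("oppose")
    private String posandtime; // location and time
    private String mobile; // device
    private String articleUrl; // URL of the source article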
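
The split-and-substring approach described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual parser: the class name HotCommentSketch and its helpers are hypothetical, it assumes the html string has already been JSON-decoded (so quotes are plain " rather than \"), and it only fills a subset of the fields. The vote counts are read from the hotagree/hotagainst element ids seen in the sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class HotCommentSketch {

    // Holder for a subset of the fields listed above.
    static class HotComment {
        String commentId;
        String user;
        String comment;
        int up;
        int down;
    }

    // Text between the first `start` and the next `end`, or "" if either is missing.
    static String between(String s, String start, String end) {
        int from = s.indexOf(start);
        if (from < 0) return "";
        from += start.length();
        int to = s.indexOf(end, from);
        return to < 0 ? "" : s.substring(from, to);
    }

    // Number in parentheses inside the link whose id starts with idPrefix; 0 if absent.
    static int count(String s, String idPrefix) {
        Pattern p = Pattern.compile("id=\"" + idPrefix + "\\d+\"[^>]*>[^(<]*\\((\\d+)\\)");
        Matcher m = p.matcher(s);
        return m.find() ? Integer.parseInt(m.group(1)) : 0;
    }

    static List<HotComment> parse(String html) {
        List<HotComment> result = new ArrayList<>();
        // Assumption: every comment lives in its own <li class="entry" ...> element.
        for (String chunk : html.split("<li class=\"entry\"")) {
            if (!chunk.contains("cid=\"")) continue; // skip text before the first <li>
            HotComment c = new HotComment();
            c.commentId = between(chunk, "cid=\"", "\"");
            // The nick span wraps <a ...>name</a>; keep the text after the tag's '>'.
            String nick = between(chunk, "<span class=\"nick\">", "</a>");
            c.user = nick.substring(nick.lastIndexOf('>') + 1).trim();
            // Drop nested tags (for example the "reply to" span) from the body.
            c.comment = between(chunk, "<p>", "</p>").replaceAll("<[^>]+>", "").trim();
            c.up = count(chunk, "hotagree");
            c.down = count(chunk, "hotagainst");
            result.add(c);
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<li class=\"entry\" cid=\"1\" nid=\"9\">"
                + "<span class=\"nick\"><a href=\"#\">alice</a></span><p>hello</p>"
                + "<a class=\"s\" id=\"hotagree1\" href=\"#\">up(3)</a></li>";
        for (HotComment c : parse(html)) {
            System.out.println(c.commentId + " " + c.user + " " + c.comment + " " + c.up);
        }
    }
}
```

String slicing like this is fragile: any change to the page markup will break it, so an HTML parser would be the sturdier choice if the crawler is meant to last.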

7. MySQL configuration

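While running the crawler we hit characters such as emoji sent from the mobile apps. MySQL's utf8 character set stores at most 3 bytes per character, which is enough for ordinary text but cannot hold emoji and other characters that need 4 bytes in UTF-8. The error looks like this: Incorrect string value: '\xF0\x9F\x98\x82\xF0\x9F...' for column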
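
To see where the 4 bytes come from, here is a small sketch. The class and method names are made up for illustration. It prints the UTF-8 bytes of U+1F602, the emoji behind the \xF0\x9F\x98\x82 in the error above, and flags text that contains such characters. It relies on the fact that code points above U+FFFF always take 4 bytes in UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class Utf8WidthSketch {

    // Renders the UTF-8 bytes of the text in the \xNN style MySQL uses in its error message.
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // True when the text has a code point above U+FFFF, which needs 4 bytes in UTF-8
    // and therefore does not fit MySQL's 3-byte utf8 character set.
    static boolean needsUtf8mb4(String text) {
        return text.codePoints().anyMatch(cp -> cp > 0xFFFF);
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F602));
        System.out.println(utf8Hex(emoji));
        System.out.println(needsUtf8mb4(emoji));
        System.out.println(needsUtf8mb4("plain text"));
    }
}
```

A check like needsUtf8mb4 could also be used to filter or log such comments before inserting them, if changing the database character set is not an option.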

To fix this, edit the MySQL configuration file: my.ini on Windows (my.cnf on Linux). The changes are the same on both:

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

Change the corresponding columns in the database to utf8mb4_general_ci.

In the project's database connection URL, remove characterEncoding=utf-8. This step must not be skipped.

8. Elasticsearch and Redis configuration

#server port
server:
  port: 8081

spring:
  data:
  ##Elasticsearch configuration
    elasticsearch:
      cluster-name: oldkingESCluster
      cluster-nodes: 127.0.0.1:9300
  ##Redis configuration
  redis:
    database: 0
    host: localhost
    port: 6379
    password: redis
    pool:
      max-active: 15
      max-wait: 1
      max-idle: 0
    timeout: 0
  ##FreeMarker configuration
  freemarker:
  ##whether request/session attributes may override model attributes of the same name
    allow-request-override: false
    allow-session-override: false
    cache: true
    check-template-location: true
    content-type: text/html
  ##whether to expose request attributes to the templates
    expose-request-attributes: false
    expose-session-attributes: false
    expose-spring-macro-helpers: false
    suffix: .ftl
    template-loader-path: classpath:/templates/
    request-context-attribute: request
    settings:
      classic_compatible: true
      locale: zh_CN
      date_format: yyyy-MM-dd
      time_format: HH:mm:ss
      datetime_format: yyyy-MM-dd HH:mm:ss

This part is fairly simple and should be self-explanatory.

9. Project notes

The project is fairly simple and there is not much code, thanks to Spring Boot and MyBatis.

There are three main packages: source, tests, and web pages.

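I put the crawling process in the test package; the data is stored in MySQL and the index in Redis. Because the tests run as part of the build, the comments are crawled whenever you build the project. How many pages it crawls is up to you: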

    @Test
    public void getArticleLinks() throws Exception {
        // Originally i started at 0; I changed it to 1 because page=0 in the URL returns no content.
        // Raise the upper bound to crawl more pages.
        for (int i = 1; i < 2; i++) {
            String url = "https://www.ithome.com/ithome/getajaxdata.aspx?"
                    + "page=" + i + "&type=indexpage&randnum=" + Math.random();
            String src = webCrawler.getHtmlSrc(url, "utf-8");
            List<String> list = webCrawler.getArticleLinks(src);
            for (String link : list) {
                // System.out.println(link);
                webCrawler.parseAndSaveHotComments(link);
            }
        }
    }
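
The two webCrawler steps used above are not shown in this post, so here is a rough sketch of what they might look like. Everything here is an assumption made for illustration: the class ArticleLinkSketch, the idea that article links end in /<newsID>.htm, and the reuse of the hot-comment URL from earlier in this post (including its pid=2 parameter, copied as-is).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the project's webCrawler.
class ArticleLinkSketch {

    // Assumption: article links look like href=".../<digits>.htm".
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+/\\d+\\.htm)\"");
    private static final Pattern NEWS_ID = Pattern.compile("/(\\d+)\\.htm$");

    // Collects article links from a page of HTML, dropping duplicates but keeping order.
    static List<String> articleLinks(String html) {
        Set<String> links = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return new ArrayList<>(links);
    }

    // Builds the hot-comment AJAX URL for an article link, following the URL shown earlier.
    static String hotCommentUrl(String articleLink) {
        Matcher m = NEWS_ID.matcher(articleLink);
        if (!m.find()) {
            throw new IllegalArgumentException("no news ID in " + articleLink);
        }
        return "https://dyn.ithome.com/ithome/getajaxdata.aspx?newsID="
                + m.group(1) + "&pid=2&type=hotcomment";
    }

    public static void main(String[] args) {
        String html = "<a href=\"https://www.ithome.com/html/it/322635.htm\">news</a>";
        for (String link : articleLinks(html)) {
            System.out.println(link + " -> " + hotCommentUrl(link));
        }
    }
}
```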

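After a successful build, run the project and open localhost:8081/ithome/index in a browser to see the search page. The search function still has a few problems, which I will fix later.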

10. Future work

(1) Search experience

(2) Visualization of statistics

(3) Index optimization

Appendix:

(1) Complete project (NetBeans)

(2) elasticsearch 2.4.6

(3) redis 3.1

(4) treeNMS

(5) nodejs
