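How do you parse JSON data in a Java web crawler? The libraries commonly used to work with JSON in Java are org.json, Gson, and Fastjson. This post walks through how to handle a crawler response whose data comes back as JSON.
Crawlers run into JSON all the time. When the page we request wraps its JSON in something else, the response has to be preprocessed into standard JSON first. For example, it may look like this: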
jQuery18305886476962892728_1531402823026({
"id":"07",
"language": "C++",
"edition": "second",
"author": "E.Balagurusamy"
})
A string like this, which contains JSON, needs preprocessing (trimming off the head and the tail). In Java, the string above can be handled as follows:
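// Build the JSONP string
String json = "jQuery18305886476962892728_1531402823026({\"id\":\"07\",\"language\": \"C++\",\"edition\": \"second\",\"author\": \"E.Balagurusamy\"})";
// Trim the head and tail: keep what lies between the first "(" and the last ")".
// Splitting on "(" would cut the payload short if a value itself contained "(".
String arr = json.substring(json.indexOf('(') + 1, json.lastIndexOf(')'));
System.out.println(arr);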
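The trimming step can also be written as a small reusable helper. The following is only a sketch: the class name JsonpUnwrap and the method unwrap are hypothetical names introduced here, and the regular expression assumes the wrapper has the shape callbackName( ... ) with an optional trailing semicolon.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original project.
final class JsonpUnwrap {
    // Callback name, "(", payload, ")" and an optional trailing semicolon.
    // The greedy (.*) with DOTALL runs up to the LAST ")", so parentheses
    // inside the payload do not cut it short.
    private static final Pattern JSONP =
            Pattern.compile("^\\s*[\\w$.]+\\s*\\((.*)\\)\\s*;?\\s*$", Pattern.DOTALL);

    /** Returns the payload of a JSONP response, or the trimmed input if it is not wrapped. */
    static String unwrap(String body) {
        Matcher m = JSONP.matcher(body);
        return m.matches() ? m.group(1).trim() : body.trim();
    }

    public static void main(String[] args) {
        String jsonp = "jQuery18305886476962892728_1531402823026({\"id\":\"07\",\"language\": \"C++\"})";
        System.out.println(unwrap(jsonp));
    }
}
```

Falling back to the trimmed input means the same helper can be used whether or not the URL carries a callback parameter.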
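To check the result, use an online JSON validator.
Converting Java objects to JSON, JSON objects to Java objects, JSON strings to Java objects, and JSON strings to JSON objects are basics that are well covered elsewhere online, so they are not repeated here.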
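Crawler case study
Here is an example against a real site:
Site URL: http://www.haodou.com/recipe/853171/
Step 1: capture the network traffic and find the real URL behind the comments
Open the browser developer tools (F12) and look at the network requests.
Step 2: trim the head and tail, then validate the JSON online at http://www.bejson.com/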
{
"status": 200,
"data": {
"total": 7,
"data": {
"_30376977": {
"CommentId": 30376977,
"ItemId": 853171,
"UserId": 4003739,
"ReplyId": 0,
"Type": 0,
"AtUserId": 0,
"Content": "漂亮美味",
"ImageNum": 0,
"Platform": "iPhone客户端",
"Status": 1,
"SubCommentCnt": 1,
"OpenDataId": "",
"OpenUserName": "yxeg5",
"OpenUserHome": "http:\/\/www.haodou.com\/cook-4003739\/",
"OpenUserAvatar": "http:\/\/avatar1.hoto.cn\/9b\/17\/4003739_70.jpg",
"CreateTime": "2016-02-15 12:22",
"Vip": "<a href=\"http:\/\/www.haodou.com\/recipe\/expert\/apply\" target=\"_blank\"><i class=\"ico12 mod_v\"><\/i><\/a> ",
"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
"LastAct": "<span><span class=\"gray9\">最近发表了话题:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-513793.html\" target=\"_blank\">【第119期】好问豆答:蜜三刀的制作技巧<\/a><\/span>",
"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
"Admin": "non"
},
"_29589112": {
"CommentId": 29589112,
"ItemId": 853171,
"UserId": 9235790,
"ReplyId": 0,
"Type": 0,
"AtUserId": 0,
"Content": "紫菜是干的还是",
"ImageNum": 0,
"Platform": "Android客户端",
"Status": 1,
"SubCommentCnt": 1,
"OpenDataId": "",
"OpenUserName": "喻平凶",
"OpenUserHome": "http:\/\/www.haodou.com\/cook-9235790\/",
"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/4e\/ed\/9235790_70.jpg",
"CreateTime": "2015-12-26 09:36",
"Vip": "",
"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_0\"><\/span> 新手<\/span>",
"LastAct": "<span><span class=\"gray9\">最近发布了菜谱专辑:<\/span> <a href=\"http:\/\/www.haodou.com\/recipe\/album\/9061657\/\" target=\"_blank\">炒饭<\/a><\/span>",
"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
"Admin": "non"
},
"_29407043": {
"CommentId": 29407043,
"ItemId": 853171,
"UserId": 3342562,
"ReplyId": 0,
"Type": 0,
"AtUserId": 0,
"Content": "超市有干贝和海蛎卖?",
"ImageNum": 0,
"Platform": "好豆网",
"Status": 1,
"SubCommentCnt": 1,
"OpenDataId": "",
"OpenUserName": "秋玉的美",
"OpenUserHome": "http:\/\/www.haodou.com\/cook-3342562\/",
"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e2\/00\/3342562_70.jpg",
"CreateTime": "2015-12-05 15:54",
"Vip": "",
"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_1\"><\/span> 豆芽<\/span>",
"LastAct": "",
"PlatformUrl": "http:\/\/www.haodou.com\/",
"Admin": "non"
},
"_28188378": {
"CommentId": 28188378,
"ItemId": 853171,
"UserId": 8008371,
"ReplyId": 0,
"Type": 0,
"AtUserId": 0,
"Content": "干贝虾米一般都是咸的,要用水多泡会,泡软",
"ImageNum": 0,
"Platform": "Android客户端",
"Status": 1,
"SubCommentCnt": 1,
"OpenDataId": "",
"OpenUserName": "月上荒城6",
"OpenUserHome": "http:\/\/www.haodou.com\/cook-8008371\/",
"OpenUserAvatar": "http:\/\/avatar1.hoto.cn\/b3\/32\/8008371_70.jpg",
"CreateTime": "2015-07-09 12:51",
"Vip": "",
"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_1\"><\/span> 豆芽<\/span>",
"LastAct": "",
"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
"Admin": "non"
},
"_27165505": {
"CommentId": 27165505,
"ItemId": 853171,
"UserId": 3837,
"ReplyId": 0,
"Type": 0,
"AtUserId": 0,
"Content": "食材丰富--口感也丰富!",
"ImageNum": 0,
"Platform": "好豆网",
"Status": 1,
"SubCommentCnt": 3,
"OpenDataId": "",
"OpenUserName": "爱跳舞的老太",
"OpenUserHome": "http:\/\/www.haodou.com\/cook-3837\/",
"OpenUserAvatar": "http:\/\/avatar1.hoto.cn\/fd\/0e\/3837_70.jpg",
"CreateTime": "2015-02-26 09:42",
"Vip": "<a href=\"http:\/\/www.haodou.com\/recipe\/expert\/apply\" target=\"_blank\"><i class=\"ico12 mod_v\"><\/i><\/a> ",
"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
"LastAct": "<span><span class=\"gray9\">最近发表了话题:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-556709.html\" target=\"_blank\">【深秋食语】在朋友单位吃午餐<\/a><\/span>",
"PlatformUrl": "http:\/\/www.haodou.com\/",
"Admin": "non"
},
"_30383571": {
"CommentId": 30383571,
"ItemId": 853171,
"UserId": 489704,
"ReplyId": 30376977,
"Type": 0,
"AtUserId": 4003739,
"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-4003739\/\" target=\"_blank\">yxeg5<\/a> 感谢你的分享。",
"ImageNum": 0,
"Platform": "Android客户端",
"Status": 1,
"SubCommentCnt": 0,
"OpenDataId": "",
"OpenUserName": "挪红",
"OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/",
"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg",
"CreateTime": "2016-02-15 21:39",
"Vip": "",
"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
"LastAct": "<span><span class=\"gray9\">最近发表了话题:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【寻找温暖】港仔后请客,品沙县小吃<\/a><\/span>",
"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
"Admin": "non"
},
"_29596058": {
"CommentId": 29596058,
"ItemId": 853171,
"UserId": 489704,
"ReplyId": 29589112,
"Type": 0,
"AtUserId": 9235790,
"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-9235790\/\" target=\"_blank\">喻平凶<\/a> 是干的,要冲洗一下。",
"ImageNum": 0,
"Platform": "Android客户端",
"Status": 1,
"SubCommentCnt": 0,
"OpenDataId": "",
"OpenUserName": "挪红",
"OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/",
"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg",
"CreateTime": "2015-12-26 23:15",
"Vip": "",
"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
"LastAct": "<span><span class=\"gray9\">最近发表了话题:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【寻找温暖】港仔后请客,品沙县小吃<\/a><\/span>",
"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
"Admin": "non"
},
"_29407675": {
"CommentId": 29407675,
"ItemId": 853171,
"UserId": 489704,
"ReplyId": 29407043,
"Type": 0,
"AtUserId": 3342562,
"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-3342562\/\" target=\"_blank\">秋玉的美<\/a> 商店里有网上也有。",
"ImageNum": 0,
"Platform": "Android客户端",
"Status": 1,
"SubCommentCnt": 0,
"OpenDataId": "",
"OpenUserName": "挪红",
"OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/",
"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg",
"CreateTime": "2015-12-05 17:11",
"Vip": "",
"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
"LastAct": "<span><span class=\"gray9\">最近发表了话题:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【寻找温暖】港仔后请客,品沙县小吃<\/a><\/span>",
"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
"Admin": "non"
},
"_28189130": {
"CommentId": 28189130,
"ItemId": 853171,
"UserId": 489704,
"ReplyId": 28188378,
"Type": 0,
"AtUserId": 8008371,
"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-8008371\/\" target=\"_blank\">月上荒城6<\/a> 我买的这种不是那种很硬的,很多盐的,要根据情况而定。",
"ImageNum": 0,
"Platform": "Android客户端",
"Status": 1,
"SubCommentCnt": 0,
"OpenDataId": "",
"OpenUserName": "挪红",
"OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/",
"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg",
"CreateTime": "2015-07-09 15:19",
"Vip": "",
"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
"LastAct": "<span><span class=\"gray9\">最近发表了话题:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【寻找温暖】港仔后请客,品沙县小吃<\/a><\/span>",
"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
"Admin": "non"
},
"_27729797": {
"CommentId": 27729797,
"ItemId": 853171,
"UserId": 489704,
"ReplyId": 27165505,
"Type": 0,
"AtUserId": 7566907,
"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-7566907\/\" target=\"_blank\">haodou8704818142<\/a> 我在厦门,漳州吃的,每一次都不是不一样的。都有紫菜",
"ImageNum": 0,
"Platform": "Android客户端",
"Status": 1,
"SubCommentCnt": 0,
"OpenDataId": "",
"OpenUserName": "挪红",
"OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/",
"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg",
"CreateTime": "2015-05-07 01:53",
"Vip": "",
"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
"LastAct": "<span><span class=\"gray9\">最近发表了话题:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【寻找温暖】港仔后请客,品沙县小吃<\/a><\/span>",
"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
"Admin": "non"
},
"_27727527": {
"CommentId": 27727527,
"ItemId": 853171,
"UserId": 7566907,
"ReplyId": 27165505,
"Type": 0,
"AtUserId": 489704,
"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-489704\/\" target=\"_blank\">挪红<\/a> 和我们的配料不一样",
"ImageNum": 0,
"Platform": "Android客户端",
"Status": 1,
"SubCommentCnt": 0,
"OpenDataId": "",
"OpenUserName": "haodou8704818142",
"OpenUserHome": "http:\/\/www.haodou.com\/cook-7566907\/",
"OpenUserAvatar": "http:\/\/avatar1.hoto.cn\/3b\/76\/7566907_70.jpg",
"CreateTime": "2015-05-06 19:23",
"Vip": "",
"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_0\"><\/span> 新手<\/span>",
"LastAct": "",
"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
"Admin": "non"
},
"_27166153": {
"CommentId": 27166153,
"ItemId": 853171,
"UserId": 489704,
"ReplyId": 27165505,
"Type": 0,
"AtUserId": 3837,
"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-3837\/\" target=\"_blank\">爱跳舞的老太<\/a> 姐是这儿的人,不知我这样做对吗?",
"ImageNum": 0,
"Platform": "好豆网",
"Status": 1,
"SubCommentCnt": 0,
"OpenDataId": "",
"OpenUserName": "挪红",
"OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/",
"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg",
"CreateTime": "2015-02-26 11:26",
"Vip": "",
"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
"LastAct": "<span><span class=\"gray9\">最近发表了话题:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【寻找温暖】港仔后请客,品沙县小吃<\/a><\/span>",
"PlatformUrl": "http:\/\/www.haodou.com\/",
"Admin": "non"
}
},
"avatar": "",
"page_nav": "<a href='javaScript:;' page='1' id='' class='cur'>1<\/a><a href='javaScript:;' page='2' id=''>2<\/a><span class='next'><a href='javaScript:;' page='2' id='' class='next'>下一页<\/a><\/span>",
"more": null,
"offset": 0
},
"message": ""
}
Step 3: pick the fields you need from the response data and wrap them in a JavaBean
package com.jack.spiderone.entity;

import lombok.Data;

/**
 * create by jack 2018/11/18
 *
 * @author jack
 * @date: 2018/11/18 11:26
 * @Description:
 */
@Data
public class CommentModel {
    // Comment ID
    private String CommentId;
    // ID of the recipe being commented on
    private String ItemId;
    // Comment text
    private String Content;
    // Time the comment was posted
    private String CreateTime;
    // Display name of the comment's author
    private String OpenUserName;
}
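Step 4:
Use HttpClient or another HTTP request tool to fetch the string behind the real URL. Then trim the head and tail of that string in code so that it becomes JSON that is easy to parse (regular expressions are often used for this).
Code: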
package com.jack.spiderone.service;

import com.alibaba.fastjson.JSONObject;
import com.jack.spiderone.entity.CommentModel;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.util.List;

/**
 * create by jack 2018/11/18
 *
 * @author jack
 * @date: 2018/11/18 11:35
 * @Description:
 */
public class CookBookSpider {
    /**
     * Fetches the response body behind a URL as a string.
     * @param url the URL to request
     * @return the raw response body
     */
    public static String getJson(String url) throws IOException {
        // try-with-resources closes the client and the response, even on failure
        try (CloseableHttpClient httpClient = HttpClients.custom().build();
             // Send the GET request
             CloseableHttpResponse response = httpClient.execute(new HttpGet(url))) {
            // Read the body as a string (the charset has to be given explicitly)
            return EntityUtils.toString(response.getEntity(), "gbk");
        }
    }
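    /**
     * Parses the raw response into a list of comment objects.
     * @param jsonStr the raw response body
     * @return the parsed comments
     */
    public static List<CommentModel> parseData(String jsonStr) {
        // Convert Unicode escapes to Chinese characters
        jsonStr = decode(jsonStr);
        // Take the text after the second "data":{ and before "avatar", then drop the
        // "_<id>": keys. What remains is the comment objects followed by a trailing "},"
        String body = jsonStr.split("data\":\\{")[2].split("\"avatar")[0].replaceAll("\"_\\d+\":", "");
        // Remove the trailing "}," and wrap the objects in brackets to form a JSON array
        String jsonArray = "[" + body.substring(0, body.length() - 2) + "]";
        // Parse the JSON array into a list of objects
        return JSONObject.parseArray(jsonArray, CommentModel.class);
    }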
    public static void spiderCookBook() throws IOException {
        // URL to crawl
        String url = "http://www.haodou.com/comment.php?do=list&callback=jQuery18304706379730622201_1542510303429&channel=recipe&item=853171&sort=desc&page=1&size=5&comment_id=0&cate=0&purify=common&_=1542510303816";
        // Fetch the response
        String jsonstring = getJson(url);
        // Parse it into comment objects
        List<CommentModel> datalist = parseData(jsonstring);
        // Print the results
        for (CommentModel comm : datalist) {
            System.out.println(comm.getCommentId() + "\t" + comm.getItemId() + "\t" + comm.getContent());
        }
    }
    /**
     * Converts Unicode escape sequences in a string to the characters they stand for.
     * @param unicodeStr text that may contain Unicode escapes
     * @return the decoded text, or null if the input is null
     */
    public static String decode(String unicodeStr) {
        if (unicodeStr == null) {
            return null;
        }
        StringBuilder retBuf = new StringBuilder();
        int maxLoop = unicodeStr.length();
        for (int i = 0; i < maxLoop; i++) {
            char c = unicodeStr.charAt(i);
            // An escape is a backslash, then 'u' or 'U', then four hex digits
            boolean escape = c == '\\' && i < maxLoop - 5
                    && (unicodeStr.charAt(i + 1) == 'u' || unicodeStr.charAt(i + 1) == 'U');
            if (escape) {
                try {
                    retBuf.append((char) Integer.parseInt(unicodeStr.substring(i + 2, i + 6), 16));
                    i += 5;
                    continue;
                } catch (NumberFormatException e) {
                    // Not a valid escape: fall through and keep the backslash as-is
                }
            }
            retBuf.append(c);
        }
        return retBuf.toString();
    }

    public static void main(String[] args) throws IOException {
        spiderCookBook();
    }
}
Running the program prints:
30376977 853171 漂亮美味
29589112 853171 紫菜是干的还是
29407043 853171 超市有干贝和海蛎卖?
28188378 853171 干贝虾米一般都是咸的,要用水多泡会,泡软
27165505 853171 食材丰富--口感也丰富!
30383571 853171 @<a href="http://www.haodou.com/cook-4003739/" target="_blank">yxeg5</a> 感谢你的分享。
29596058 853171 @<a href="http://www.haodou.com/cook-9235790/" target="_blank">喻平凶</a> 是干的,要冲洗一下。
29407675 853171 @<a href="http://www.haodou.com/cook-3342562/" target="_blank">秋玉的美</a> 商店里有网上也有。
28189130 853171 @<a href="http://www.haodou.com/cook-8008371/" target="_blank">月上荒城6</a> 我买的这种不是那种很硬的,很多盐的,要根据情况而定。
27729797 853171 @<a href="http://www.haodou.com/cook-7566907/" target="_blank">haodou8704818142</a> 我在厦门,漳州吃的,每一次都不是不一样的。都有紫菜
27727527 853171 @<a href="http://www.haodou.com/cook-489704/" target="_blank">挪红</a> 和我们的配料不一样
27166153 853171 @<a href="http://www.haodou.com/cook-3837/" target="_blank">爱跳舞的老太</a> 姐是这儿的人,不知我这样做对吗?
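Note that this endpoint returns Chinese text as Unicode escapes, so the code above converts them to Chinese characters before working on the string. (A JSON parser such as Fastjson also decodes these escapes on its own when it reads a string value.) Readers may also wonder: normally we only know a recipe's ID (http://www.haodou.com/recipe/853171/), that is, 853171, so how do we proceed from there?
The captured URL contains &callback=jQuery183016721538977115902_1531563599327. How should this string be constructed? And where does the other string, &_=1531563599599, come from? While capturing traffic you will notice that both strings change from request to request; they are generated by the front-end JavaScript. We can, however, simply drop both of them from the captured URL, which gives: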
http://www.haodou.com/comment.php?do=list&channel=recipe&item=853171&sort=desc&page=1&size=5&comment_id=0&cate=0&purify=common
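Requesting this address also returns the data, and what comes back is standard JSON. Given another recipe's ID (http://www.haodou.com/recipe/344953/), that is, 344953, the URL for its comments can be built by the same pattern: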
http://www.haodou.com/comment.php?do=list&channel=recipe&item=344953&sort=desc&page=1&size=5&comment_id=0&cate=0&purify=common
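Dropping the two dynamic parameters from a captured URL can be done in code rather than by hand. This is a minimal sketch using only the standard library; the class QueryParams and the method without are hypothetical names, and the sketch assumes a plain query string with no fragment and no need for URL decoding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original project.
final class QueryParams {
    /** Removes the named query parameters from a URL, keeping the others in their original order. */
    static String without(String url, String... names) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // no query string, nothing to remove
        }
        List<String> kept = new ArrayList<>();
        for (String pair : url.substring(q + 1).split("&")) {
            String key = pair.split("=", 2)[0];
            boolean drop = false;
            for (String name : names) {
                if (name.equals(key)) {
                    drop = true;
                }
            }
            if (!drop && !pair.isEmpty()) {
                kept.add(pair);
            }
        }
        String base = url.substring(0, q);
        return kept.isEmpty() ? base : base + "?" + String.join("&", kept);
    }

    public static void main(String[] args) {
        String captured = "http://www.haodou.com/comment.php?do=list"
                + "&callback=jQuery18304706379730622201_1542510303429&channel=recipe&item=853171"
                + "&sort=desc&page=1&size=5&comment_id=0&cate=0&purify=common&_=1542510303816";
        System.out.println(without(captured, "callback", "_"));
    }
}
```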
If the comments span several pages, loop over the page field in the URL above to fetch each page. For example, the second page of comments for recipe 344953 is at:
http://www.haodou.com/comment.php?do=list&channel=recipe&item=344953&sort=desc&page=2&size=5&comment_id=0&cate=0&purify=common
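The URL pattern above can be captured in a small builder so that a crawler only has to supply a recipe ID and a page number. This is a sketch: the class CommentUrls and the method commentUrl are hypothetical names, the fixed parameters are copied from the URLs shown above, and deciding when to stop paging is left to the caller.

```java
// Hypothetical helper, not part of the original project.
final class CommentUrls {
    private static final String BASE = "http://www.haodou.com/comment.php";

    /** Builds the comment-list URL for one recipe and one page (pages start at 1). */
    static String commentUrl(long itemId, int page) {
        return BASE + "?do=list&channel=recipe&item=" + itemId
                + "&sort=desc&page=" + page
                + "&size=5&comment_id=0&cate=0&purify=common";
    }

    public static void main(String[] args) {
        // Print the URLs of the first two comment pages for recipe 344953
        for (int page = 1; page <= 2; page++) {
            System.out.println(commentUrl(344953, page));
        }
    }
}
```

Each generated URL can then be passed to getJson and parseData in the same way as the single URL in spiderCookBook.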