Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
The two key ingredients: 1. a large set of Weibo UIDs; 2. a way to handle Weibo's anti-crawler measures.
Part One: Preparation
1. Get a cookie for the Weibo web pages
Open https://m.weibo.cn/ in Chrome.
Press F12 to open the developer tools.
Copy the Cookie value from the request headers (shown in the original screenshot, omitted here) — that string is the cookie we need.
2. With the cookie in hand, the next step is writing code that imitates a browser request
/**
 * Generic GET based on HttpClient 4.3, sending the Weibo cookie.
 * @param url    the URL to request
 * @param cookie the cookie string captured from the browser
 * @return the response body
 */
public static String get_byCookie(String url, String cookie) {
    if (CheckUtil.checkNull(cookie)) {
        cookie = "SCF=AjGxj6fuG*****00174"; // fallback: the (very long) cookie you just captured
    }
    CloseableHttpClient client = HttpClients.createDefault();
    String responseText = "";
    CloseableHttpResponse response = null;
    try {
        HttpGet method = new HttpGet(url);
        method.addHeader(new BasicHeader("Cookie", cookie));
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(20 * 1000) // connection timeout, in milliseconds
                .build();
        method.setConfig(config);
        response = client.execute(method);
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            responseText = EntityUtils.toString(entity, ENCODING);
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (response != null) { // avoid an NPE when the request itself failed
                response.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    return responseText;
}
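The cookie fallback above relies on a `CheckUtil.checkNull` helper that the post never shows. Here is a minimal sketch of what such a helper might look like — the class and method names come from the call site, but the exact semantics ("blank counts as null") are an assumption:

```java
// Hypothetical sketch of the CheckUtil helper referenced by get_byCookie.
// Assumption: checkNull returns true when the string carries no usable value.
public class CheckUtil {

    /** Returns true if the string is null, empty, or whitespace-only. */
    public static boolean checkNull(String s) {
        return s == null || s.trim().isEmpty();
    }
}
```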
3. Some readers will be itching to try the method above right away — but on its own it is not enough. Weibo adds one more anti-crawler hurdle here: the page content comes back inside `<script>` tags as `FM.view(...)` calls, so it needs extra processing. This step is the core of the whole approach.
/**
 * Scrapes a user's Weibo posts out of the FM.view(...) script blocks.
 * @param uid    the Weibo user id
 * @param cookie the cookie string captured from the browser
 * @return the user's original posts from the last six months
 */
public static Result<List<MvcWeiboReptile>> get_js_html_byuid(String uid, String cookie) {
    // Default: no data
    Result<List<MvcWeiboReptile>> result = new Result<List<MvcWeiboReptile>>();
    result.setType(TypeEnum.FAIL.getCode());
    result.setMessage("no data");
    List<MvcWeiboReptile> weiboReptileList = new ArrayList<>();
    String page = smsUtil.get_byCookie("https://weibo.com/u/" + uid, cookie);
    if (StringUtils.isNotEmpty(page)) {
        Document document = Jsoup.parse(page);
        Elements scripts = document.select("script");
        for (Element script : scripts) {
            // Each FM.view(...) call carries a JSON payload whose "html" field
            // holds an escaped HTML fragment of the real page
            String[] ss = script.html().split("<script>FM.view");
            StringBuffer stringBuffer = new StringBuffer();
            for (String x : ss) {
                if (x.contains("\"html\":\"")) {
                    stringBuffer.append(getHtml(x));
                }
            }
            document = Jsoup.parse(stringBuffer.toString());
            Elements WB_details = document.getElementsByClass("WB_detail");
            for (Element WB_detail : WB_details) {
                Elements WB_infos = WB_detail.getElementsByClass("WB_info");
                if (WB_infos.size() == 1) {
                    for (Element WB_info : WB_infos) {
                        if (WB_info.html().contains(uid)) { // skip entries whose uid does not match
                            Elements WB_text = WB_detail.getElementsByClass("WB_text");
                            Elements WB_from = WB_detail.getElementsByClass("WB_from S_txt2");
                            String text = WB_text.html();
                            String time = WB_from.get(0).getElementsByTag("a").attr("title");
                            Date time_date = DateUtils.parseTimesTampDate(time + ":00");
                            // Keep only posts from the last six months,
                            // and skip bare reposts ("转发微博" means "reposted Weibo")
                            if (time_date.after(DateUtils.getBeginDayOfLastSixMonth())
                                    && !StringUtils.equals(text, "转发微博")) {
                                MvcWeiboReptile weiboReptile = new MvcWeiboReptile();
                                weiboReptile.setContext(filterEmoji(text));
                                weiboReptile.setCreateTime(time_date);
                                weiboReptileList.add(weiboReptile);
                                result.setType(TypeEnum.SUCCESS.getCode());
                                result.setMessage("data found");
                            }
                        }
                    }
                }
            }
        }
    }
    result.setData(weiboReptileList);
    return result;
}
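The loop above calls a `getHtml` helper that the post does not include. Weibo's desktop pages embed each fragment as `FM.view({"ns":"...","html":"<div>...</div>"})`, so the helper has to pull out the `"html"` field and undo the JSON string escaping. A minimal stdlib-only sketch of that idea — the class name and the assumption that the value ends at the first unescaped double quote are mine, not the author's:

```java
// Hypothetical sketch of the getHtml helper used in get_js_html_byuid.
public class FmViewHtmlExtractor {

    /**
     * Extracts the value of the "html" field from an FM.view(...) payload
     * and undoes the common JSON string escapes (\" \\ \/ \n \t).
     * Assumes the value ends at the first unescaped double quote.
     */
    public static String getHtml(String fmView) {
        String marker = "\"html\":\"";
        int start = fmView.indexOf(marker);
        if (start < 0) {
            return "";
        }
        start += marker.length();
        StringBuilder out = new StringBuilder();
        for (int i = start; i < fmView.length(); i++) {
            char c = fmView.charAt(i);
            if (c == '\\' && i + 1 < fmView.length()) {
                char next = fmView.charAt(++i);
                switch (next) {
                    case 'n':  out.append('\n'); break;
                    case 't':  out.append('\t'); break;
                    case '/':  out.append('/');  break;
                    case '"':  out.append('"');  break;
                    case '\\': out.append('\\'); break;
                    default:   out.append(next); break;
                }
            } else if (c == '"') {
                break; // unescaped quote: end of the JSON string value
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```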
4. After step 3 we have the data we wanted: a list of MvcWeiboReptile objects holding each post's content and publication time.
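`filterEmoji`, called when the content is stored, is another helper the post leaves out. Weibo posts are full of emoji, which MySQL's 3-byte `utf8` charset cannot store, so a common approach is to strip supplementary-plane code points before saving. A sketch under that assumption — the author's actual implementation may differ:

```java
// Hypothetical sketch of the filterEmoji helper used when saving post content.
public class EmojiFilter {

    /**
     * Removes characters outside the Basic Multilingual Plane (emoji live in
     * the supplementary planes), which MySQL's 3-byte utf8 cannot store.
     */
    public static String filterEmoji(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> cp <= 0xFFFF)  // keep BMP characters only
            .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```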
Part Two: Saving the data — write a scheduled job that calls the method above
/**
 * Runs every day at 5 a.m. to scrape Weibo data.
 * Pseudocode.
 * @throws Exception
 */
@Scheduled(cron = "0 0 5 * * ?")
public synchronized void work1() {
    try {
        // 1. The cookie lives in Redis because it expires from time to time,
        //    and Redis makes it easy to swap in a fresh one
        String cookie = redisService.get("cookie_for_weibo");
        // 2. Fetch the Weibo posts for each uid via the GET helper and the cookie
        Result<List<MvcWeiboReptile>> listResult = weiboUtils.get_js_html_byuid(uid(), cookie);
        // 3. ...then persist the result
    } catch (Exception e) {
        e.printStackTrace();
    }
}
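One more project-local helper worth sketching is the six-month cutoff used in step 3 (`DateUtils.getBeginDayOfLastSixMonth`). The idea with `java.time` might look like this — the class name and the exact cutoff semantics (start of the day six months back) are assumptions:

```java
import java.time.LocalDate;
import java.time.LocalDateTime;

// Hypothetical sketch of the six-month cutoff used to filter old posts.
public class DateCutoff {

    /** Start of the day exactly six months before the given date. */
    public static LocalDateTime beginDayOfLastSixMonths(LocalDate today) {
        return today.minusMonths(6).atStartOfDay();
    }

    /** True if the post time falls within the last six months. */
    public static boolean withinLastSixMonths(LocalDateTime postTime, LocalDate today) {
        return postTime.isAfter(beginDayOfLastSixMonths(today));
    }
}
```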
Screenshot of the result (image not included here):