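Introduction

This post is mainly a log of what I have been learning and the problems I have solved along the way.
In most cases it solves the problem that jsoup alone cannot scrape dynamically rendered sites. It uses cdp4j, for which I found relatively little documentation.
Reference used while working on this: 殷天文's "Java爬虫入门篇" (an introduction to Java web crawlers).

I am a beginner myself and borrowed from many experts' articles. This is written mainly as a personal record and is for reference only; please point out any mistakes.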

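Approach

Use cdp4j to drive the locally installed Chrome browser directly and obtain the HTML of the page after it has been rendered.
Then use jsoup to parse that HTML and extract the data we need.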
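
The second step can be tried on its own before a browser is involved. The following is a minimal sketch that assumes the rendered HTML is already available as a string; the table markup, the selectors (tr.draw, td.issue, td.numbers) and the class name RenderedPageParser are made up for illustration and will differ on a real site.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/** Sketch of the jsoup parsing step; the HTML structure and selectors are hypothetical. */
final class RenderedPageParser {

    /** Maps each issue number to its numbers, preserving page order. */
    static Map<String, String> parseDraws(String renderedHtml) {
        Document document = Jsoup.parse(renderedHtml);
        Map<String, String> draws = new LinkedHashMap<>();
        // "tr.draw", "td.issue" and "td.numbers" are illustrative selectors only.
        for (Element row : document.select("tr.draw")) {
            String issue = row.select("td.issue").text();
            String numbers = row.select("td.numbers").text();
            draws.put(issue, numbers);
        }
        return draws;
    }

    public static void main(String[] args) {
        // Stand-in for the HTML string returned by the browser after rendering.
        String renderedHtml = "<table>"
                + "<tr class=\"draw\"><td class=\"issue\">20181112046</td>"
                + "<td class=\"numbers\">15 09 07 04 14 18 03 12</td></tr>"
                + "</table>";
        parseDraws(renderedHtml)
                .forEach((issue, numbers) -> System.out.println(issue + " " + numbers));
    }
}
```

Keeping the parsing in a method that takes a plain string makes it easy to check the selectors against a saved copy of the page, without launching Chrome each time.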

Maven dependency

<!-- cdp4j dependency -->
<dependency>
    <groupId>io.webfolder</groupId>
    <artifactId>cdp4j</artifactId>
    <version>2.2.1</version>
</dependency>

Official documentation

Link: webfolderio/cdp4j on GitHub.

Code

import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import io.webfolder.cdp.Launcher;
import io.webfolder.cdp.session.Session;
import io.webfolder.cdp.session.SessionFactory;

/**
 * Scrapes a dynamically rendered web page with cdp4j.
 *
 * @author longxianhua
 */
public class TestFSD {

    public static void main(String[] args) throws Exception {
        List<String> arguments = new ArrayList<>();
        // Uncomment the next line to run Chrome headless, so that no browser window opens.
        // arguments.add("--headless");
        Launcher launcher = new Launcher();
        // The first argument is the path to the local Chrome executable.
        try (SessionFactory factory = launcher.launch(
                     "C:\\Users\\Administrator\\AppData\\Local\\Google\\Chrome\\Application\\chrome.exe",
                     arguments);
             Session session = factory.create()) {
            // The URL of the page you want to scrape.
            session.navigate("the URL you want to scrape");
            // Wait until the document has finished loading.
            session.waitDocumentReady();
            // Get the rendered HTML.
            String content = session.getContent();
            // Parse it into a jsoup Document.
            Document document = Jsoup.parse(content);
            // Extract the data with a custom parsing method (not shown here).
            DwJSOUP.getDoc(document);
        }
    }
}

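Pitfalls

The Chrome path does not have to be passed to launch(). However, on first use, or when the browser path that cdp4j looks for internally differs from where Chrome is installed on your machine, it may fail with
"chrome not found". I ran into this myself.
The relevant source is shown below; findChrome() is not reproduced here.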

    public final SessionFactory launch(List<String> arguments) {
        return launch(findChrome(), arguments);
    }

    public final SessionFactory launch(String chromeExecutablePath, List<String> arguments){
        if (launched()) {
            return factory;
        }

        if (chromeExecutablePath == null || chromeExecutablePath.trim().isEmpty()) {
            throw new CdpException("chrome not found");
        }

        List<String> list = getCommonParameters(chromeExecutablePath, arguments);

        internalLaunch(list, arguments);

        if (!launched()) {
            int counter = 0;
            final int maxCount = 20;
            while (!launched() && counter < maxCount) {
                try {
                    sleep(500);
                    counter += 1;
                } catch (InterruptedException e) {
                    break;
                }
            }
        }

        // ... (the excerpt stops here; the rest of the method is omitted)
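
One way to avoid hard-coding a single path is to check a few candidate locations yourself and pass the first one that exists to launch(path, arguments). The sketch below is not part of cdp4j: ChromeLocator, firstExisting and the CHROME_PATH environment variable are hypothetical names, and the candidate paths are only common examples that may not match your machine.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: picks the first candidate path that is an existing regular file. */
final class ChromeLocator {

    static Optional<String> firstExisting(List<String> candidates) {
        for (String candidate : candidates) {
            // Skip unset or blank entries, e.g. a missing environment variable.
            if (candidate == null || candidate.trim().isEmpty()) {
                continue;
            }
            Path path = Paths.get(candidate);
            if (Files.isRegularFile(path)) {
                return Optional.of(path.toString());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Example locations only; adjust them to the machine at hand.
        List<String> candidates = Arrays.asList(
                System.getenv("CHROME_PATH"), // hypothetical override, may be null
                "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe",
                "/usr/bin/google-chrome");
        Optional<String> chrome = firstExisting(candidates);
        // A found path would be passed as the first argument of launcher.launch(path, arguments).
        System.out.println(chrome
                .map(path -> "Using Chrome at " + path)
                .orElse("chrome not found; pass the executable path explicitly"));
    }
}
```

Resolving the path up front also means that a missing browser can be reported with a message that lists the locations that were tried, rather than only "chrome not found".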

Result

Lottery issue numbers and the corresponding drawn numbers from a certain website:
20181112046
15 09 07 04 14 18 03 12
20181112047
14 02 03 09 04 11 05 07
20181112048
14 20 11 19 06 07 17 04
20181112049
01 12 14 15 09 04 03 20
20181112050
13 17 20 08 04 03 06 11
20181112051
17 03 15 02 01 16 19 18
20181112052
16 06 01 11 08 04 20 13
20181112053
01 18 15 10 20 08 12 13

Closing

This is for fun and reference only; its practical value is limited.
