Crawling NetEase News Pages with Jsoup + HtmlUnit

1. First, why use HtmlUnit at all? Isn't Jsoup alone enough?

With Jsoup alone, fetching a page is about as simple as it gets:

Document docu1 = Jsoup.connect(url).get();

The code above only works for static pages. On a dynamically rendered page, where the content is injected by JavaScript after the initial HTML loads, the elements you want simply are not in the response Jsoup sees. That is why I turned to HtmlUnit, a headless browser that actually executes the page's scripts.
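A quick way to confirm that a page is dynamic: fetch it once with plain Jsoup and count the elements you expect to find. A minimal sketch, assuming the NetEase news front page and the hot_text class used by the crawler below (both are illustrative):

// Raw HTML only; no JavaScript is executed here.
Document staticDoc = Jsoup.connect("https://news.163.com/")
        .userAgent("Mozilla/5.0")
        .timeout(10000)
        .get();
// If this prints 0 while a real browser clearly shows such elements,
// the content is injected by JavaScript and Jsoup alone cannot see it.
System.out.println(staticDoc.getElementsByClass("hot_text").size());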

The full code is below. The method getHtmlFromUrl(String url) returns a Jsoup Document, so everything Jsoup offers for selecting and extracting content works on the result.

For a detailed explanation of this approach, see the reference article linked at the end.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitUtil {
    public static Document getHtmlFromUrl(String url) throws Exception {
        WebClient webClient = new WebClient();
        // Execute JavaScript so dynamically injected content gets rendered.
        webClient.getOptions().setJavaScriptEnabled(true);
        // CSS and ActiveX are not needed for scraping; disabling them speeds things up.
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setActiveXNative(false);
        // Don't abort on script errors or non-200 status codes; real pages are messy.
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setTimeout(10000);
        try {
            HtmlPage htmlPage = webClient.getPage(url);
            // Give asynchronous scripts up to 10 seconds to finish before reading the DOM.
            webClient.waitForBackgroundJavaScript(10000);
            // Serialize the rendered DOM and hand it to Jsoup for parsing.
            return Jsoup.parse(htmlPage.asXml());
        } finally {
            webClient.close();
        }
    }
}
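Before wiring the utility into the full crawler, it is worth a quick sanity check on its own (the URL here is illustrative):

Document doc = HtmlUnitUtil.getHtmlFromUrl("https://news.163.com/");
// The title should match what a real browser shows once scripts have run.
System.out.println(doc.title());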

In the loop below, url is the address of each page you want to crawl, taken from the list ls.

// ls is assumed to be a List<String> of page URLs to crawl,
// and count an int that tallies the news items found.
for (String url : ls) {
    try {
        Document docu1 = HtmlUnitUtil.getHtmlFromUrl(url);
        // Each news entry on the page carries the class "hot_text".
        Elements lis = docu1.getElementsByClass("hot_text");
        // The section (module) name sits in the list header.
        Elements first_span = docu1.select("#list_wrap > div.list_content > div.area.baby_list_title > h2 > a");
        String moduleName = first_span.isEmpty() ? "" : first_span.first().text();
        for (Element e : lis) {
            if (e.getElementsByTag("a").isEmpty()) {
                continue;
            }
            Element e_a = e.getElementsByTag("a").get(0);
            // News title and its protocol-relative link.
            String title = e_a.text();
            String newsUrl = "http:" + e_a.attr("href");
            count++;
            System.out.println(title + "(" + moduleName + "):" + newsUrl);
        }
    } catch (Exception e1) {
        e1.printStackTrace();
    }
}

Running this prints one line per news item, in the form title(module):url.

The Maven dependencies are as follows (note that Jsoup itself, used throughout, also needs a dependency; it is added at the end):

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.18</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit-core-js</artifactId>
    <version>2.9</version>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>commons-logging</groupId>
    <artifactId>commons-logging-api</artifactId>
    <version>1.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/commons-collections/commons-collections -->
<dependency>
    <groupId>commons-collections</groupId>
    <artifactId>commons-collections</artifactId>
    <version>3.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.5</version>
</dependency>
<!-- Jsoup, used to parse the rendered HTML in all the snippets above. -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
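One caveat: htmlunit already pulls in libraries such as commons-io and commons-logging transitively, so depending on your Maven setup some of the explicit commons-* entries above may be redundant; they are kept here as in the original setup.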

Give it a try if you're interested.

Reference article: https://blog.csdn.net/gx304419380/article/details/80619043


Reposted from blog.csdn.net/weixin_39912556/article/details/86481402