HttpClient Notes

I: Introduction to HttpClient

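HttpClient started as a subproject of Apache Jakarta Commons and now lives in the Apache HttpComponents project. It provides an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol, and it supports the most recent versions and recommendations of HTTP.

The Hyper-Text Transfer Protocol (HTTP) is perhaps the most significant protocol used on the Internet today. Web services, network-enabled appliances and the growth of network computing continue to expand the role of HTTP beyond user-driven web browsers, while increasing the number of applications that require HTTP support. Although the java.net package provides basic functionality for accessing resources via HTTP, it does not provide the full flexibility or functionality that many applications need. HttpClient aims to fill this gap with an efficient, up-to-date and feature-rich package that implements the client side of the most recent HTTP standards and recommendations. Designed for extension while providing robust support for the base HTTP protocol, HttpClient may be of interest to anyone building HTTP-aware client applications such as web browsers, web service clients, or systems that leverage or extend HTTP for distributed communication.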

HttpClient home page: http://hc.apache.org/

HttpClient downloads: http://hc.apache.org/downloads.cgi

4.5.x release line (the version used in these notes): http://hc.apache.org/httpcomponents-client-4.5.x/

Official tutorial: http://hc.apache.org/httpcomponents-client-4.5.x/tutorial/html/index.html

Maven dependency:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>

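II: HttpClient usage flow

Sending a request and receiving a response with HttpClient generally takes the following steps.

  • Create an HttpClient object.
  • Create an instance of the request method and specify the request URL: an HttpGet object for a GET request, an HttpPost object for a POST request.
  • If request parameters are needed, HttpGet and HttpPost share a setParams(HttpParams params) method, but it has been deprecated since version 4.3. For a GET request, put the parameters in the query string of the URL (for example with URIBuilder); for an HttpPost object, call setEntity(HttpEntity entity) to set the request body.
  • Call execute(HttpUriRequest request) on the HttpClient object to send the request; the method returns an HttpResponse.
  • Call getAllHeaders(), getHeaders(String name) and similar methods on the HttpResponse to read the response headers; call getEntity() to obtain the HttpEntity object that wraps the response body, and read the body through that object.
  • Release the connection. Whether or not the request succeeds, the connection must be released.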
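As a rough illustration of the parameter step for GET requests, the sketch below builds a query-string URL using only the JDK. QueryStringSketch, encode and withQuery are hypothetical names introduced here, not HttpClient API; the URL and parameter names are borrowed from the GET example later in these notes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only (not part of HttpClient).
final class QueryStringSketch {

    // Form-style encoding of one key or value: spaces become '+', reserved characters become %XX.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    // Appends the parameters to baseUrl in the iteration order of the map.
    static String withQuery(String baseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl);
        char separator = baseUrl.contains("?") ? '&' : '?';
        for (Map.Entry<String, String> param : params.entrySet()) {
            url.append(separator)
               .append(encode(param.getKey()))
               .append('=')
               .append(encode(param.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("draw", "1");
        params.put("start", "0");
        params.put("length", "10");
        System.out.println(withQuery("http://localhost:8080/content/page", params));
    }
}
```

The resulting string can be passed straight to the HttpGet constructor. A LinkedHashMap is used so that the parameters keep their insertion order.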

III: HelloWorld program

1. Create the HelloWorld program

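import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HelloWorld2 {
    public static void main(String[] args) throws IOException {
        // 1. Create the HttpClient instance; try-with-resources closes it (step 5)
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // 2. Create the HttpGet request
            HttpGet httpGet = new HttpGet("http://www.java1234.com");
            // 3. Execute the GET request; the response is closed even if reading fails
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                // 4. Get the response entity and read the page content
                HttpEntity entity = response.getEntity();
                String content = EntityUtils.toString(entity, "utf-8");
                System.out.println("Page content: " + content);
            }
        }
    }
}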

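2. Create an HttpGet request

Add the dependencies (both pull in httpclient transitively; the example below uses only core httpclient classes)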

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>fluent-hc</artifactId>
    <version>4.5.5</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpmime</artifactId>
    <version>4.5.5</version>
</dependency>

import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class MyTest {
    public static void main(String[] args) {
        get();
    }

    private static void get() {
        // Create the HttpGet request (like typing the URL into the address bar)
        HttpGet httpGet = new HttpGet("http://localhost:8080/content/page?draw=1&start=0&length=10");
        // Ask for a persistent (keep-alive) connection
        httpGet.setHeader("Connection", "keep-alive");
        // Set the User-Agent so the request looks like it comes from a browser
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36");
        // Set the Cookie header
        httpGet.setHeader("Cookie", "UM_distinctid=16442706a09352-0376059833914f-3c604504-1fa400-16442706a0b345; CNZZDATA1262458286=1603637673-1530123020-%7C1530123020; JSESSIONID=805587506F1594AE02DC45845A7216A4");

        // Create the client (like opening the browser) and send the request (like pressing Enter).
        // try-with-resources closes the response and the client no matter what happens.
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse httpResponse = httpClient.execute(httpGet)) {
            HttpEntity httpEntity = httpResponse.getEntity();
            // Print the response body
            System.out.println(EntityUtils.toString(httpEntity));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

3. Create an HttpPost request

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class MyTest {
    public static void main(String[] args) {
        post();
    }

    private static void post() {
        // Create the HttpPost request
        HttpPost httpPost = new HttpPost("http://localhost:8080/content/page");
        // Ask for a persistent (keep-alive) connection
        httpPost.setHeader("Connection", "keep-alive");
        // Set the User-Agent so the request looks like it comes from a browser
        httpPost.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36");
        // Set the Cookie header
        httpPost.setHeader("Cookie", "UM_distinctid=16442706a09352-0376059833914f-3c604504-1fa400-16442706a0b345; CNZZDATA1262458286=1603637673-1530123020-%7C1530123020; JSESSIONID=805587506F1594AE02DC45845A7216A4");

        // Build the form parameters (key-value pairs of the request)
        List<NameValuePair> params = new ArrayList<NameValuePair>();
        params.add(new BasicNameValuePair("draw", "1"));
        params.add(new BasicNameValuePair("start", "0"));
        params.add(new BasicNameValuePair("length", "10"));
        // Attach the parameters as a URL-encoded form body
        httpPost.setEntity(new UrlEncodedFormEntity(params, StandardCharsets.UTF_8));

        // try-with-resources closes the response and the client no matter what happens
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse httpResponse = httpClient.execute(httpPost)) {
            HttpEntity httpEntity = httpResponse.getEntity();
            // Print the response body
            System.out.println(EntityUtils.toString(httpEntity));
        } catch (IOException e) { // also covers ClientProtocolException
            e.printStackTrace();
        }
    }
}

IV: Fetching web pages like a browser

1. Set the User-Agent request header to imitate a browser (Chrome here)

httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");

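2. Get the response content type (Content-Type)

// Get the Content-Type header of the response; getName() returns the key, getValue() returns the value
// (getContentType() returns null when the content type is unknown)
entity.getContentType().getValue();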
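The Content-Type value often carries the character set of the body, which matters because the examples in these notes pass "utf-8" to EntityUtils.toString unconditionally. The following JDK-only sketch shows one way to pull the media type and charset out of such a value. ContentTypeSketch, mimeType and charsetOf are hypothetical names introduced for illustration, and the parsing is deliberately simplified.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, for illustration only (not part of HttpClient).
final class ContentTypeSketch {

    // Returns the media type without parameters, lower-cased: "Text/HTML; charset=gbk" -> "text/html".
    static String mimeType(String headerValue) {
        int semicolon = headerValue.indexOf(';');
        String type = semicolon < 0 ? headerValue : headerValue.substring(0, semicolon);
        return type.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the charset parameter if present and known to this JVM, otherwise the fallback.
    static Charset charsetOf(String headerValue, Charset fallback) {
        String[] parts = headerValue.split(";");
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            if (part.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = part.substring("charset=".length()).trim();
                // Strip optional double quotes around the charset name.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String header = "text/html; charset=UTF-8";
        System.out.println(mimeType(header));
        System.out.println(charsetOf(header, StandardCharsets.ISO_8859_1));
    }
}
```

The fallback argument covers responses that declare no charset at all, which is common for images and other binary content.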

3. Get the response status

response.getStatusLine().getStatusCode();

200 -- OK
403 -- Forbidden
500 -- Internal server error
404 -- Page not found
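A crawler usually branches on the class of the status code instead of listing every value. The sketch below groups codes by their first digit, following the standard HTTP status classes. StatusSketch and describe are hypothetical names used only for this illustration.

```java
// Hypothetical helper, for illustration only (not part of HttpClient).
final class StatusSketch {

    // Maps a status code to its HTTP status class.
    static String describe(int statusCode) {
        switch (statusCode / 100) {
            case 1: return "informational";
            case 2: return "success";
            case 3: return "redirection";
            case 4: return "client error";
            case 5: return "server error";
            default: return "unknown";
        }
    }

    public static void main(String[] args) {
        int[] codes = {200, 403, 404, 500};
        for (int code : codes) {
            System.out.println(code + " -- " + describe(code));
        }
    }
}
```

In the examples in these notes, the value returned by response.getStatusLine().getStatusCode() could be passed to such a method before deciding whether to read the body.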

4. Example

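import java.io.IOException;

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Demo2 {
    public static void main(String[] args) throws IOException {
        // 1. Create the HttpClient instance; try-with-resources closes it (step 5)
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // 2. Create the HttpGet request
            HttpGet httpGet = new HttpGet("http://www.tuicool.com");
            // Set the User-Agent request header to imitate a browser (Chrome here)
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");

            // 3. Execute the GET request; the response is closed automatically
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                // Print the response status
                System.out.println("Status: " + response.getStatusLine().getStatusCode());

                // 4. Get the response entity
                HttpEntity entity = response.getEntity();
                // Content-Type header: getName() returns the key, getValue() the value; null if unknown
                Header contentType = entity.getContentType();
                System.out.println("Content-Type: " + (contentType != null ? contentType.getValue() : "unknown"));
                // Read the page content
//              String content = EntityUtils.toString(entity, "utf-8");
//              System.out.println("Page content: " + content);
            }
        }
    }
}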

V: Downloading an image with HttpClient

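import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.io.FileUtils; // from Apache Commons IO, a separate dependency
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class Demo1 {
    public static void main(String[] args) throws IOException {
        // 1. Create the HttpClient instance; try-with-resources closes it (step 5)
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // 2. Create the HttpGet request
            HttpGet httpGet = new HttpGet("http://www.java1234.com/uploads/allimg/161105/1-161105150121954.jpg");
            // Set the User-Agent request header to imitate a browser (Chrome here)
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");

            // 3. Execute the GET request; the response is closed automatically
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                // 4. Get the response entity
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    // Print the entity's content type
                    System.out.println("Content-Type: " + entity.getContentType().getValue());
                    // Get the entity's input stream
                    InputStream inputStream = entity.getContent();
                    // Copy the input stream into a new file
                    FileUtils.copyToFile(inputStream, new File("E://mysource/picture/aaa.jpg"));
                }
            }
        }
    }
}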
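The example above relies on FileUtils from Apache Commons IO to write the stream to disk. If that extra dependency is unwanted, the same step can be sketched with the JDK alone. DownloadSketch and save are hypothetical names, and the byte array in main merely stands in for the stream returned by entity.getContent().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, for illustration only.
final class DownloadSketch {

    // Copies the whole stream to target (creating parent directories, replacing an
    // existing file), closes the stream, and returns the number of bytes written.
    static long save(InputStream in, Path target) {
        try (InputStream body = in) {
            Path parent = target.toAbsolutePath().getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            return Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for entity.getContent(); the target path is an arbitrary temp location.
        InputStream fakeBody = new ByteArrayInputStream(new byte[] {1, 2, 3, 4});
        Path target = Paths.get(System.getProperty("java.io.tmpdir"), "download-sketch", "aaa.jpg");
        System.out.println("bytes written: " + save(fakeBody, target));
    }
}
```

Wrapping IOException in UncheckedIOException is a choice made here to keep the call site short; a crawler that retries failed downloads may prefer to let the checked exception propagate.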

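VI: Using a proxy IP with HttpClient

When crawling web pages, some target sites have anti-crawler mechanisms and will block the IP of a client that visits too frequently or in too regular a pattern.

Proxy IPs come in several kinds: transparent, anonymous, distorting and high-anonymity proxies. The variables listed for each kind are the values seen by the target server.

1. Transparent proxy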

REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Your IP

A transparent proxy nominally "hides" your IP address, but the server can still find out who you are from HTTP_X_FORWARDED_FOR.

2. Anonymous proxy

REMOTE_ADDR = proxy IP
HTTP_VIA = proxy IP
HTTP_X_FORWARDED_FOR = proxy IP

An anonymous proxy is a step up from a transparent one: the server can tell that you are using a proxy, but not who you are.

3. Distorting proxy

REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Random IP address

As with an anonymous proxy, the server can still tell that you are using a proxy, but it sees a fake IP address, which makes the disguise more convincing.

4. High-anonymity proxy (elite proxy)

REMOTE_ADDR = Proxy IP
HTTP_VIA = not determined
HTTP_X_FORWARDED_FOR = not determined

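As these values show, a high-anonymity proxy gives the server no header that reveals a proxy is in use, so it is the best choice of the four.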
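The four header patterns above can be turned into a small classifier, which is handy when checking a proxy against an echo service that reports what the server saw. This is a sketch under the assumptions of the tables above: ProxyKindSketch and classify are hypothetical names, a null argument means the header was absent, and real proxies may not fit these four categories cleanly.

```java
// Hypothetical helper, for illustration only.
final class ProxyKindSketch {

    enum Kind { TRANSPARENT, ANONYMOUS, DISTORTING, ELITE }

    // clientIp:     the real address of the machine behind the proxy
    // remoteAddr:   REMOTE_ADDR as seen by the server (the proxy's address)
    // via:          HTTP_VIA as seen by the server, or null if absent
    // forwardedFor: HTTP_X_FORWARDED_FOR as seen by the server, or null if absent
    static Kind classify(String clientIp, String remoteAddr, String via, String forwardedFor) {
        if (forwardedFor == null) {
            // No forwarded address: elite if nothing reveals the proxy. Treating "Via only"
            // as anonymous is an assumption, since the tables above do not cover that case.
            return via == null ? Kind.ELITE : Kind.ANONYMOUS;
        }
        if (forwardedFor.equals(clientIp)) {
            return Kind.TRANSPARENT; // the real address leaks through
        }
        if (forwardedFor.equals(remoteAddr)) {
            return Kind.ANONYMOUS;   // only the proxy's own address is reported
        }
        return Kind.DISTORTING;      // some other (fake) address is reported
    }

    public static void main(String[] args) {
        // Addresses from the documentation ranges, used purely as examples.
        System.out.println(classify("203.0.113.7", "198.51.100.1", "198.51.100.1", "203.0.113.7"));
        System.out.println(classify("203.0.113.7", "198.51.100.1", null, null));
    }
}
```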

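Where do proxy IPs come from? A quick web search turns up plenty of proxy-list sites. Most of them offer some free proxies, but paying a little for a commercial API is more convenient; one example is http://www.66ip.cn/

5. Example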

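import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Demo1 {
    public static void main(String[] args) throws IOException {
        // 1. Create the HttpClient instance; try-with-resources closes it (step 5)
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // 2. Create the HttpGet request
            HttpGet httpGet = new HttpGet("http://www.tuicool.com");
            // Route the request through a proxy (sample address; it may no longer be live)
            HttpHost proxy = new HttpHost("42.121.15.99", 3128);
            RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
            httpGet.setConfig(config);
            // Set the User-Agent request header to imitate a browser (Chrome here)
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");

            // 3. Execute the GET request; the response is closed automatically
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                // 4. Get the response entity and read the page content
                HttpEntity entity = response.getEntity();
                String content = EntityUtils.toString(entity, "utf-8");
                System.out.println("Page content: " + content);
            }
        }
    }
}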

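VII: HttpClient connect timeout and read timeout

When HttpClient executes a request, two separate time limits apply: one for connecting and one for reading content.

Connect timeout: the maximum time allowed from the moment HttpClient starts the request until the connection to the target host is established.

Read (socket) timeout: applies once HttpClient is connected to the target server, and limits how long it will wait for response data to arrive.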
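The difference between the two limits can be observed without HttpClient at all. The JDK-only sketch below opens a local server socket that never sends any data, connects to it with a connect timeout, and then reads with SO_TIMEOUT set, which is the socket-level option a read timeout corresponds to. TimeoutSketch and readTimesOut are hypothetical names, and the sketch illustrates the concept only; it is not how HttpClient is implemented.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical helper, for illustration only.
final class TimeoutSketch {

    // Connects to a silent local server and reports whether the read timed out.
    static boolean readTimesOut(int connectTimeoutMs, int readTimeoutMs) {
        // The server socket listens on a free loopback port but never writes anything.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket socket = new Socket()) {
            // Connect timeout: limit on establishing the TCP connection.
            socket.connect(new InetSocketAddress(server.getInetAddress(), server.getLocalPort()),
                    connectTimeoutMs);
            // Read timeout (SO_TIMEOUT): limit on each blocking wait for data.
            socket.setSoTimeout(readTimeoutMs);
            try {
                socket.getInputStream().read();
                return false; // data or end-of-stream arrived before the limit
            } catch (SocketTimeoutException e) {
                return true;  // connected fine, but no data within readTimeoutMs
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(1000, 200));
    }
}
```

The connection succeeds almost instantly because the server is local, so only the read limit is exercised here; a connect timeout would show up against a host that does not answer at all.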

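Overseas Maven repository address, used as the test target below (this hostname may no longer resolve; https://repo1.maven.org/maven2/ is the current Maven Central address): http://central.maven.org/maven2/

Example: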

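import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Demo1 {
    public static void main(String[] args) throws IOException {
        // 1. Create the HttpClient instance; try-with-resources closes it (step 5)
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // 2. Create the HttpGet request
            HttpGet httpGet = new HttpGet("http://central.maven.org/maven2/");
            // Set the connect timeout and the read timeout
            RequestConfig config = RequestConfig.custom()
                    .setConnectTimeout(1000)    // connect timeout in milliseconds
                    .setSocketTimeout(1000)     // read (socket) timeout in milliseconds
                    .build();
            httpGet.setConfig(config);
            // Set the User-Agent request header to imitate a browser (Chrome here)
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");

            // 3. Execute the GET request; the response is closed automatically
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                // 4. Get the response entity and read the page content
                HttpEntity entity = response.getEntity();
                String content = EntityUtils.toString(entity, "utf-8");
                System.out.println("Page content: " + content);
            }
        }
    }
}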


Reposted from www.cnblogs.com/itzlg/p/10699496.html