If you write crawlers, you have probably run into an error like this:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

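After searching various forums, I found the cause: the site compresses its pages, so what you fetched is actually a page that has not been decompressed yet.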
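The byte in the error message is itself a clue. A gzip stream starts with the two magic bytes 0x1f 0x8b. 0x1f is a valid ASCII control character, so UTF-8 decoding only fails at position 1, on 0x8b. Below is a small offline sketch of that; the sample payload and the helper name `looks_gzipped` are made up for illustration and are not part of the crawler.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of a gzip stream


def looks_gzipped(data: bytes) -> bool:
    """Return True if the payload starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# Hypothetical sample standing in for a compressed page body.
raw = gzip.compress("<html>hello</html>".encode("utf-8"))

# Decoding the still-compressed bytes reproduces the crawler error.
try:
    raw.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

print(looks_gzipped(raw))
print(decode_error)
```

Checking the first two bytes like this can serve as a fallback when a server compresses the body but the header check is not enough on its own.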

So the response has to be decompressed. First, though, check whether the site actually sends a compressed response. The steps are as follows:

Call urllib.request.urlopen().info() to inspect the response headers. It prints something like this:

Server: nginx/1.6.2
Date: Thu, 15 Jun 2017 03:24:02 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 3595
Connection: close
Vary: Accept-Encoding
Content-Encoding: gzip
Set-Cookie: channelid=0; Path=/
Set-Cookie: sid=1497496689230807; Path=/

The line Content-Encoding: gzip shows that the site compressed the response.
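The header check can be wrapped in a small helper. This is only a sketch: the function name `is_gzip_encoded` is hypothetical, and a hand-built `email.message.Message` stands in for `response.info()` so that it runs without a network connection (urllib returns an `http.client.HTTPMessage`, which is a subclass of `Message`).

```python
from email.message import Message


def is_gzip_encoded(headers) -> bool:
    """Check the Content-Encoding header of a response-headers object."""
    encoding = headers.get("Content-Encoding", "")
    # Tolerate different letter case or stray spaces in the header value.
    return encoding.strip().lower() == "gzip"


# Stand-in for response.info(), filled with two of the headers shown above.
sample = Message()
sample["Content-Type"] = "text/html; charset=utf-8"
sample["Content-Encoding"] = "gzip"

print(is_gzip_encoded(sample))
```

In a real crawler the call would be `is_gzip_encoded(response.info())`.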

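Decompression steps:

We need to import the gzip and io modules:

1: Wrap the fetched bytes in an in-memory binary stream with io.BytesIO(fetched bytes)

2: Decompress that stream with gzip.GzipFile(fileobj=the binary stream) and read the result

After this, the fetched page is decompressed and decodes without the error.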
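The two steps can be tried offline with a round trip. In this sketch the sample page text and the helper name `gunzip_bytes` are invented for illustration; `gzip.compress` is used only to produce a compressed payload like the one a server would send.

```python
import gzip
import io


def gunzip_bytes(compressed: bytes) -> bytes:
    """Decompress a gzip payload via io.BytesIO and gzip.GzipFile."""
    buf = io.BytesIO(compressed)           # step 1: wrap the bytes in a file-like object
    with gzip.GzipFile(fileobj=buf) as f:  # step 2: decompress while reading from it
        return f.read()


# Hypothetical page content, compressed the way a server might send it.
page = '<td data-title="IP">127.0.0.1</td>'
compressed = gzip.compress(page.encode("utf-8"))

restored = gunzip_bytes(compressed).decode("utf-8")
print(restored == page)
```

For a payload that is already fully in memory, `gzip.decompress(compressed)` gives the same bytes in a single call; the BytesIO route mirrors the steps described above.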

My crawler code:

import gzip
import io
import re
import urllib.request


def open_url(url):
    """Fetch url and return the page as text, decompressing gzip if needed."""
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36')
    response = urllib.request.urlopen(req)

    print(response.info())
    # Check whether the response body is gzip-compressed
    if response.info().get('Content-Encoding') == 'gzip':
        buf = io.BytesIO(response.read())
        gzip_f = gzip.GzipFile(fileobj=buf)  # decompress
        content = gzip_f.read()
    else:
        content = response.read()

    return content.decode('utf-8')


def get_ips(html):
    """Print the proxy IPs found in the page."""
    # Non-greedy, so each match stops at the first '<' after the IP
    p = r'<td data-title="IP">(.+?)<'
    iplist = re.findall(p, html)
    print(iplist)


if __name__ == '__main__':
    url = "http://www.kuaidaili.com/free/"
    get_ips(open_url(url))
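To try the extraction step without hitting the site, the pattern can be run against a hand-written snippet. The sample rows below are invented, and the markup is only assumed from the pattern used in the crawler. A non-greedy `(.+?)` is used here so that a match stops at the first `<`, even when several cells share one line.

```python
import re

# Capture the text of each cell marked data-title="IP".
IP_CELL = re.compile(r'<td data-title="IP">(.+?)<')


def extract_ips(html: str) -> list:
    """Return the contents of all IP cells found in the HTML text."""
    return IP_CELL.findall(html)


# Hypothetical rows; the real page layout may differ.
sample_html = (
    '<tr><td data-title="IP">10.0.0.1</td><td data-title="PORT">8080</td></tr>\n'
    '<tr><td data-title="IP">10.0.0.2</td><td data-title="PORT">3128</td></tr>\n'
)

print(extract_ips(sample_html))
```

With a greedy `(.+)` the first match on such a line would run on to the last `<` of the row and swallow the port cell as well.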

This article is mostly about error handling.

