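Python Web Scraping: The urllib Library

Notes on learning the urllib library for Python web scraping: automatically simulating HTTP requests, exception handling, and browser disguise techniques for crawlers.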

1. urllib basics

1.1 urlretrieve() fetches a web page and saves it straight to a local file

>>> import urllib.request

>>> urllib.request.urlretrieve('https://blog.csdn.net/','e:/scrapy.html')

('e:/scrapy.html', <http.client.HTTPMessage object at 0x042DE810>)

1.2 urlcleanup() clears the cache (temporary files) left behind by urlretrieve()

>>> urllib.request.urlcleanup()
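
The two calls above fit together roughly as follows. This is only a minimal sketch, assuming no filename is passed, in which case urlretrieve() writes to a temporary file of its own choosing. The helper name download_to_temp and the data: URL in the usage note are illustrative and do not come from the original post.

```python
import os
import urllib.request


def download_to_temp(url):
    """Fetch url into a temporary file, read it back, then clean up."""
    # With no filename argument, urlretrieve() picks a temporary file itself.
    path, headers = urllib.request.urlretrieve(url)
    with open(path, 'rb') as fh:
        body = fh.read()
    # urlcleanup() deletes the temporary files left behind by urlretrieve().
    urllib.request.urlcleanup()
    # Report whether the temporary file is still on disk after the cleanup.
    return body, os.path.exists(path)
```

To try it without touching the network, a data: URL such as download_to_temp('data:text/plain,hello') should be enough; the second element of the result shows whether the temporary file survived the cleanup.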

1.3 info() shows basic information about the response (its headers)

>>> file=urllib.request.urlopen('https://blog.csdn.net/')

>>> file.info()

<http.client.HTTPMessage object at 0x042044D0>

1.4 getcode() returns the status code of the response (200 means the request succeeded); geturl() returns the URL of the page that was actually fetched

>>> file.getcode()

200

>>> file.geturl()

'https://blog.csdn.net/'
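
The three calls from 1.3 and 1.4 can be gathered into one helper. This is a sketch under assumptions: the function name summarize_response and the 10-second default timeout are made up for illustration.

```python
import urllib.request


def summarize_response(url, timeout=10):
    """Return the status code, final URL, and headers reported for url."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        headers = resp.info()  # a message object holding the response headers
        return {
            'status': resp.getcode(),    # e.g. 200 for a successful HTTP request
            'final_url': resp.geturl(),  # may differ from url after a redirect
            'content_type': headers.get_content_type(),
            'headers': dict(headers.items()),
        }
```

Since Python 3.9 the documentation prefers the attributes resp.status, resp.url, and resp.headers over getcode(), geturl(), and info(); the older method names are kept here to match the session above.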

1.5 Timeout settings:

With timeout set to 0.1, the request times out

>>> file=urllib.request.urlopen('https://blog.csdn.net/',timeout=0.1)

Traceback (most recent call last):

File "C:\Users\陈伟\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 1317, in do_open

encode_chunked=req.has_header('Transfer-encoding'))

File "C:\Users\陈伟\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1229, in request

With timeout=1, the request succeeds

>>> file=urllib.request.urlopen('https://blog.csdn.net/',timeout=1)

>>>

With timeout=0.5 and 100 fetches in a row, some of the requests raise exceptions

>>> for i in range(0,100):
...     try:
...         file=urllib.request.urlopen('https://blog.csdn.net/',timeout=0.5)
...         data=file.read()
...         print(len(data))
...     except Exception as e:
...         print('Exception occurred: '+str(e))
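
When some requests time out like this, a common follow-up is to retry a few times before giving up. The sketch below is one possible way to do that, not something from the original post. The name fetch_with_retry, the retries and delay parameters, and the injectable opener argument (which lets the function be exercised without a network) are all assumptions.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, timeout=0.5, retries=3, delay=1.0,
                     opener=urllib.request.urlopen):
    """Try url up to `retries` times; return the body or raise the last error."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            with opener(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, socket.timeout) as exc:
            # Connection-phase timeouts arrive wrapped in URLError;
            # timeouts while reading the body arrive as socket.timeout.
            last_error = exc
            print(f'attempt {attempt} failed: {exc}')
            if attempt < retries:
                time.sleep(delay)  # brief pause before the next attempt
    raise last_error
```

A call such as fetch_with_retry('https://blog.csdn.net/', timeout=0.5) would then make up to three attempts instead of failing on the first timeout.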

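2. Automatically simulating HTTP requests

A client talks to a server through HTTP requests. HTTP defines many request methods; the two covered here are GET and POST, which come up, for example, when searching for information or logging in.

2.1 Simulating a GET request: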

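import urllib.parse
import urllib.request

keywd='你好'
# Percent-encode the keyword so that non-ASCII text (such as Chinese) is safe in a URL
keywd=urllib.parse.quote(keywd)
# Use http here rather than https; with https the author got no content back
url='http://www.baidu.com/s?wd='+keywd+'&ie=utf-8&tn=baidu'
req=urllib.request.Request(url)  # wrap the URL in a Request object
data=urllib.request.urlopen(req).read()
fh=open('e:/1.html','wb')
fh.write(data)
fh.close()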
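
As an alternative to joining the query string by hand, urllib.parse.urlencode() can encode all the parameters in one step. This is a small sketch; the helper name build_search_url is hypothetical, while the parameter names wd, ie, and tn are taken from the URL above.

```python
import urllib.parse


def build_search_url(keyword, base='http://www.baidu.com/s'):
    """Return base plus a percent-encoded query string for keyword."""
    # urlencode() percent-encodes each value (as UTF-8 by default) and joins
    # the pairs with '&', so no manual string concatenation is needed.
    query = urllib.parse.urlencode({'wd': keyword, 'ie': 'utf-8', 'tn': 'baidu'})
    return base + '?' + query


print(build_search_url('你好'))
# http://www.baidu.com/s?wd=%E4%BD%A0%E5%A5%BD&ie=utf-8&tn=baidu
```

The resulting string can be passed to urllib.request.Request() exactly like the hand-built URL.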

2.2 Handling a POST request:

import urllib.request
import urllib.parse

url='http://www.iqianyue.com/mypost/'
# Encode the form fields and convert them to bytes for the request body
mydata=urllib.parse.urlencode(
    {'name':'hello',
     'pass':'123456'
    }).encode('utf-8')
req=urllib.request.Request(url,mydata)  # passing data makes this a POST request
# Submit the form
data=urllib.request.urlopen(req).read()
fh=open('e:/2.html','wb')
fh.write(data)
fh.close()
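
To see what the Request object looks like before it is sent, it can be built and inspected without opening a connection. This is a minimal sketch; the helper name build_post_request is an assumption, and the URL and form fields are the ones used above.

```python
import urllib.parse
import urllib.request


def build_post_request(url, fields):
    """Return a Request whose body is the form-encoded fields."""
    body = urllib.parse.urlencode(fields).encode('utf-8')
    # Supplying data is what makes urllib send POST instead of GET.
    return urllib.request.Request(url, data=body)


req = build_post_request('http://www.iqianyue.com/mypost/',
                         {'name': 'hello', 'pass': '123456'})
print(req.get_method())  # POST
print(req.data)          # b'name=hello&pass=123456'
```

Nothing is sent over the network until the request is handed to urllib.request.urlopen().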

3. Exception handling

Common status codes and their meanings

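      301 Moved Permanently: permanent redirect to a new URL
      302 Found: temporary redirect to another URL
      304 Not Modified: the requested resource has not changed
      400 Bad Request: malformed request
      401 Unauthorized: the request has not been authorized
      403 Forbidden: access is forbidden
      404 Not Found: no matching page was found
      500 Internal Server Error: an error occurred inside the server
      501 Not Implemented: the server does not support the functionality needed to fulfill the request

URLError and HTTPError are both exception classes, and HTTPError is a subclass of URLError. HTTPError carries a status code and a reason, whereas URLError has no status code, so URLError cannot simply stand in for HTTPError. If you do catch URLError in its place, you must check whether the exception has a status code attribute before using it.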

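Causes of URLError:

1. The server cannot be reached

2. The remote URL does not exist

3. There is no local network connection

4. The HTTPError subclass was triggered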

import urllib.error
import urllib.request

try:
    urllib.request.urlopen('http://blog.csdn.net')
except urllib.error.URLError as e:
    if hasattr(e,'code'):  # check for a status code (only HTTPError has one)
        print(e.code)
    if hasattr(e,'reason'):  # check that the reason attribute exists
        print(e.reason)
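
Another way to tell the two classes apart is isinstance() rather than hasattr(). The sketch below is illustrative only: the function names describe_error and fetch, the returned message format, and the 10-second timeout are assumptions, not part of the original post.

```python
import urllib.error
import urllib.request


def describe_error(exc):
    """Turn a urllib exception into a short, printable description."""
    # Check the subclass first: every HTTPError is also a URLError.
    if isinstance(exc, urllib.error.HTTPError):
        return f'HTTP error {exc.code}: {exc.reason}'
    if isinstance(exc, urllib.error.URLError):
        return f'URL error: {exc.reason}'
    return f'other error: {exc}'


def fetch(url, timeout=10):
    """Return (body, None) on success or (None, description) on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read(), None
    except urllib.error.URLError as exc:  # also catches HTTPError
        return None, describe_error(exc)
```

A caller can then write body, err = fetch('http://blog.csdn.net') and print err whenever body is None.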

4. Browser disguise techniques for crawlers:

import urllib.request

url='https://blog.csdn.net/xx20cw/article/details/84144536'
# (header name, header value); the value must not repeat the 'User-Agent:' prefix
headers=('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134')
opener=urllib.request.build_opener()  # create an opener to carry the header
opener.addheaders=[headers]  # attach the header information
data=opener.open(url).read()
fh=open('e:/3.html','wb')
fh.write(data)
fh.close()
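
The same disguise can also be applied per request, by passing a headers dictionary to Request instead of configuring an opener. This is a minimal sketch; the names BROWSER_UA and browser_request are hypothetical, and the User-Agent string is just an example value that any real browser's string could replace.

```python
import urllib.request

# Example value only; any real browser's User-Agent string would do.
BROWSER_UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36')


def browser_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


req = browser_request('https://blog.csdn.net/')
# Request stores header names capitalized, hence 'User-agent' in the lookup.
print(req.get_header('User-agent'))
```

Passing req to urllib.request.urlopen() then sends the request with that header, which is convenient when different requests need different headers.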