Python 3 Web Scraping from Scratch: Using the urllib Library (Part 1)

Official documentation: https://docs.python.org/3/library/urllib.html

urllib contains four modules:

urllib.request for opening and reading URLs
urllib.error containing the exceptions raised by urllib.request
urllib.parse for parsing URLs
urllib.robotparser for parsing robots.txt files
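
As a quick orientation, the sketch below exercises urllib.parse and urllib.robotparser without making any network request. The URL and the robots.txt rules are made up for illustration; a real crawler would normally download robots.txt from the target site instead of passing the lines in by hand.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# urllib.parse: split a URL into its components (the URL is hypothetical).
parts = urlparse('http://www.example.com/search?q=python#top')
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# urllib.robotparser: feed robots.txt rules directly instead of downloading them.
rules = [
    'User-agent: *',
    'Disallow: /private/',
]
robots = RobotFileParser()
robots.parse(rules)
allowed_public = robots.can_fetch('*', 'http://www.example.com/index.html')
allowed_private = robots.can_fetch('*', 'http://www.example.com/private/data.html')
print(allowed_public, allowed_private)
```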

1. Sending requests

(1) The urlopen() function fetches a web page:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Example 1:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')

print(response.read())

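As you can see, the output is a raw bytes object that is hard to read. Call decode() to convert it to a string:

print(response.read().decode('utf-8'))

Compare this with the page source shown in the browser:

The two are identical.

Note: do not inspect the source in the browser's [Elements] panel. The content shown there may have been modified by JavaScript (if the page uses any) and can differ from the page that was originally requested.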
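
Hard-coding 'utf-8' works for many pages but not all. One possible refinement, shown here only as a sketch, is to decode with the charset the server declares in its Content-Type header and fall back to UTF-8 otherwise. The helper name read_text and its fallback parameter are hypothetical, not part of urllib.

```python
def read_text(response, fallback='utf-8'):
    """Read the whole body once and decode it with the declared charset.

    `response` is expected to be the object returned by urllib.request.urlopen().
    """
    # The body can only be consumed once; a second read() would return b''.
    raw = response.read()
    # headers is an email.message.Message subclass, so get_content_charset() is available.
    charset = response.headers.get_content_charset() or fallback
    return raw.decode(charset, errors='replace')
```

It could be called as read_text(urllib.request.urlopen('http://www.baidu.com')) after importing urllib.request.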

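Key parameters:

data parameter: if supplied, the value must first be converted to bytes, for example with bytes(). Supplying data also changes the request method from GET to POST.

Example 2: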

import urllib.request

import urllib.parse

data = bytes(urllib.parse.urlencode({'hello': 'world!'}),encoding='utf-8')

response = urllib.request.urlopen('http://httpbin.org/post',data)

print(response.read().decode('utf-8'))

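Output:

The submitted parameters appear in the form field, which shows that the request simulated a form submission and sent the data via POST.

Note 1: the second argument of bytes() specifies the encoding.

Note 2: urlencode() converts a dict into a URL-encoded query string.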
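
The two conversion steps and the switch to POST can be checked locally, without sending anything. The sketch below builds Request objects only to ask which method they would use; the httpbin URL is reused from Example 2.

```python
from urllib.parse import urlencode
from urllib.request import Request

form = {'hello': 'world!'}
query = urlencode(form)                 # dict -> percent-encoded 'key=value' string
data = bytes(query, encoding='utf-8')   # str -> bytes, which urlopen() requires

# Constructing a Request does not send it; it only records what would be sent.
without_data = Request('http://httpbin.org/post')
with_data = Request('http://httpbin.org/post', data=data)
print(query, data)
print(without_data.get_method(), with_data.get_method())
```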

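timeout parameter: sets the timeout in seconds. Combined with a try/except statement, it lets the crawler skip a page that takes too long to respond.

Example: using the timeout setting.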

import urllib.request

import socket

import urllib.error

try:

    response = urllib.request.urlopen('http://httpbin.org/get',timeout=0.01)

except urllib.error.URLError as e:

    if isinstance(e.reason,socket.timeout):

        print('TIME OUT')

Note: isinstance() checks whether e.reason is an instance of socket.timeout.
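
For the "skip on timeout" idea, a small wrapper can be convenient. The sketch below is one way to write it, assuming that returning None is an acceptable signal for "skipped"; fetch_or_none is a hypothetical helper name. It also catches a bare socket.timeout, because a timeout that occurs while the response is being read is not always wrapped in URLError.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=5.0):
    """Return the response body as bytes, or None if the request timed out.

    fetch_or_none is a hypothetical helper, not part of urllib.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Timeouts raised while connecting or sending are wrapped in URLError.
        if isinstance(e.reason, socket.timeout):
            return None
        raise  # any other failure is left for the caller to handle
    except socket.timeout:
        # Timeouts raised while reading the response may surface unwrapped.
        return None
```

A crawl loop could then call fetch_or_none(url, timeout=1) for each URL and simply continue when the result is None.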

(2) Request

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Example 3:

import urllib.request

request = urllib.request.Request('http://www.baidu.com')

response = urllib.request.urlopen(request)

print(response.read().decode('utf-8'))

In other words, a Request object lets us build a request with richer data and parameters instead of passing only a URL.
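
To see what a Request carries beyond the URL, its attributes can be inspected before anything is sent. The sketch below uses the method parameter from the signature above; the /put endpoint and the payload are chosen only for illustration.

```python
from urllib.request import Request

# Constructing a Request is purely local; nothing is sent until urlopen() is called.
req = Request('http://httpbin.org/put', data=b'name=Germey', method='PUT')
print(req.full_url)      # the URL the request targets
print(req.host)          # host parsed out of the URL
print(req.get_method())  # the explicit method overrides the GET/POST default
print(req.data)          # the request body as bytes
```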

Example 4:

from urllib import request,parse

url = 'http://httpbin.org/post'

headers = {

    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',

    'Host':'httpbin.org'

}

form_data = {'name': 'Germey'}  # renamed from dict to avoid shadowing the built-in

data = bytes(parse.urlencode(form_data), encoding='utf-8')

req = request.Request(url = url,data=data,headers=headers,method='POST')

response = request.urlopen(req)

print(response.read().decode('utf-8'))

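Note: the headers parameter is a dict. Headers can be passed in when the request is constructed, or added afterwards by calling the add_header() method of the Request instance.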
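
The add_header() route looks like the sketch below. One detail worth knowing is that Request stores header names in capitalized form ('User-agent'), so that is the spelling to use when reading a header back.

```python
from urllib.request import Request

req = Request('http://httpbin.org/post', data=b'name=Germey', method='POST')
# add_header() sets one header at a time on an existing Request.
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

# Header names are stored via str.capitalize(), i.e. as 'User-agent'.
print(req.has_header('User-agent'))
print(req.get_header('User-agent'))
print(req.header_items())
```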

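Note: the most common use is to set User-Agent so that the request appears to come from a browser. If the 'User-Agent' entry above is removed, urllib sends its default value, Python-urllib/ followed by the Python version, and httpbin echoes that value back in the headers field of its response.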
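
The default User-Agent can be inspected locally on an opener, without contacting any server. The sketch below also shows an alternative that the article does not use: replacing the opener's addheaders list, so that every request made through that opener carries the browser-like value.

```python
import urllib.request

# When no opener has been installed, urlopen() relies on one built like this.
opener = urllib.request.build_opener()
default_headers = list(opener.addheaders)
print(default_headers)  # headers the opener adds when the request lacks them

# Alternative to per-request headers: set them once on the opener.
opener.addheaders = [('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')]
```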

(3) Cookies

Example 5:

import http.cookiejar,urllib.request

cookie = http.cookiejar.CookieJar()  # create a CookieJar object

handler = urllib.request.HTTPCookieProcessor(cookie)  # wrap it in an HTTPCookieProcessor handler

opener = urllib.request.build_opener(handler)  # build an opener from the handler with build_opener()

response = opener.open('http://www.baidu.com')  # finally, call open() to send the request

for item in cookie:

    print(item.name+"="+item.value)
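
The other half of the handler's job is sending stored cookies back. The sketch below illustrates that offline: it puts a hand-built cookie into a CookieJar and calls add_cookie_header(), which is the method HTTPCookieProcessor invokes on each outgoing request. The cookie name, value, and the example.com domain are made up, and make_cookie is a hypothetical helper.

```python
import http.cookiejar
import urllib.request


def make_cookie(name, value, domain):
    """Build a session cookie by hand (normally a Set-Cookie header does this)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


jar = http.cookiejar.CookieJar()
jar.set_cookie(make_cookie('sessionid', 'abc123', 'example.com'))

# Calling add_cookie_header() directly shows its effect without network traffic.
req = urllib.request.Request('http://example.com/')
jar.add_cookie_header(req)
print(req.get_header('Cookie'))
```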

Example 6: saving cookies to a text file

import http.cookiejar,urllib.request

filename = 'cookies.txt'  # name of the output file

cookie = http.cookiejar.MozillaCookieJar(filename)  # create a MozillaCookieJar object

handler = urllib.request.HTTPCookieProcessor(cookie)  # wrap it in an HTTPCookieProcessor handler

opener = urllib.request.build_opener(handler)  # build an opener from the handler with build_opener()

response = opener.open('http://www.baidu.com')  # finally, call open() to send the request

cookie.save(ignore_discard = True,ignore_expires = True)

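Here MozillaCookieJar is used in place of CookieJar. It is a subclass of CookieJar that handles reading and writing cookies in files, using the Mozilla browser's cookies.txt format.

Running the code produces cookies.txt with the following content:

Additionally, LWPCookieJar can also be used to read and save cookies:

cookie = http.cookiejar.LWPCookieJar(filename)  # create an LWPCookieJar object

The generated content then looks as follows: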
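
The difference between the two file formats can be seen without contacting any site. The sketch below saves the same hand-built cookie with both jar classes into a temporary directory and prints the first line of each file. The cookie values and the make_cookie helper are made up for illustration.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie by hand (hypothetical helper for this sketch)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


tmpdir = tempfile.mkdtemp()
first_lines = {}
for jar_class in (http.cookiejar.MozillaCookieJar, http.cookiejar.LWPCookieJar):
    path = os.path.join(tmpdir, jar_class.__name__ + '.txt')
    jar = jar_class(path)
    jar.set_cookie(make_cookie('sessionid', 'abc123', 'example.com'))
    # ignore_discard=True keeps session cookies, which save() would otherwise skip.
    jar.save(ignore_discard=True, ignore_expires=True)
    with open(path, encoding='utf-8') as f:
        first_lines[jar_class.__name__] = f.readline().rstrip()

print(first_lines)
```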

Example 7: loading and using saved cookies

import http.cookiejar,urllib.request

cookie = http.cookiejar.LWPCookieJar()

cookie.load('cookies.txt',ignore_discard=True,ignore_expires=True)

handle = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handle)

response = opener.open('http://www.baidu.com')

print(response.read().decode('utf-8'))

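This finally prints the source of the Baidu page. Note that cookies.txt must have been saved by LWPCookieJar for this load() call to succeed; a file written by MozillaCookieJar has to be loaded with MozillaCookieJar.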
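
The format requirement on load() can be demonstrated offline. The sketch below saves one hand-built cookie in both formats, loads the LWP file back with LWPCookieJar, and then shows that loading the Mozilla file with LWPCookieJar raises http.cookiejar.LoadError. File names, cookie values, and the make_cookie helper are made up for illustration.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie by hand (hypothetical helper for this sketch)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


tmpdir = tempfile.mkdtemp()
lwp_path = os.path.join(tmpdir, 'cookies_lwp.txt')
mozilla_path = os.path.join(tmpdir, 'cookies_mozilla.txt')

for jar_class, path in ((http.cookiejar.LWPCookieJar, lwp_path),
                        (http.cookiejar.MozillaCookieJar, mozilla_path)):
    jar = jar_class(path)
    jar.set_cookie(make_cookie('sessionid', 'abc123', 'example.com'))
    jar.save(ignore_discard=True, ignore_expires=True)

# Same format on both sides: the cookie comes back.
loaded = http.cookiejar.LWPCookieJar()
loaded.load(lwp_path, ignore_discard=True, ignore_expires=True)
loaded_names = [c.name for c in loaded]

# Mismatched format: LWPCookieJar refuses a Mozilla-format file.
try:
    http.cookiejar.LWPCookieJar().load(mozilla_path, ignore_discard=True)
    mismatch_error = None
except http.cookiejar.LoadError as e:
    mismatch_error = e

print(loaded_names, type(mismatch_error).__name__)
```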
