A few words up front: in the previous article we did some hands-on web scraping with urllib, but as that exercise showed, urllib is fairly cumbersome to use. So next we will learn a more powerful and more convenient HTTP request library: requests
Friendly reminder: the author is using Windows 10 and Python 3.6.5
For the authoritative reference, see the official documentation: http://www.python-requests.org/en/master/
I. Introduction to requests
requests is a convenient, easy-to-use HTTP request library. It is more convenient and faster than Python's built-in urllib, making it a great next step for scraping work. Its documentation quotes The Zen of Python (PEP 20):
- Beautiful is better than ugly.
- Explicit is better than implicit.
- Simple is better than complex.
- Complex is better than complicated.
- Readability counts.
The requests module can be installed with the following command
$ pip install requests
II. Basic usage of requests
Below we use http://www.httpbin.org/ as our test site. It provides a rich set of endpoints that echo back information about the request you send, which makes it ideal for inspection.
1. Sending a request: requests.get(url)
The simplest use of requests is requests.get(url), which takes a single parameter: url, a string specifying the target site's URL. Note that the method returns a Response object, whose most commonly used attributes and methods are:
- response.url: the URL of the request
- response.status_code: the status code of the response
- response.encoding: the encoding of the response
- response.headers: the response headers (dict-like)
- response.content: the response body (bytes type)
- response.text: the response body (str type), i.e. response.content decoded with response.encoding (requests guesses an encoding from the headers when none is declared)
- response.json(): the response body parsed as JSON, equivalent to
json.loads(response.text)
>>> import requests
>>> import json
>>> response = requests.get('http://www.httpbin.org/get')
#check the type of response
>>> type(response)
<class 'requests.models.Response'>
#the URL of the request
>>> response.url
'http://www.httpbin.org/get'
#the status code of the response
>>> response.status_code
200
#check the encoding (None here, because the Content-Type header declares no charset)
>>> print(response.encoding)
None
#the response headers (dict-like)
>>> response.headers
{'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Date': 'Sat, 18 Aug 2018 02:00:23 GMT', 'Content-Type': 'application/json', 'Content-Length': '275', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true', 'Via': '1.1 vegur'}
#check the response body (bytes type)
>>> type(response.content)
<class 'bytes'>
>>> response.content
b'{\n "args": {}, \n "headers": {\n "Accept": "*/*", \n "Accept-Encoding": "gzip, deflate", \n "Connection": "close", \n "Host": "www.httpbin.org", \n "User-Agent": "python-requests/2.19.1"\n }, \n "origin": "116.16.107.178", \n "url": "http://www.httpbin.org/get"\n}\n'
#check the response body (str type), decoded from response.content
>>> type(response.text)
<class 'str'>
>>> response.text
'{\n "args": {}, \n "headers": {\n "Accept": "*/*", \n "Accept-Encoding": "gzip, deflate", \n "Connection": "close", \n "Host": "www.httpbin.org", \n "User-Agent": "python-requests/2.19.1"\n }, \n "origin": "116.16.107.178", \n "url": "http://www.httpbin.org/get"\n}\n'
#check the response body parsed as JSON, equivalent to json.loads(response.text)
>>> type(response.json())
<class 'dict'>
>>> response.json()
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'www.httpbin.org', 'User-Agent': 'python-requests/2.19.1'}, 'origin': '116.16.107.178', 'url': 'http://www.httpbin.org/get'}
Note: the request methods in requests all share a similar interface; just replace get with the desired method name. For example, a POST request is sent with requests.post(url).
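For example, httpbin.org provides a matching endpoint for each verb, so we can try the other common methods directly:

```python
import requests

# Each HTTP verb has a same-named convenience function in requests;
# httpbin.org echoes the request back so the calls can be inspected.
r_post = requests.post('http://www.httpbin.org/post', data={'key': 'value'})
r_put = requests.put('http://www.httpbin.org/put', data={'key': 'value'})
r_delete = requests.delete('http://www.httpbin.org/delete')
r_head = requests.head('http://www.httpbin.org/get')  # fetches headers only, no body

print(r_post.status_code, r_put.status_code, r_delete.status_code, r_head.status_code)
```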
2. Extra parameters: requests.post(url, data, headers)
The data parameter attaches form data to the request, and the headers parameter attaches custom request headers.
>>> import requests
>>> url = 'http://www.httpbin.org/post'
>>> data = {
'from':'AUTO',
'to':'AUTO'
}
>>> headers = {
'USER-AGENT':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
>>> response = requests.post(url=url,data=data,headers=headers)
>>> print(response.text)
{
"args": {},
"data": "",
"files": {},
"form": { #我们设定的请求信息
"from": "AUTO",
"to": "AUTO"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Connection": "close",
"Content-Length": "17",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "www.httpbin.org",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36" #我们设定的请求头部
},
"json": null,
"origin": "116.16.107.178",
"url": "http://www.httpbin.org/post"
}
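Note that for GET requests the counterpart of data is the params argument, which is urlencoded into the query string rather than sent as a form body; a minimal sketch:

```python
import requests

# params is encoded into the URL's query string
payload = {'from': 'AUTO', 'to': 'AUTO'}
response = requests.get('http://www.httpbin.org/get', params=payload)

print(response.url)              # ...get?from=AUTO&to=AUTO
print(response.json()['args'])   # the server echoes the query parameters back
```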
3. Timeouts: requests.get(url, timeout)
The timeout parameter sets how long to wait for a response; if none arrives within that time, an exception is raised.
>>> import requests
>>> try:
response = requests.get('http://www.httpbin.org/get', timeout=0.1)
except requests.exceptions.RequestException as e:
if isinstance(e,requests.exceptions.Timeout):
print("Time out")
Time out
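timeout also accepts a (connect, read) tuple when you want to limit the connection phase and the read phase separately:

```python
import requests

try:
    # 3.05 seconds to establish the connection, 10 seconds to read the response
    response = requests.get('http://www.httpbin.org/get', timeout=(3.05, 10))
    print(response.status_code)
except requests.exceptions.Timeout:
    print("Time out")
```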
4. Proxies: requests.get(url, proxies)
The proxies parameter specifies the proxies to use; proxies is a dict mapping each protocol scheme to a proxy URL.
>>> import requests
>>> url = 'http://www.httpbin.org/ip'
>>> proxies = {
'http':'182.88.178.128:8123',
'https':'61.135.217.7:80'
}
>>> response = requests.get(url=url,proxies=proxies)
>>> print(response.text)
{
"origin": "182.88.178.128"
}
5. Using cookies: requests.get(url, cookies)
The cookies parameter attaches cookies to the request; cookies is a dict.
>>> import requests
#send cookies along with the request
>>> url = 'http://www.httpbin.org/cookies'
>>> cookies = {
'name_one':'value_one',
'name_two':'value_two'
}
>>> response = requests.get(url=url,cookies=cookies)
>>> print(response.text)
{
"cookies": {
"name_one": "value_one",
"name_two": "value_two"
}
}
As an aside, you can use the response.cookies attribute to read the cookies the site returns.
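A quick sketch of this: httpbin's /cookies/set endpoint answers with a Set-Cookie header, so with redirects disabled the cookie shows up in response.cookies:

```python
import requests

# The server replies with Set-Cookie; response.cookies is a cookie jar
response = requests.get('http://www.httpbin.org/cookies/set/foo/bar',
                        allow_redirects=False)

print(response.cookies['foo'])   # the value set by the server
print(dict(response.cookies))
```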
6. Session persistence: requests.Session()
A Session object keeps certain parameters across requests; the following example persists cookies between requests.
>>> import requests
>>> session = requests.Session()
#the server sets a cookie, which the session stores
>>> session.get('http://httpbin.org/cookies/set/name/value')
<Response [200]>
#send another GET request; the stored cookie is sent automatically
>>> response = session.get('http://httpbin.org/cookies')
>>> print(response.text)
{
"cookies": {
"name": "value"
}
}
A Session object also lets us set default values that apply to every request it sends:
>>> import requests
>>> session = requests.Session()
>>> session.auth = ('user', 'password')
>>> session.headers.update({'From': 'AUTO'})
>>> response = session.get('http://httpbin.org/headers')
>>> print(response.text)
{
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Authorization": "Basic dXNlcjpwYXNzd29yZA==",
"Connection": "close",
"From": "AUTO", #我们设置的请求头信息
"Host": "httpbin.org",
"User-Agent": "python-requests/2.19.1"
}
}
7. Authentication: requests.get(url, auth)
The auth parameter supplies the credentials for HTTP basic authentication; auth is a (username, password) tuple.
>>> import requests
>>> response = requests.get(url='http://www.httpbin.org/basic-auth/user/password',auth=('user','password'))
>>> print(response.text)
{
"authenticated": true,
"user": "user"
}
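The (user, password) tuple is shorthand for requests.auth.HTTPBasicAuth; spelling it out makes it easy to swap in another scheme such as HTTPDigestAuth:

```python
import requests
from requests.auth import HTTPBasicAuth

# Equivalent to auth=('user', 'password')
response = requests.get('http://www.httpbin.org/basic-auth/user/password',
                        auth=HTTPBasicAuth('user', 'password'))

print(response.status_code)
print(response.json())
```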
8. Certificate verification: requests.get(url, verify=True)
The verify parameter controls whether the site's SSL certificate is verified. It defaults to True; set it to False to skip verification.
>>> import requests
>>> response = requests.get(url='https://www.httpbin.org/',verify=False)
In that case, however, a warning is usually printed, because certificate verification is strongly recommended. To suppress the warning, run:
>>> requests.packages.urllib3.disable_warnings()
A few words in closing: in this article we learned the basics of requests; in the next article we will use requests for some simple hands-on practice. Thanks for reading!