-
Using requests: Cookies, login authentication, proxy settings, and more
When handling page authentication and Cookies with urllib, you have to write Opener and Handler objects. To make these operations more convenient, there is the more powerful requests library.
-
-
Example: basic use of the requests library
```python
import requests

r = requests.get('http://www.baidu.com/')
print(type(r), r.status_code, r.text, r.cookies, sep='\n\n')


# Output:
# <class 'requests.models.Response'>
#
# 200
#
# <!DOCTYPE html>
# <!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible
# ......
# <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
#
# <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
```
-
GET requests
- A GET request returns the corresponding response information
- requests.get(url, params, **kwargs)
- url is the URL of the page to fetch; params holds extra URL parameters (a dict or byte stream); **kwargs covers 12 keyword arguments that control access
```python
import requests

r = requests.get('http://httpbin.org/get')
print(r.text)


# Output:
# {
#   "args": {},
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Host": "httpbin.org",
#     "User-Agent": "python-requests/2.21.0"
#   },
#   "origin": "120.85.108.192, 120.85.108.192",
#   "url": "https://httpbin.org/get"
# }

# The result contains the request headers, URL, IP address, and other information
```
```python
import requests

data = {
    'name': 'LiYihua',
    'age': '21'
}
r = requests.get('http://httpbin.org/get', params=data)
print(r.text)


# Output:
# {
#   "args": {
#     "age": "21",
#     "name": "LiYihua"
#   },
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Host": "httpbin.org",
#     "User-Agent": "python-requests/2.21.0"
#   },
#   "origin": "120.85.108.92, 120.85.108.92",
#   "url": "https://httpbin.org/get?name=LiYihua&age=21"
# }
```
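How params is serialized into the query string can also be checked without any network round-trip by preparing the request locally; a sketch (nothing here is actually sent):

```python
import requests

data = {
    'name': 'LiYihua',
    'age': '21'
}
# .prepare() serializes the params dict into the query string
# without sending anything over the network.
req = requests.Request('GET', 'http://httpbin.org/get', params=data).prepare()
print(req.url)
# http://httpbin.org/get?name=LiYihua&age=21
```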
```python
import requests

r = requests.get('http://httpbin.org/get')
print(type(r.text), r.json(), type(r.json()), sep='\n\n')


# Output:
# <class 'str'>
#
# {'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.21.0'}, 'origin': '120.85.108.92, 120.85.108.92', 'url': 'https://httpbin.org/get'}
#
# <class 'dict'>

# The json() method parses a JSON-formatted response body into a dict
```
Fetching binary data
```python
import requests

r = requests.get('https://github.com/favicon.ico')
print(r.text, r.content, sep='\n\n')

# response.content returns bytes.
# To fetch an image or another file, use r.content

# response.text returns a (Unicode) str.
# To fetch text, use r.text

# Output:
# :�������OL��......
#
# b'\x00\x00\x01\x00\x02\x00\x10\x10\x00\x00\x0......
```
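The content/text distinction is just Python's bytes/str distinction: content holds the raw bytes exactly as received, and text is those bytes decoded to a string. A minimal offline illustration (no request involved, just hand-picked bytes):

```python
raw = b'\xe4\xbd\xa0\xe5\xa5\xbd'  # raw bytes, the kind of value r.content holds
text = raw.decode('utf-8')         # decoded str, the kind of value r.text gives
print(type(raw))   # <class 'bytes'>
print(type(text))  # <class 'str'>
print(text)        # 你好
```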
Saving the fetched image
```python
import requests

r = requests.get('https://github.com/favicon.ico')
with open('favicon.ico', 'wb') as f:
    f.write(r.content)

# After it runs, this produces an icon file named favicon.ico
```
The open() function and the with...as statement used in the previous example
```
# The open() function
# def open(file, mode='r', buffering=None, encoding=None, errors=None, newline=None, closefd=True)
# Common parameters:
#   file      - the file to open
#   mode      - the mode in which to open the file: read-only, write, append, etc.
#   buffering - 0 disables buffering; 1 buffers line by line when accessing
#               the file; an integer greater than 1 gives the size of the
#               buffer; a negative value uses the system default buffer size

# The mode parameter
========= ===============================================================
Character Meaning
--------- ---------------------------------------------------------------
'r'       open for reading (default)
'w'       open for writing, truncating the file first
'x'       create a new file and open it for writing
'a'       open for writing, appending to the end if the file exists
'b'       binary mode
't'       text mode (default)
'+'       open a disk file for updating (reading and writing)
'U'       universal newlines mode (deprecated)
========= ===============================================================
```

The with...as statement

Some tasks need setup work beforehand and cleanup work afterwards. For scenarios like this, Python's with statement offers a very convenient way of handling things. The basic idea is that the object evaluated after with must have an __enter__() method and an __exit__() method. Once the expression following with has been evaluated, the returned object's __enter__() method is called, and its return value is bound to the variable after as. After the code block under with has fully executed, the object's __exit__() method is called.

An example for illustration:

```python
class Sample:
    def __enter__(self):
        print("In __enter__()")
        return "Foo"

    def __exit__(self, type, value, trace):
        print("In __exit__()")


def get_sample():
    return Sample()


with get_sample() as sample:
    print("sample:", sample)
```
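The same enter/exit protocol can also be written with the standard library's contextlib.contextmanager decorator; a sketch equivalent to the Sample class above:

```python
from contextlib import contextmanager

@contextmanager
def get_sample():
    print("In __enter__()")  # runs when the with block is entered
    yield "Foo"              # the yielded value is bound to the as-variable
    print("In __exit__()")   # runs when the with block is left

with get_sample() as sample:
    print("sample:", sample)
```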
Adding headers
```python
import requests

r = requests.get('https://www.zhihu.com/explore')
print(r.text)


# Output:
# <html>
# <head><title>400 Bad Request</title></head>
# <body bgcolor="white">
# <center><h1>400 Bad Request</h1></center>
# <hr><center>openresty</center>
# </body>
# </html>
```

Some sites require certain request headers to be passed; without them, the request will not succeed:

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/52.0.2743.116 Safari/537.36'
}
r = requests.get('https://www.zhihu.com/explore', headers=headers)
print(r.text)


# Output:
# <!DOCTYPE html>
# <html lang="zh-CN" dropEffect="none" class="no-js no-auth ">
# <head>
# <meta charset="utf-8" />
# ......
# <script type="text/zscript" znonce="d78db0c15fa84270ac967503884baf11"></script>
#
# <input type="hidden" name="_xsrf" value="cdb6166e0dc5f38afc3ee95053d7ef55"/>
# </body>
# </html>
```
-
POST requests
- This is another very common type of request
```python
import requests

data = {
    'name': 'LiYihua',
    'age': 21
}
r = requests.post('http://httpbin.org/post', data=data)
print(r.text)


# Output:
# {
#   "args": {},
#   "data": "",
#   "files": {},
#   "form": {
#     "age": "21",
#     "name": "LiYihua"
#   },
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Content-Length": "19",
#     "Content-Type": "application/x-www-form-urlencoded",
#     "Host": "httpbin.org",
#     "User-Agent": "python-requests/2.21.0"
#   },
#   "json": null,
#   "origin": "120.85.108.90, 120.85.108.90",
#   "url": "https://httpbin.org/post"
# }

# The POST request succeeded; the form part of the result holds the submitted data
```
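Besides form data via data=, requests can send a JSON body with the json= keyword. Preparing the request locally (no network) shows the Content-Type it sets and the serialized body; this is a sketch, separate from the httpbin example:

```python
import json
import requests

# json= serializes the dict with the json module and automatically
# sets the Content-Type header to application/json.
req = requests.Request('POST', 'http://httpbin.org/post',
                       json={'name': 'LiYihua', 'age': 21}).prepare()
print(req.headers['Content-Type'])  # application/json
print(json.loads(req.body))         # {'name': 'LiYihua', 'age': 21}
```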
-
Response
-
text and content give the content of the response
The status_code attribute gives the status code, the headers attribute the response headers, and the cookies attribute the Cookies
The url attribute gives the URL, and the history attribute the request history
```python
import requests

r = requests.get('https://www.cnblogs.com/liyihua/')

print(type(r.status_code), r.status_code,
      type(r.headers), r.headers,
      type(r.cookies), r.cookies,
      type(r.url), r.url,
      type(r.history), r.history,
      sep='\n\n')


# Output:
# <class 'int'>
#
# 200
#
# <class 'requests.structures.CaseInsensitiveDict'>
#
# {'Date': 'Thu, 20 Jun 2019 08:18:00 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Cache-Control': 'private, max-age=10', 'Expires': 'Thu, 20 Jun 2019 08:18:10 GMT', 'Last-Modified': 'Thu, 20 Jun 2019 08:18:00 GMT', 'X-UA-Compatible': 'IE=10', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Encoding': 'gzip'}
#
# <class 'requests.cookies.RequestsCookieJar'>
#
# <RequestsCookieJar[]>
#
# <class 'str'>
#
# https://www.cnblogs.com/liyihua/
#
# <class 'list'>
#
# []
```
The status code is commonly used to determine whether a request succeeded
```python
import requests

r = requests.get('http://www.baidu.com')
exit() if not r.status_code == requests.codes.ok else print('Request Successfully')


# Output:
# Request Successfully

# requests.codes.ok is the success status code 200
```
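requests.codes also carries names for the other status codes, and Response.raise_for_status() turns 4xx/5xx codes into exceptions. A sketch using a hand-built Response object, so no network is needed:

```python
import requests

print(requests.codes.ok)         # 200
print(requests.codes.not_found)  # 404

# Build a Response by hand purely for illustration.
resp = requests.models.Response()
resp.status_code = 404
try:
    resp.raise_for_status()      # raises HTTPError for 4xx/5xx, does nothing for 2xx
except requests.HTTPError:
    print('request failed:', resp.status_code)
```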
Return status codes and their corresponding lookup names
-
-
-
Advanced usage
-
File upload
```python
import requests

files = {
    'file': open('favicon.ico', 'rb')
}
r = requests.post('http://httpbin.org/post', files=files)
print(r.text)


# Output:
# {
#   "args": {},
#   "data": "",
#   "files": {
#     "file": "data:application/octetstream;base64,AAABAAIAEBAAAAEAIAAoBQAAJgAAACAgAAABACAAKBQAAE4FAAAoAAAAEAAAACAAAAABACAAAAAAAAAFAAA...
#   },
#   "form": {},
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Content-Length": "6665",
#     "Content-Type": "multipart/form-data; boundary=c1b665273fc73e67e57ac97e78f49110",
#     "Host": "httpbin.org",
#     "User-Agent": "python-requests/2.21.0"
#   },
#   "json": null,
#   "origin": "120.85.108.71, 120.85.108.71",
#   "url": "https://httpbin.org/post"
# }
```
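The value in the files dict can also be a (filename, file-object-or-bytes, content_type) tuple to control the uploaded filename and MIME type. Preparing the request locally shows the multipart body without sending it; a sketch with made-up icon bytes:

```python
import requests

files = {
    'file': ('favicon.ico', b'\x00\x00\x01\x00', 'image/x-icon')
}
req = requests.Request('POST', 'http://httpbin.org/post', files=files).prepare()
print(req.headers['Content-Type'])            # multipart/form-data; boundary=...
print(b'filename="favicon.ico"' in req.body)  # True
```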
-
Cookies
```python
import requests

headers = {
    'Cookie': 'tgw_l7_route=66cb16bc7......ECLNu3tQ',
    'Host': 'www.zhihu.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
}
r = requests.get('https://www.zhihu.com', headers=headers)
print(r.text)

# Output:
# <!doctype html>
# <html lang="zh" data-hairline="true" data-theme="light"><head><meta charSet="utf-8"/><title data-react-helmet="true">首页 - 知乎</title><meta name="viewport" ......
# This indicates the login succeeded


# To keep the logged-in state with Cookies: first log in to Zhihu, copy the Cookie from the request headers, set it in the headers here, and then send the request
```
```python
import requests

cookies = 'tgw_l7_route=66cb16bc7f45da64562a07.......ALNI_MbNds66nlodoTCxp8EVE6ECLNu3tQ'
jar = requests.cookies.RequestsCookieJar()

headers = {
    'Host': 'www.zhihu.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
}

for cookie in cookies.split(';'):
    key, value = cookie.split('=', 1)
    jar.set(key, value)

r = requests.get('https://www.zhihu.com', cookies=jar, headers=headers)
print(r.text)


# The output is the same as above
# Split the copied cookie string apart with split()
# Create a RequestsCookieJar object and use set() to set each Cookie's key and value
```
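The split/set loop can be verified offline: requests.utils can turn a RequestsCookieJar back into a plain dict. A sketch with dummy cookie values:

```python
import requests

jar = requests.cookies.RequestsCookieJar()
for cookie in 'a=1; b=2'.split(';'):
    key, value = cookie.strip().split('=', 1)
    jar.set(key, value)

# dict_from_cookiejar() maps each cookie's name to its value.
print(requests.utils.dict_from_cookiejar(jar))  # {'a': '1', 'b': '2'}
```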
-
Maintaining a session
-
The Session object makes it easy to maintain a session
```python
import requests

requests.get('http://httpbin.org/cookies/set/number/123456789')
r = requests.get('http://httpbin.org/cookies')
print(r.text)


# Output:
# {
#   "cookies": {}
# }
```

```python
import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)


# Output:
# {
#   "cookies": {
#     "number": "123456789"
#   }
# }
```
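That the Session really carries its cookies into later requests can be seen offline: cookies set on the session show up in the Cookie header of a prepared request. A sketch; nothing is sent:

```python
import requests

s = requests.Session()
s.cookies.set('number', '123456789')  # as if a previous response had set it

# prepare_request() merges the session's cookies into the request headers.
req = s.prepare_request(requests.Request('GET', 'http://httpbin.org/cookies'))
print(req.headers['Cookie'])  # number=123456789
```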
-
SSL certificate verification
```python
import requests

r = requests.get('https://www.12306.cn')
print(r.status_code)

# If nothing goes wrong, this prints: 200
# But when requesting an HTTPS site whose certificate fails verification, an error is raised.


# To avoid the error, the example can be modified slightly
import requests
from requests.packages import urllib3

urllib3.disable_warnings()
r = requests.get('https://www.12306.cn', verify=False)
print(r.status_code)
```
-
Proxy settings
```python
import requests

proxies = {
    'http': 'socks5://user:[email protected]:3128',
    'https': 'socks5://user:[email protected]:1080'
}

requests.get('https://www.taobao.com', proxies=proxies)


# Proxying over the SOCKS protocol (needs the optional dependency: pip install "requests[socks]")
```
-
Timeout settings
```python
import requests

# timeout=(connect timeout, read timeout); a single number applies to both
r = requests.get('https://taobao.com', timeout=(0.1, 1))
print(r.status_code)

# Output: 200
```
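When a timeout limit expires, requests raises a Timeout exception (ConnectTimeout or ReadTimeout depending on which limit was hit), so the call is usually wrapped in try/except. A sketch; the URL and limits are only illustrative:

```python
import requests

try:
    r = requests.get('https://taobao.com', timeout=(0.1, 1))
    print(r.status_code)
except requests.exceptions.Timeout:
    print('request timed out')
except requests.exceptions.RequestException as e:
    # any other failure (DNS error, connection refused, ...)
    print('request failed:', type(e).__name__)
```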
-
Authentication
```python
import requests
from requests.auth import HTTPBasicAuth

r = requests.get('http://localhost', auth=HTTPBasicAuth('liyihua', 'woshiyihua134'))
print(r.status_code)


# Output: 200


# OAuth1 authentication is also possible
import requests
from requests_oauthlib import OAuth1

url = 'https://api.twitter.com/1.1/account/verify_credentials.json'
auth = OAuth1('YOUR_APP_KEY', 'YOUR_APP_SECRET',
              'USER_OAUTH_TOKEN', 'USER_OAUTH_TOKEN_SECRET')
requests.get(url, auth=auth)
```
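auth=('user', 'pass') is shorthand for HTTPBasicAuth, and the Authorization header it produces can be inspected offline by preparing the request; the credentials here are dummies:

```python
import requests
from requests.auth import HTTPBasicAuth

req = requests.Request('GET', 'http://localhost',
                       auth=HTTPBasicAuth('liyihua', 'secret')).prepare()
print(req.headers['Authorization'])  # Basic bGl5aWh1YTpzZWNyZXQ=

# The plain tuple form produces exactly the same header.
req2 = requests.Request('GET', 'http://localhost',
                        auth=('liyihua', 'secret')).prepare()
print(req.headers['Authorization'] == req2.headers['Authorization'])  # True
```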
-
Prepared Request
To get a Prepared Request that carries the session's state, you need to use Session.prepare_request()
```python
from requests import Request, Session

url = 'http://httpbin.org/post'
data = {
    'name': 'LiYihua'
}  # parameters
header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
}  # pose as a browser
s = Session()  # maintain the session
req = Request('POST', url, data=data, headers=header)

prepped = s.prepare_request(req)  # Session's prepare_request() turns req into a PreparedRequest object
r = s.send(prepped)  # send() sends the request
print(r.text)


# Output:
# {
#   "args": {},
#   "data": "",
#   "files": {},
#   "form": {
#     "name": "LiYihua"
#   },
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Content-Length": "12",
#     "Content-Type": "application/x-www-form-urlencoded",
#     "Host": "httpbin.org",
#     "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"
#   },
#   "json": null,
#   "origin": "120.85.108.184, 120.85.108.184",
#   "url": "https://httpbin.org/post"
# }
```
-