-
Using requests: Cookies, login authentication, proxy settings, and more
When handling page authentication and Cookies with urllib, you have to write Opener and Handler objects. To make these operations more convenient, there is the more powerful requests library.
-
-
Example: basic use of the requests library
```python
import requests

r = requests.get('http://www.baidu.com/')
print(type(r), r.status_code, r.text, r.cookies, sep='\n\n')


# Output:
# <class 'requests.models.Response'>
#
# 200
#
# <!DOCTYPE html>
# <!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible
# ......
# <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
#
# <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
```
-
GET requests
- A GET request returns the corresponding response information
- requests.get(url, params, **kwargs)
- url is the URL of the page to fetch; params holds extra URL parameters (a dict or byte stream); **kwargs covers 12 keyword arguments that control access
```python
import requests

r = requests.get('http://httpbin.org/get')
print(r.text)


# Output:
# {
#   "args": {},
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Host": "httpbin.org",
#     "User-Agent": "python-requests/2.21.0"
#   },
#   "origin": "120.85.108.192, 120.85.108.192",
#   "url": "https://httpbin.org/get"
# }

# The result contains the request headers, URL, IP address, and other information
```
```python
import requests

data = {
    'name': 'LiYihua',
    'age': '21'
}
r = requests.get('http://httpbin.org/get', params=data)
print(r.text)


# Output:
# {
#   "args": {
#     "age": "21",
#     "name": "LiYihua"
#   },
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Host": "httpbin.org",
#     "User-Agent": "python-requests/2.21.0"
#   },
#   "origin": "120.85.108.92, 120.85.108.92",
#   "url": "https://httpbin.org/get?name=LiYihua&age=21"
# }
```
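How params is serialized into the query string can also be checked without any network round-trip by preparing the request locally; a sketch (nothing here is actually sent):

```python
import requests

data = {
    'name': 'LiYihua',
    'age': '21'
}
# .prepare() serializes the params dict into the query string
# without sending anything over the network.
req = requests.Request('GET', 'http://httpbin.org/get', params=data).prepare()
print(req.url)
# http://httpbin.org/get?name=LiYihua&age=21
```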
```python
import requests

r = requests.get('http://httpbin.org/get')
print(type(r.text), r.json(), type(r.json()), sep='\n\n')


# Output:
# <class 'str'>
#
# {'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.21.0'}, 'origin': '120.85.108.92, 120.85.108.92', 'url': 'https://httpbin.org/get'}
#
# <class 'dict'>

# The json() method parses a JSON-formatted response body into a dict
```
Fetching binary data
```python
import requests

r = requests.get('https://github.com/favicon.ico')
print(r.text, r.content, sep='\n\n')

# response.content returns bytes.
# To fetch an image or another file, use r.content

# response.text returns a (Unicode) str.
# To fetch text, use r.text

# Output:
# :�������OL��......
#
# b'\x00\x00\x01\x00\x02\x00\x10\x10\x00\x00\x0......
```
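The content/text distinction is just Python's bytes/str distinction: content holds the raw bytes exactly as received, and text is those bytes decoded to a string. A minimal offline illustration (no request involved, just hand-picked bytes):

```python
raw = b'\xe4\xbd\xa0\xe5\xa5\xbd'  # raw bytes, the kind of value r.content holds
text = raw.decode('utf-8')         # decoded str, the kind of value r.text gives
print(type(raw))   # <class 'bytes'>
print(type(text))  # <class 'str'>
print(text)        # 你好
```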
Saving the fetched image
```python
import requests

r = requests.get('https://github.com/favicon.ico')
with open('favicon.ico', 'wb') as f:
    f.write(r.content)

# After it runs, this produces an icon file named favicon.ico
```
The open() function and the with...as statement used in the previous example
```
# The open() function
# def open(file, mode='r', buffering=None, encoding=None, errors=None, newline=None, closefd=True)
# Common parameters:
#   file      - the file to open
#   mode      - the mode in which to open the file: read-only, write, append, etc.
#   buffering - 0 disables buffering; 1 buffers line by line when accessing
#               the file; an integer greater than 1 gives the size of the
#               buffer; a negative value uses the system default buffer size

# The mode parameter
========= ===============================================================
Character Meaning
--------- ---------------------------------------------------------------
'r'       open for reading (default)
'w'       open for writing, truncating the file first
'x'       create a new file and open it for writing
'a'       open for writing, appending to the end if the file exists
'b'       binary mode
't'       text mode (default)
'+'       open a disk file for updating (reading and writing)
'U'       universal newlines mode (deprecated)
========= ===============================================================
```

The with...as statement

Some tasks need setup work beforehand and cleanup work afterwards. For scenarios like this, Python's with statement offers a very convenient way of handling things. The basic idea is that the object evaluated after with must have an __enter__() method and an __exit__() method. Once the expression following with has been evaluated, the returned object's __enter__() method is called, and its return value is bound to the variable after as. After the code block under with has fully executed, the object's __exit__() method is called.

An example for illustration:

```python
class Sample:
    def __enter__(self):
        print("In __enter__()")
        return "Foo"

    def __exit__(self, type, value, trace):
        print("In __exit__()")


def get_sample():
    return Sample()


with get_sample() as sample:
    print("sample:", sample)
```
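The same enter/exit protocol can also be written with the standard library's contextlib.contextmanager decorator; a sketch equivalent to the Sample class above:

```python
from contextlib import contextmanager

@contextmanager
def get_sample():
    print("In __enter__()")  # runs when the with block is entered
    yield "Foo"              # the yielded value is bound to the as-variable
    print("In __exit__()")   # runs when the with block is left

with get_sample() as sample:
    print("sample:", sample)
```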
Adding headers
```python
import requests

r = requests.get('https://www.zhihu.com/explore')
print(r.text)


# Output:
# <html>
# <head><title>400 Bad Request</title></head>
# <body bgcolor="white">
# <center><h1>400 Bad Request</h1></center>
# <hr><center>openresty</center>
# </body>
# </html>
```

Some sites require certain request headers to be passed; without them, the request will not succeed:

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/52.0.2743.116 Safari/537.36'
}
r = requests.get('https://www.zhihu.com/explore', headers=headers)
print(r.text)


# Output:
# <!DOCTYPE html>
# <html lang="zh-CN" dropEffect="none" class="no-js no-auth ">
# <head>
# <meta charset="utf-8" />
# ......
# <script type="text/zscript" znonce="d78db0c15fa84270ac967503884baf11"></script>
#
# <input type="hidden" name="_xsrf" value="cdb6166e0dc5f38afc3ee95053d7ef55"/>
# </body>
# </html>
```
-
POST requests
- This is another very common type of request
```python
import requests

data = {
    'name': 'LiYihua',
    'age': 21
}
r = requests.post('http://httpbin.org/post', data=data)
print(r.text)


# Output:
# {
#   "args": {},
#   "data": "",
#   "files": {},
#   "form": {
#     "age": "21",
#     "name": "LiYihua"
#   },
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Content-Length": "19",
#     "Content-Type": "application/x-www-form-urlencoded",
#     "Host": "httpbin.org",
#     "User-Agent": "python-requests/2.21.0"
#   },
#   "json": null,
#   "origin": "120.85.108.90, 120.85.108.90",
#   "url": "https://httpbin.org/post"
# }

# The POST request succeeded; the form part of the result holds the submitted data
```
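Besides form data via data=, requests can send a JSON body with the json= keyword. Preparing the request locally (no network) shows the Content-Type it sets and the serialized body; this is a sketch, separate from the httpbin example:

```python
import json
import requests

# json= serializes the dict with the json module and automatically
# sets the Content-Type header to application/json.
req = requests.Request('POST', 'http://httpbin.org/post',
                       json={'name': 'LiYihua', 'age': 21}).prepare()
print(req.headers['Content-Type'])  # application/json
print(json.loads(req.body))         # {'name': 'LiYihua', 'age': 21}
```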
-
Response
-
text and content give the content of the response
The status_code attribute gives the status code, the headers attribute the response headers, and the cookies attribute the Cookies
The url attribute gives the URL, and the history attribute the request history
```python
import requests

r = requests.get('https://www.cnblogs.com/liyihua/')

print(type(r.status_code), r.status_code,
      type(r.headers), r.headers,
      type(r.cookies), r.cookies,
      type(r.url), r.url,
      type(r.history), r.history,
      sep='\n\n')


# Output:
# <class 'int'>
#
# 200
#
# <class 'requests.structures.CaseInsensitiveDict'>
#
# {'Date': 'Thu, 20 Jun 2019 08:18:00 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Cache-Control': 'private, max-age=10', 'Expires': 'Thu, 20 Jun 2019 08:18:10 GMT', 'Last-Modified': 'Thu, 20 Jun 2019 08:18:00 GMT', 'X-UA-Compatible': 'IE=10', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Encoding': 'gzip'}
#
# <class 'requests.cookies.RequestsCookieJar'>
#
# <RequestsCookieJar[]>
#
# <class 'str'>
#
# https://www.cnblogs.com/liyihua/
#
# <class 'list'>
#
# []
```
The status code is commonly used to determine whether a request succeeded
```python
import requests

r = requests.get('http://www.baidu.com')
exit() if not r.status_code == requests.codes.ok else print('Request Successfully')


# Output:
# Request Successfully

# requests.codes.ok is the success status code 200
```
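requests.codes also carries names for the other status codes, and Response.raise_for_status() turns 4xx/5xx codes into exceptions. A sketch using a hand-built Response object, so no network is needed:

```python
import requests

print(requests.codes.ok)         # 200
print(requests.codes.not_found)  # 404

# Build a Response by hand purely for illustration.
resp = requests.models.Response()
resp.status_code = 404
try:
    resp.raise_for_status()      # raises HTTPError for 4xx/5xx, does nothing for 2xx
except requests.HTTPError:
    print('request failed:', resp.status_code)
```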
Return status codes and their corresponding lookup names
-
-
-
Advanced usage
-
File upload
```python
import requests

files = {
    'file': open('favicon.ico', 'rb')
}
r = requests.post('http://httpbin.org/post', files=files)
print(r.text)


# Output:
# {
#   "args": {},
#   "data": "",
#   "files": {
#     "file": "data:application/octetstream;base64,AAABAAIAEBAAAAEAIAAoBQAAJgAAACAgAAABACAAKBQAAE4FAAAoAAAAEAAAACAAAAABACAAAAAAAAAFAAA...
#   },
#   "form": {},
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Content-Length": "6665",
#     "Content-Type": "multipart/form-data; boundary=c1b665273fc73e67e57ac97e78f49110",
#     "Host": "httpbin.org",
#     "User-Agent": "python-requests/2.21.0"
#   },
#   "json": null,
#   "origin": "120.85.108.71, 120.85.108.71",
#   "url": "https://httpbin.org/post"
# }
```
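The value in the files dict can also be a (filename, file-object-or-bytes, content_type) tuple to control the uploaded filename and MIME type. Preparing the request locally shows the multipart body without sending it; a sketch with made-up icon bytes:

```python
import requests

files = {
    'file': ('favicon.ico', b'\x00\x00\x01\x00', 'image/x-icon')
}
req = requests.Request('POST', 'http://httpbin.org/post', files=files).prepare()
print(req.headers['Content-Type'])            # multipart/form-data; boundary=...
print(b'filename="favicon.ico"' in req.body)  # True
```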
-
Cookies
```python
import requests

headers = {
    'Cookie': 'tgw_l7_route=66cb16bc7......ECLNu3tQ',
    'Host': 'www.zhihu.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
}
r = requests.get('https://www.zhihu.com', headers=headers)
print(r.text)

# Output:
# <!doctype html>
# <html lang="zh" data-hairline="true" data-theme="light"><head><meta charSet="utf-8"/><title data-react-helmet="true">首页 - 知乎</title><meta name="viewport" ......
# This indicates the login succeeded


# To keep the logged-in state with Cookies: first log in to Zhihu, copy the Cookie from the request headers, set it in the headers here, and then send the request
```
```python
import requests

cookies = 'tgw_l7_route=66cb16bc7f45da64562a07.......ALNI_MbNds66nlodoTCxp8EVE6ECLNu3tQ'
jar = requests.cookies.RequestsCookieJar()

headers = {
    'Host': 'www.zhihu.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
}

for cookie in cookies.split(';'):
    key, value = cookie.split('=', 1)
    jar.set(key, value)

r = requests.get('https://www.zhihu.com', cookies=jar, headers=headers)
print(r.text)


# The output is the same as above
# Split the copied cookie string apart with split()
# Create a RequestsCookieJar object and use set() to set each Cookie's key and value
```
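The split/set loop can be verified offline: requests.utils can turn a RequestsCookieJar back into a plain dict. A sketch with dummy cookie values:

```python
import requests

jar = requests.cookies.RequestsCookieJar()
for cookie in 'a=1; b=2'.split(';'):
    key, value = cookie.strip().split('=', 1)
    jar.set(key, value)

# dict_from_cookiejar() maps each cookie's name to its value.
print(requests.utils.dict_from_cookiejar(jar))  # {'a': '1', 'b': '2'}
```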
-
Maintaining a session
-
The Session object makes it easy to maintain a session
```python
import requests

requests.get('http://httpbin.org/cookies/set/number/123456789')
r = requests.get('http://httpbin.org/cookies')
print(r.text)


# Output:
# {
#   "cookies": {}
# }
```

```python
import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)


# Output:
# {
#   "cookies": {
#     "number": "123456789"
#   }
# }
```
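That the Session really carries its cookies into later requests can be seen offline: cookies set on the session show up in the Cookie header of a prepared request. A sketch; nothing is sent:

```python
import requests

s = requests.Session()
s.cookies.set('number', '123456789')  # as if a previous response had set it

# prepare_request() merges the session's cookies into the request headers.
req = s.prepare_request(requests.Request('GET', 'http://httpbin.org/cookies'))
print(req.headers['Cookie'])  # number=123456789
```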
-
SSL certificate verification
```python
import requests

r = requests.get('https://www.12306.cn')
print(r.status_code)

# If nothing goes wrong, this prints: 200
# But when requesting an HTTPS site whose certificate fails verification, an error is raised.


# To avoid the error, the example can be modified slightly
import requests
from requests.packages import urllib3

urllib3.disable_warnings()
r = requests.get('https://www.12306.cn', verify=False)
print(r.status_code)
```
-
Proxy settings
```python
import requests

proxies = {
    'http': 'socks5://user:[email protected]:3128',
    'https': 'socks5://user:[email protected]:1080'
}

requests.get('https://www.taobao.com', proxies=proxies)


# Proxying over the SOCKS protocol (needs the optional dependency: pip install "requests[socks]")
```
-
Timeout settings
```python
import requests

# timeout=(connect timeout, read timeout); a single number applies to both
r = requests.get('https://taobao.com', timeout=(0.1, 1))
print(r.status_code)

# Output: 200
```
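When a timeout limit expires, requests raises a Timeout exception (ConnectTimeout or ReadTimeout depending on which limit was hit), so the call is usually wrapped in try/except. A sketch; the URL and limits are only illustrative:

```python
import requests

try:
    r = requests.get('https://taobao.com', timeout=(0.1, 1))
    print(r.status_code)
except requests.exceptions.Timeout:
    print('request timed out')
except requests.exceptions.RequestException as e:
    # any other failure (DNS error, connection refused, ...)
    print('request failed:', type(e).__name__)
```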
-
Authentication
```python
import requests
from requests.auth import HTTPBasicAuth

r = requests.get('http://localhost', auth=HTTPBasicAuth('liyihua', 'woshiyihua134'))
print(r.status_code)


# Output: 200


# OAuth1 authentication is also possible
import requests
from requests_oauthlib import OAuth1

url = 'https://api.twitter.com/1.1/account/verify_credentials.json'
auth = OAuth1('YOUR_APP_KEY', 'YOUR_APP_SECRET',
              'USER_OAUTH_TOKEN', 'USER_OAUTH_TOKEN_SECRET')
requests.get(url, auth=auth)
```
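auth=('user', 'pass') is shorthand for HTTPBasicAuth, and the Authorization header it produces can be inspected offline by preparing the request; the credentials here are dummies:

```python
import requests
from requests.auth import HTTPBasicAuth

req = requests.Request('GET', 'http://localhost',
                       auth=HTTPBasicAuth('liyihua', 'secret')).prepare()
print(req.headers['Authorization'])  # Basic bGl5aWh1YTpzZWNyZXQ=

# The plain tuple form produces exactly the same header.
req2 = requests.Request('GET', 'http://localhost',
                        auth=('liyihua', 'secret')).prepare()
print(req.headers['Authorization'] == req2.headers['Authorization'])  # True
```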
-
Prepared Request
To get a Prepared Request that carries the session's state, you need to use Session.prepare_request()
```python
from requests import Request, Session

url = 'http://httpbin.org/post'
data = {
    'name': 'LiYihua'
}  # parameters
header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
}  # pose as a browser
s = Session()  # maintain the session
req = Request('POST', url, data=data, headers=header)

prepped = s.prepare_request(req)  # Session's prepare_request() turns req into a PreparedRequest object
r = s.send(prepped)  # send() sends the request
print(r.text)


# Output:
# {
#   "args": {},
#   "data": "",
#   "files": {},
#   "form": {
#     "name": "LiYihua"
#   },
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Content-Length": "12",
#     "Content-Type": "application/x-www-form-urlencoded",
#     "Host": "httpbin.org",
#     "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"
#   },
#   "json": null,
#   "origin": "120.85.108.184, 120.85.108.184",
#   "url": "https://httpbin.org/post"
# }
```
-