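requests is a more powerful library built to make crawling more convenient. With it, cookies, login authentication, proxy settings and similar tasks are no trouble at all.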
I. Install the requests module (run in a cmd window)
pip3 install requests
II. Basic usage of requests
import requests

response = requests.get("https://www.baidu.com/")
print(type(response))        # <class 'requests.models.Response'>, the response type
print(response.status_code)  # 200, the status code
print(response.text)         # page source as a string
print(response.content)      # page source as raw bytes
print(response.cookies)      # cookies set by the site, a RequestsCookieJar
print(response.headers)      # response headers
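III. A recommended test site: http://httpbin.org is built for testing requests, so you can experiment with it freely (other request methods)
import requests

r = requests.post("http://httpbin.org/post")
print(r.text)  # print what httpbin echoes back for the POST request
r = requests.put("http://httpbin.org/put")
r = requests.delete("http://httpbin.org/delete")
r = requests.options("http://httpbin.org/get")
Here post(), put(), delete() and options() send POST, PUT, DELETE and OPTIONS requests respectively. Each method is sent to the httpbin endpoint that accepts it.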
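IV. GET requests
Inspect the information contained in a GET request
import requests

r = requests.get("http://httpbin.org/get")
print(r.text)  # print the GET request information echoed by httpbin
Output:
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.20.1"
  },
  "origin": "119.123.196.143",
  "url": "http://httpbin.org/get"
}
What the output tells us: the echoed request information includes the request headers, the client IP address (origin), the URL, and any query arguments (args).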
(1) Adding extra information to a request
Method 1: append ?key=value&key2=value2... to the URL (? marks the start of the query string, & separates the pairs)
r = requests.get("http://httpbin.org/get?name=germey&age=22")
import requests

r = requests.get("http://httpbin.org/get?name=germey&age=22")
print(r.text)
{
  "args": {
    "age": "22",
    "name": "germey"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.20.1"
  },
  "origin": "119.123.196.143",
  "url": "http://httpbin.org/get?name=germey&age=22"
}
The output shows that httpbin parsed the query string into args, and that the requested URL was http://httpbin.org/get?name=germey&age=22
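To make the ?key=value&key2=value2 format concrete, here is a small sketch that takes the same URL apart using only Python's standard library. It only illustrates how a query string breaks down into key-value pairs; it is not how httpbin is implemented, and the variable names are just for illustration.

```python
from urllib.parse import urlsplit, parse_qsl

url = "http://httpbin.org/get?name=germey&age=22"

# Everything after "?" is the query string
query = urlsplit(url).query

# Pairs are separated by "&"; each key is joined to its value by "="
args = dict(parse_qsl(query))

print(query)  # name=germey&age=22
print(args)   # {'name': 'germey', 'age': '22'}
```

Note that every value comes back as a string, which matches the "age": "22" seen in the args field of the httpbin output.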
Method 2: pass a dict through the params argument of get(), which encodes it into the URL for you (recommended)
import requests

data = {
    "name": "germey",
    "age": 22
}
r = requests.get("http://httpbin.org/get", params=data)
print(r.text)
Output:
{
  "args": {
    "age": "22",
    "name": "germey"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.20.1"
  },
  "origin": "119.123.196.143",
  "url": "http://httpbin.org/get?name=germey&age=22"
}
Both methods end up requesting http://httpbin.org/get?name=germey&age=22, but Method 2 is more practical because you never have to assemble the query string by hand.
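As a rough picture of what the params argument saves you from doing by hand, the sketch below builds the same URL with the standard library's urlencode. This is an approximation for illustration, assuming a simple flat dict; it is not requests' own implementation.

```python
from urllib.parse import urlencode

base_url = "http://httpbin.org/get"
data = {"name": "germey", "age": 22}

# urlencode turns the dict into "key=value" pairs joined by "&"
full_url = base_url + "?" + urlencode(data)

print(full_url)  # http://httpbin.org/get?name=germey&age=22
```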
(2) The response body returned by httpbin is a JSON-formatted string; call .json() to convert it into a dict.
If the body is not valid JSON, .json() raises a JSONDecodeError (json.decoder.JSONDecodeError).
import requests

r = requests.get("http://httpbin.org/get")
print(type(r.text))    # type of the response body text
print(r.text)          # print the response body
# r.json() parses the JSON string into a dict
print(type(r.json()))  # type after conversion
<class 'str'>
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.20.1"
  },
  "origin": "119.123.196.143",
  "url": "http://httpbin.org/get"
}
<class 'dict'>
The output shows that r.text is a str, while the value returned by r.json() is a dict.
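The str-to-dict conversion can be tried without any network access. The sketch below uses the standard library's json module on a shortened sample body, and shows one way to guard against a body that is not JSON. The helper name parse_or_none is made up for this example.

```python
import json

# A shortened sample of the kind of body httpbin returns
body = '{"args": {}, "url": "http://httpbin.org/get"}'

parsed = json.loads(body)
print(type(body))     # <class 'str'>
print(type(parsed))   # <class 'dict'>
print(parsed["url"])  # http://httpbin.org/get


def parse_or_none(text):
    """Return the parsed JSON value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


print(parse_or_none("<html>not json</html>"))  # None
```

The same try/except pattern can be wrapped around r.json() when a site may answer with HTML instead of JSON.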
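V. Examples of scraping web pages with GET requests
(1) Successfully fetching a Zhihu page
# Request Zhihu
import requests

# Build the search query parameters
data = {
    "type": "content",
    "q": "赵丽颖"
}
url = "https://www.zhihu.com/search"
# Build the request headers; User-Agent identifies the client as a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
}
response = requests.get(url, params=data, headers=headers)
print(response.text)
Here we add a headers dict containing the User-Agent field, which is the browser identification string. Without it, Zhihu refuses to serve the page. The data dict supplies the search query parameters.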
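The search term above contains Chinese characters, which cannot appear in a URL as-is. The sketch below uses the standard library to show the kind of percent-encoding such a query string receives. It is an illustration of the encoding step, assuming UTF-8, and not a trace of what requests or Zhihu actually do.

```python
from urllib.parse import urlencode, parse_qsl

data = {"type": "content", "q": "赵丽颖"}

# Non-ASCII characters are encoded as UTF-8 bytes, then percent-escaped
query = urlencode(data)
print(query)

# Decoding the query string recovers the original values
decoded = dict(parse_qsl(query))
print(decoded == data)  # True
```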
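(2) Downloading the GitHub site icon
import requests

r = requests.get("https://github.com/favicon.ico")
print(r.text)     # the icon decoded as text, mostly unreadable characters
print(r.content)  # the icon as raw bytes
# Open the file in binary mode and write the raw bytes
with open("github.ico", "wb") as f:
    f.write(r.content)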
(3) Request header information (headers)
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.20.1"
  },
  "origin": "119.123.196.143",
  "url": "https://httpbin.org/get"
}
(4) Fetching the GitHub icon
r.text returns the data as a string (str)
r.content returns the data as bytes
import requests

r = requests.get("https://github.com/favicon.ico")
print('text', r.text)        # the body as a string
print('content', r.content)  # the body as binary data
import requests

r = requests.get("https://github.com/favicon.ico")
print(r.text)
print(r.content)
# Save the binary data to a file
with open("gg.ico", "wb") as f:
    f.write(r.content)
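The difference between str and bytes is the reason the icon is saved from r.content and not r.text. The sketch below shows the idea offline with made-up sample data: the bytes used here are the PNG file signature, chosen only because they are not valid UTF-8, and are not the actual GitHub icon.

```python
import os
import tempfile

# Text data: bytes that decode cleanly into a string
text_bytes = "hello, 你好".encode("utf-8")
print(text_bytes.decode("utf-8"))  # hello, 你好

# Binary data: these sample bytes are not valid UTF-8 text
binary_bytes = b"\x89PNG\r\n\x1a\n"
try:
    binary_bytes.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(decodable)  # False

# Writing in "wb" mode stores the bytes unchanged
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as f:
    f.write(binary_bytes)
with open(path, "rb") as f:
    restored = f.read()
print(restored == binary_bytes)  # True
```

Images, audio and other non-text files should therefore be written from the bytes form, using a file opened in binary mode.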
VI. POST requests
Sending a request with form data (the data argument)
import requests

data = {
    'name': 'pig',
    'age': 18
}
r = requests.post('http://httpbin.org/post', data=data)
print(r.text)
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "age": "18",
    "name": "pig"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Content-Length": "15",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.20.1"
  },
  "json": null,
  "origin": "119.123.198.80",
  "url": "http://httpbin.org/post"
}
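The Content-Type of application/x-www-form-urlencoded and the Content-Length of 15 in the output above can be reproduced by hand. The sketch below encodes the same dict with the standard library; it approximates the request body for a simple flat dict and is not requests' own code.

```python
from urllib.parse import urlencode

data = {"name": "pig", "age": 18}

# Form data is sent in the request body as key=value pairs joined by "&"
body = urlencode(data)

print(body)                       # name=pig&age=18
print(len(body.encode("utf-8")))  # 15
```

The 15 bytes match the Content-Length header that httpbin reported, which suggests the form fields travel in the body and not in the URL.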
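VII. Response status codes
1. 1xx status codes: informational
2. 2xx status codes: success
3. 3xx status codes: redirection
4. 4xx status codes: client error
5. 5xx status codes: server error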
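The five classes are determined by the first digit of the code, so a value such as response.status_code can be classified with integer division. The helper below is a minimal sketch; the function name status_category is made up for this example.

```python
def status_category(code):
    """Map an HTTP status code to its class based on the first digit."""
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return categories.get(code // 100, "unknown")


print(status_category(200))  # success
print(status_category(404))  # client error
print(status_category(503))  # server error
```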