Writing a Python Crawler from Scratch - 1.1 Installing and Using the requests Library

Basic usage of the requests library

import requests
r = requests.get("https://www.baidu.com")
print(r.text)

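requests.get() is one of the most commonly used functions in the requests library. It takes a url argument and returns an HTTP Response object. Besides get(), the library offers several other commonly used functions.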

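The 7 main methods of the requests library

Method	Description
requests.request()	Constructs a request; the base method underlying the six methods below
requests.get()	The main method for fetching an HTML page; corresponds to HTTP GET
requests.head()	Fetches the header information of an HTML page; corresponds to HTTP HEAD
requests.post()	Submits a POST request to an HTML page; corresponds to HTTP POST
requests.put()	Submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch()	Submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
requests.delete()	Submits a delete request to an HTML page; corresponds to HTTP DELETE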
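As a rough illustration of the table above, the sketch below builds one request per HTTP verb without sending anything over the network. It uses requests.Request and its prepare() method; the URL is a made-up placeholder, not one from this tutorial.

```python
import requests

# Hypothetical URL used only for illustration; nothing is sent over the network.
url = "https://example.com/resource"

# The six convenience functions differ mainly in the HTTP verb they pass on
# to requests.request(). Preparing a request lets us inspect that verb offline.
verbs = ["GET", "HEAD", "POST", "PUT", "PATCH", "DELETE"]
prepared = {verb: requests.Request(verb, url).prepare() for verb in verbs}

for verb, req in prepared.items():
    print(req.method, req.url)
```

In everyday code you would call requests.get(url), requests.post(url) and so on directly; requests.request("GET", url) is the equivalent general form.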

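requests.get() in detail

# get() accepts three parameters: url is required, params defaults to None,
# and **kwargs collects further optional keyword arguments.
# url: the URL of the page to fetch
# params: extra query parameters, as a dict or bytes (optional)
# **kwargs: twelve access-control parameters, for example custom headers
# The call simulates a GET request and returns a Response object.
def get(url, params=None, **kwargs):
    """Sends a GET request.

    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param **kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """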
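To make the role of params more concrete, here is a small sketch that prepares a GET request offline and inspects the resulting URL. The path /s and the parameter names wd and pn are assumptions chosen for illustration; they do not come from the text above.

```python
import requests

# Hypothetical search path and query parameters, used only for illustration.
query = {"wd": "python", "pn": "10"}

# prepare() encodes params into the query string without sending the request.
req = requests.Request("GET", "https://www.baidu.com/s", params=query).prepare()

print(req.url)
```

Calling requests.get("https://www.baidu.com/s", params=query) would send a request to the same URL.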

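**kwargs parameters

kwargs: parameters that control the request; all are optional
params : dict or byte sequence, appended to the URL as query parameters
data : dict, byte sequence, or file object, sent as the body of the Request
json : JSON-serializable data, sent as the body of the Request
headers : dict, custom HTTP headers
cookies : dict or CookieJar, cookies sent with the Request
auth : tuple, enables HTTP authentication
files : dict, for uploading files
timeout : timeout, in seconds
proxies : dict, proxy servers to route the request through; login credentials can be included
allow_redirects : True/False, default True; whether to follow redirects
stream : True/False, default False; when False, the response body is downloaded immediately
verify : True/False, default True; whether to verify the SSL certificate
cert : path to a local SSL certificate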
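The difference between data and json is easiest to see by inspecting a prepared request. The sketch below is a minimal illustration that sends nothing; the URL and payload are hypothetical.

```python
import json
import requests

# Hypothetical endpoint and payload, used only for illustration.
url = "https://example.com/api"
payload = {"user": "alice", "lang": "python"}

# data= with a dict is form-encoded into the request body.
form_req = requests.Request("POST", url, data=payload).prepare()

# json= serializes the same dict as JSON and sets a JSON Content-Type.
json_req = requests.Request("POST", url, json=payload).prepare()

print(form_req.headers["Content-Type"], form_req.body)
print(json_req.headers["Content-Type"], json.loads(json_req.body))
```

The other keyword arguments in the list (timeout, verify, allow_redirects and so on) affect how the request is sent rather than what it contains, so they do not show up on the prepared request.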

Examples:
1. Suppose we need to send a custom header with a GET request

import requests
hd = {'User-agent' : '123'}
r = requests.get('https://www.baidu.com',headers=hd)
print(r.request.headers)
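The same check can be done without touching the network. The sketch below, which assumes a throwaway Session and a placeholder URL, shows how a custom User-agent replaces the library default while the other default headers stay in place.

```python
import requests

hd = {"User-agent": "123"}

# A Session merges its default headers with the ones passed per request.
# prepare_request() performs that merge without sending anything.
session = requests.Session()
req = session.prepare_request(
    requests.Request("GET", "https://example.com", headers=hd)  # placeholder URL
)

# Header names are case-insensitive, so "User-Agent" finds our "User-agent".
print(req.headers["User-Agent"])
print(sorted(req.headers.keys()))
```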

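2. Suppose we want to route requests through custom proxies

import requests
# Placeholder proxy addresses and credentials; substitute your own
pxs = {'http': 'http://user:pass@10.10.10.1:1234',
    'https': 'https://10.10.10.1:4321'}
r = requests.get('https://www.baidu.com', proxies=pxs)
# A successful status code confirms the request went through the proxy
print(r.status_code)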

A closer look at the Response object

import requests
r = requests.get("http://www.baidu.com")

'''
Response(self)
The :class:Response <Response> object, which contains a server's response to an HTTP request.
'''
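# Status code of the HTTP response; for example, 200 means success and 404 means the page was not found
print(r.status_code)
# Headers of the HTTP response
print(r.headers)
# Encoding guessed from the response headers
print(r.encoding)
# Encoding inferred from the response content (slower)
print(r.apparent_encoding)
# Response body as raw bytes
print(r.content)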
'''
status_code:200 
headers:
{'Server': 'bfe/1.0.8.18', 'Date': 'Tue, 02 May 2017 12:01:47 GMT', 'Content-Type': 'text/html', 'La
st-Modified': 'Mon, 23 Jan 2017 13:28:27 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'Keep-A
live', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Pragma': 'no
-cache', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Content-Encoding':
'gzip'}
encoding: ISO-8859-1
apparent_encoding:utf-8
'''
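The sample output shows encoding as ISO-8859-1 while apparent_encoding is utf-8. The sketch below uses plain Python bytes (no requests call, and a sample string chosen for illustration) to show why that mismatch matters: decoding UTF-8 bytes as ISO-8859-1 yields garbled text.

```python
# Sample text chosen for illustration; any non-ASCII string behaves the same way.
text = "百度一下，你就知道"

# What the server sends: the text encoded as UTF-8 bytes (compare r.content).
raw = text.encode("utf-8")

# Decoding with the header-guessed encoding garbles the text ...
wrong = raw.decode("ISO-8859-1")

# ... while decoding with the content-detected encoding recovers it.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```

This is the reason for assigning r.apparent_encoding to r.encoding before reading r.text when the headers do not declare a charset.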

A general framework for fetching web pages with requests

import requests
def getHtmlText(url):
    try:
        r = requests.get(url,timeout=30)
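        # Raise an HTTPError if the status code indicates an error (4xx or 5xx)
        r.raise_for_status()
        # Use the encoding detected from the content
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "Something wrong!"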

print (getHtmlText("https://www.baidu.com"))
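To see what raise_for_status() does without making a real request, the sketch below builds a bare Response object by hand and sets only its status code. Constructing a Response manually is a trick for illustration, and status_is_error is a hypothetical helper name, not part of requests.

```python
import requests


def status_is_error(status_code):
    """Return True if raise_for_status() raises for this status code."""
    resp = requests.Response()      # empty Response, built by hand for the demo
    resp.status_code = status_code
    try:
        resp.raise_for_status()     # raises requests.HTTPError for 4xx and 5xx
    except requests.HTTPError:
        return True
    return False


for code in (200, 302, 404, 500):
    print(code, status_is_error(code))
```

Because HTTPError is a subclass of requests.RequestException, a single except clause for RequestException also covers timeouts and connection failures.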
