Web Scraping with the urllib Library

URL

URL: Uniform Resource Locator

scheme://host:port/path/?query-string=xx#anchor

Chinese (and other non-ASCII) characters in a URL are percent-encoded.
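
To make the encoding rule concrete, here is a minimal sketch using `quote` and `unquote` from `urllib.parse`. The sample strings are only illustrations.

```python
from urllib.parse import quote, unquote

# Non-ASCII characters are converted to UTF-8 bytes, and each byte is written as %XX.
encoded = quote('小天')
print(encoded)

# unquote reverses the transformation.
decoded = unquote('%E5%8C%97%E4%BA%AC')
print(decoded)
```

The second string is the same `city` value that appears in the lagou.com URL later in this post.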

Types of requests

GET
POST

Request Header

User-Agent
Referer
Cookie

urllib

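urlopen
- Sends a request; the related functions live in urllib.request
- Parameters: url and data (if data is passed, the request becomes a POST)
- Returns an http.client.HTTPResponse, which provides read, readline, and getcode
- ```
from urllib import request

resp = request.urlopen('https://www.bilibili.com/')
print(resp.read(10))    # first 10 bytes of the body
print(resp.readline())  # the next line, as bytes
print(resp.getcode())   # HTTP status code
```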
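
The rule that passing data turns the request into a POST can be checked without touching the network. This is a minimal sketch that only builds `Request` objects, which is what `urlopen` does internally when given a URL string. The `/post` path on httpbin.org is an assumption chosen for illustration.

```python
from urllib import request

# Assumed endpoint, used only to build request objects; nothing is sent.
url = 'http://httpbin.org/post'

get_req = request.Request(url)                      # no data: GET
post_req = request.Request(url, data=b'kd=python')  # data given: POST

print(get_req.get_method())
print(post_req.get_method())
```
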
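urlretrieve
- Saves a file from the web to local disk
- Parameters: url, filename (the local path)
- ```
from urllib import request

# 'src' stands for the file's URL, 'path' for the local destination
request.urlretrieve('src', 'path')
```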
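
As a runnable sketch of the same call, the example below uses a local `file://` URL in place of a real web address, so it works offline. The file names and the temporary directory are assumptions for illustration only.

```python
import tempfile
from pathlib import Path
from urllib import request

with tempfile.TemporaryDirectory() as tmp:
    # Create a small local file to stand in for a remote resource.
    src = Path(tmp) / 'source.txt'
    src.write_bytes(b'hello urllib')

    # urlretrieve(url, filename) fetches url and writes the body to filename.
    dest = Path(tmp) / 'copy.txt'
    saved_path, headers = request.urlretrieve(src.as_uri(), str(dest))

    copied = dest.read_bytes()
    print(saved_path, copied)
```

`urlretrieve` returns a tuple of the local file name and the response headers, so the saved path can be reused directly.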

urlencode

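Percent-encodes a dictionary of query parameters, including any Chinese characters.
Parameter: params, a dictionary

from urllib import parse

params = {
    'q':'小天',
}
# Encode the dictionary into a query string
res = parse.urlencode(params)
url = 'https://www.google.com/search?' + res
print(url)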

parse_qs
- The inverse of urlencode: decodes a query string back into a dictionary
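
A short round trip shows how the two functions relate. This is a sketch only, and the `page` key is an assumption added to show a second parameter.

```python
from urllib import parse

# Encode a dictionary into a query string.
query = parse.urlencode({'q': '小天', 'page': 1})
print(query)

# parse_qs decodes it again; every value comes back as a list of strings.
decoded = parse.parse_qs(query)
print(decoded)
```

Note that the round trip is not exact: the integer `1` returns as the string `'1'`, and each value is wrapped in a list because a key may appear more than once in a query string.
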
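urlparse | urlsplit
- The two are nearly identical; both break a URL into its components
- The result of urlparse has a params attribute; the result of urlsplit does not
- ```
from urllib import parse

url = 'http://www.baidu.com/?wd=python&username=tian'
res = parse.urlparse(url)
print(res)  # ParseResult with scheme, netloc, path, params, query, fragment
```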
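
The difference between the two functions only shows up when the URL contains a `;params` segment. The sketch below uses a made-up URL (an assumption, not from the original post) to make that visible.

```python
from urllib import parse

# Hypothetical URL with a ';user' params segment on the last path component.
url = 'http://www.baidu.com/index.html;user?wd=python#top'

parsed = parse.urlparse(url)
split = parse.urlsplit(url)

print(parsed.path, parsed.params)  # urlparse separates params from the path
print(split.path)                  # urlsplit leaves ';user' inside the path
print(parsed.query, parsed.fragment)
```
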

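urllib.request.Request

Build the request with request.Request, passing arguments such as url and headers.
Then pass the built request to urlopen() to send it.

from urllib import request, parse

url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false'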
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
    'Cookie':'cookie'
}
data = {
    'first': 'true',
    'pn': '1',
    'kd': 'python'
}

req = request.Request(url, headers=headers, data=parse.urlencode(data).encode('utf8'), method='POST')
resp = request.urlopen(req)
print(resp.read().decode('utf8'))
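
Before sending a request like the one above, it can help to inspect what was actually built. The following sketch constructs a `Request` and prints its parts without any network access. The URL and header values are placeholders chosen for illustration.

```python
from urllib import parse, request

# Placeholder values for illustration only; nothing is sent over the network.
url = 'http://httpbin.org/post'
headers = {'User-Agent': 'Mozilla/5.0', 'Referer': 'http://httpbin.org/'}
form = {'first': 'true', 'pn': '1', 'kd': 'python'}

# The body must be bytes, so the urlencoded string is encoded first.
body = parse.urlencode(form).encode('utf8')
req = request.Request(url, headers=headers, data=body, method='POST')

print(req.get_method())
print(req.data)
# Request stores header names capitalized, e.g. 'User-agent'.
print(req.header_items())
```

Passing a plain `str` as data makes urlopen raise a TypeError, which is why the `.encode('utf8')` step matters.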

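Proxies with ProxyHandler
- How it works: instead of contacting the target site directly, the client sends the request to a proxy server; the proxy fetches the target URL and relays the server's response back, so the access is indirect
- urllib.request.ProxyHandler() creates a handler object; it takes a dictionary that maps each scheme to a proxy address, e.g. {'scheme': 'ip:port'}
- Pass the handler to request.build_opener to create an opener object
- Call the opener's open method to send the request
- ```
from urllib import request

handler = request.ProxyHandler({'http':'115.75.5.17:38351'})
opener = request.build_opener(handler)

url = 'http://httpbin.org/ip'
req = request.Request(url)
resp = opener.open(req)
print(resp.read().decode('utf8'))  # httpbin reports the IP it saw
```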
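
Free proxy addresses tend to stop working quickly, so it can be useful to check the opener's wiring before making a real request. This is a minimal offline sketch; the proxy address is a placeholder assumption, not a working proxy.

```python
from urllib import request

# Placeholder proxy address (an assumption); replace it with a proxy you control.
proxies = {'http': '127.0.0.1:8080'}

handler = request.ProxyHandler(proxies)
opener = request.build_opener(handler)

# The handler keeps the mapping it was given and is registered on the opener.
print(handler.proxies)
print(handler in opener.handlers)
```

If every later call to `request.urlopen()` should go through the proxy, `request.install_opener(opener)` makes this opener the default.
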
Cookie

A brief introduction to Cookie, Session, and Token

全能技术进阶之路
