Common Libraries for Web Scraping

Commonly used libraries:
- urllib — ships with Python 3, but several common tasks are awkward: page authentication, cookie handling, and request-header management all take extra boilerplate.
- urllib3 — requires installation: `$ pip3 install urllib3`
  Homepage: https://pypi.org/project/urllib3/
- requests — requires installation: `$ pip3 install requests`
  Homepage: http://2.python-requests.org/en/master/
- In practice, therefore, development is usually done with urllib3 or requests, and Python's built-in urllib is rarely the first choice.
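A quick illustration of the first point: keeping cookies across requests with the built-in urllib means assembling a CookieJar and a custom opener by hand, whereas requests does the same with a single Session object. A minimal stdlib-only sketch (no request is actually sent):

```python
from http import cookiejar
from urllib import request

# urllib needs an explicit CookieJar plus a handler chain just to persist cookies
jar = cookiejar.CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(jar))

# opener.open(url) would now send and store cookies automatically;
# with requests, requests.Session() gives the same behaviour in one line.
print(len(jar))  # → 0, nothing is stored until a request is made
```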
Using urllib (GET)
```python
from urllib import request, error
import re
import ssl

# Skip TLS certificate verification (acceptable for a demo, not for production)
context = ssl._create_unverified_context()

url = 'http://bj.58.com/job/?key=python&final=1&jump=1'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
req = request.Request(url, headers=headers)

try:
    res = request.urlopen(req, context=context)
    html = res.read().decode('utf-8')
    print(len(html))
    # Raw string avoids the invalid '\|' escape warning
    pat = r'<span class="address">(.*?)</span> \| <span class="name">(.*?)</span>'
    dlist = re.findall(pat, html)
    for v in dlist:
        print(v[0] + ' : ' + v[1])
except error.HTTPError as e:   # HTTP error status; must be caught before URLError
    print("HTTPError")
    print(e.reason)
    print(e.code)
except error.URLError as e:    # network-level failure (DNS, refused connection, ...)
    print("URLError")
    print(e.reason)
```
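The query string above is pasted into the URL by hand; urllib can also build it from a dict with parse.urlencode, which mirrors the params= style used with requests later on. A small sketch with the same 58.com parameters:

```python
from urllib import parse

params = {'key': 'python', 'final': 1, 'jump': 1}
base = 'http://bj.58.com/job/'
# urlencode converts values to strings and percent-escapes them as needed
url = base + '?' + parse.urlencode(params)
print(url)  # → http://bj.58.com/job/?key=python&final=1&jump=1
```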
Using urllib (POST)
```python
from urllib import request, parse
import json
import ssl

context = ssl._create_unverified_context()

url = 'https://fanyi.baidu.com/sug'
data = parse.urlencode({'kw': 'python'})
headers = {
    'Content-Length': str(len(data)),  # header values must be strings
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
    'Content-Type': "application/x-www-form-urlencoded; charset=UTF-8"
}
req = request.Request(url, data=bytes(data, encoding="utf-8"), headers=headers)
res = request.urlopen(req, context=context)
str_json = res.read().decode('utf-8')
myjson = json.loads(str_json)
print(myjson['data'][0]['v'])
```
A simple urllib3 example
```python
import urllib3
import re

url = 'http://www.baidu.com'
http = urllib3.PoolManager()
res = http.request('GET', url)
print('status: %d' % res.status)
data = res.data.decode("utf-8")
print(re.findall("<title>(.*?)</title>", data))
```
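PoolManager can also carry pool-wide defaults such as timeouts and a retry policy, and it urlencodes GET parameters passed via fields=. A sketch with illustrative values (the request itself is wrapped so the snippet still runs without network access):

```python
import urllib3

# Pool-wide defaults: connect/read timeouts and up to 3 automatic retries
http = urllib3.PoolManager(
    timeout=urllib3.Timeout(connect=2.0, read=5.0),
    retries=urllib3.Retry(total=3),
)

try:
    # fields= is urlencoded into the query string for GET requests
    res = http.request('GET', 'http://www.baidu.com/s', fields={'wd': 'python'})
    print('status: %d' % res.status)
except urllib3.exceptions.HTTPError as e:  # base class of urllib3's errors
    print('request failed:', e)
```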
A simple requests example
```python
import requests
import re

url = 'http://www.baidu.com'
res = requests.get(url)
print('status: %d' % res.status_code)
data = res.content.decode("utf-8")
print(re.findall("<title>(.*?)</title>", data))
```
A fuller requests example (GET, with parameters passed separately)
```python
import requests
import re

url = 'http://bj.58.com/job/'
data = {'key': 'python', 'final': 1, 'jump': 1}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

try:
    # requests builds and escapes the query string from the params dict
    res = requests.get(url, headers=headers, params=data)
    html = res.content.decode("utf-8")
    print(len(html))
    pat = r'<span class="address">(.*?)</span> \| <span class="name">(.*?)</span>'
    dlist = re.findall(pat, html)
    for v in dlist:
        print(v[0] + ' : ' + v[1])
except requests.RequestException as e:  # covers connection errors, timeouts, etc.
    print("RequestException")
    print(e)
```
A fuller requests example (POST)
```python
import requests
import json

def fanyi(kw):
    url = 'https://fanyi.baidu.com/sug'
    data = {'kw': kw}
    # requests sets Content-Length and Content-Type for form data itself
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
    }
    res = requests.post(url, data=data, headers=headers)
    str_json = res.content.decode('utf-8')
    myjson = json.loads(str_json)
    print(myjson['data'][0]['v'])

if __name__ == '__main__':
    while True:
        kw = input('Enter a word to translate (q to quit): ')
        if kw == 'q':
            break
        fanyi(kw)
```
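Two refinements worth knowing for the function above: requests can parse a JSON body directly with res.json() (making the json module unnecessary), and timeout= keeps the call from hanging forever. A hedged sketch of the same translation call; network access and the response shape are assumptions:

```python
import requests

def fanyi(kw):
    try:
        res = requests.post('https://fanyi.baidu.com/sug',
                            data={'kw': kw}, timeout=5)
        res.raise_for_status()             # turn HTTP 4xx/5xx into exceptions
        return res.json()['data'][0]['v']  # res.json() replaces json.loads(...)
    except requests.RequestException as e:
        return 'request failed: %s' % e
```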