Summary of Common Anti-Crawler Strategies
Request header inspection
User-Agent detection
Solution: build a list of User-Agent strings and randomly inject one into the headers on every request.
import random

agent = ['Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)']
agents = random.sample(agent, 1)[0]
# Note: random.sample() returns a list containing a single element, so we take [0].
# random.choice(agent) does the same thing more directly.
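Putting the two pieces together, rotating the User-Agent is just a matter of building a fresh headers dict per request. A minimal sketch (the two UA strings below are a shortened stand-in for the full list above):

```python
import random

# Shortened stand-in for the full User-Agent list above.
agent = [
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
]

def build_headers():
    # Pick a random User-Agent each time so consecutive requests
    # do not share a fixed browser fingerprint.
    return {'User-Agent': random.choice(agent)}

headers = build_headers()
```

Calling `build_headers()` once per request, rather than reusing one dict, is what actually varies the fingerprint over time.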
Referer detection
In developer tools, open the Network tab, find the target request you need to send, and look under Headers → Request Headers: there is a Referer field, which identifies the source of the request.
[Referer: this field tells the server which page the request was sent from; the server can use it for things such as traffic-source statistics and hotlink protection. Cui Qingcai's personal site explains request methods, request headers, and the request process in detail, including the meaning of each request-header field: https://cuiqingcai.com/5465.html]
When scraping Lagou, if the Referer is not the search-results page, the request almost always fails.
Note: the job searched for here is "python数据分析" (python data analysis). "数据分析" consists of Chinese characters, so the quote and urlencode helpers in the urllib.parse module are needed to build the Referer: a URL must not contain non-ASCII characters, which are considered unsafe, hence the encoding.
Example:
from urllib.parse import urlencode
from urllib.parse import quote

url_search = "https://www.lagou.com/jobs/list_" + quote('python数据分析') + "?"
para = {
    'xl': '本科', 'px': 'default', 'yx': '2k-5k',
    'gx': '实习', 'city': '北京', 'district': '朝阳区', 'isSchoolJob': '1'}
url_search = url_search + urlencode(para)
Usage of and differences between quote and urlencode in Python:
https://blog.csdn.net/zjz155/article/details/88060427
urllib.parse.urlencode()
Argument: a dict
Return value: a string
Function: turns each key: value pair into key=<percent-encoded value>, joined with &
urllib.parse.quote()
Argument: a str, e.g. Chinese characters
Return value: the percent-encoded value
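The difference is easy to see on the keyword used above:

```python
from urllib.parse import urlencode, quote

# quote() percent-encodes a single string.
print(quote('python数据分析'))
# → python%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90

# urlencode() takes a dict and produces a key=value query string.
print(urlencode({'kd': 'python数据分析', 'pn': 1}))
# → kd=python%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&pn=1
```

So quote() is used for the Chinese keyword embedded in the URL path, while urlencode() builds the query string after the "?".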
Cookie detection
Take Lagou as an example: the site checks the cookies in the requests it receives. Looking at the XHR request (also next to the Referer in developer tools, labeled XMLHttpRequest), you can see that its cookies include those of the search-results page. So to mimic browser behavior, you first need to obtain all cookies from the search-results page and add them to the request.
There are usually two ways to do this:
① Copy all the Cookies from the target request's headers straight into the code. This is convenient, but hard to maintain if the cookies change later.
② Use a session:
s = requests.Session()
s.get(url_search, headers=headers, timeout=5)
# Setting timeout is necessary; otherwise, if the server never responds,
# we would hang here forever. You can also write timeout=(3, 7),
# meaning 3 s to connect and 7 s to read the response.
cookie = s.cookies
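For option ①, the raw Cookie header copied from developer tools is a single "k1=v1; k2=v2" string. A small helper can turn it into a dict that requests accepts via its cookies parameter (the cookie string below is made up for illustration):

```python
def parse_cookie_header(raw: str) -> dict:
    # Split 'k1=v1; k2=v2' into {'k1': 'v1', 'k2': 'v2'}.
    cookies = {}
    for pair in raw.split(';'):
        if '=' in pair:
            k, v = pair.strip().split('=', 1)
            cookies[k] = v
    return cookies

# Made-up example of what a copied Cookie header might look like:
raw = 'JSESSIONID=ABAAAECA; user_trace_token=2020-token; lg_uid=123'
print(parse_cookie_header(raw))
# → {'JSESSIONID': 'ABAAAECA', 'user_trace_token': '2020-token', 'lg_uid': '123'}
```

The resulting dict could then be passed as `cookies=parse_cookie_header(raw)` to requests.get()/post(), though as noted it breaks silently once the site rotates its cookies.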
IP detection
The principle: for devices that send requests too frequently, the site records their IP address and device fingerprint (though I'm still not sure where to see the device fingerprint or what it looks like) and blocks those IPs.
The solution is to build your own IP proxy pool, but I haven't learned how to do that yet.
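As a rough idea of what a proxy pool could look like: keep a list of proxy addresses and pick one at random per request via the proxies parameter of requests. The addresses below are placeholders, not real proxies; a real pool would hold validated proxies from a provider.

```python
import random

# Placeholder proxy addresses (not real); a working pool would be
# populated from a paid provider or a validated free-proxy list.
proxy_pool = [
    'http://10.0.0.1:8080',
    'http://10.0.0.2:8080',
    'http://10.0.0.3:8080',
]

def get_random_proxy() -> dict:
    # requests expects a dict mapping scheme to proxy URL.
    p = random.choice(proxy_pool)
    return {'http': p, 'https': p}

proxies = get_random_proxy()
# Usage (not run here):
# requests.get(url, headers=headers, proxies=proxies, timeout=5)
```

A production pool would also need to test proxies periodically and drop the dead ones, which is where most of the real work lies.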
Complete code and workflow
# -*- coding: utf-8 -*-
"""
Created on Thu Oct 15 15:17:49 2020
method: POST
type: XHR
@author: djx
"""
import random
import time
from urllib.parse import urlencode
from urllib.parse import quote
import requests
import pymongo

def getpage(url_final: str, page: int):
    # Build the search-results URL, which also serves as the Referer.
    url_search = "https://www.lagou.com/jobs/list_" + quote('python数据分析') + "?"
    para = {
        'xl': '本科', 'px': 'default', 'yx': '2k-5k', 'gx': '实习',
        'city': '北京', 'district': '朝阳区', 'isSchoolJob': '1'}
    url_search = url_search + urlencode(para)
    agent = ['Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)']
    agents = random.sample(agent, 1)[0]
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Host': 'www.lagou.com',
        'User-Agent': agents,
        'Referer': url_search,  # must be the search-results page
        'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7'}
    # Fetch the search-results page first so the session picks up its cookies.
    # timeout is necessary; otherwise an unresponsive server hangs us forever.
    s = requests.Session()
    s.get(url_search, headers=headers, timeout=5)
    results = []
    j = 1
    while j <= page:
        print(j)
        # 'pn' is the page number and must change on each iteration.
        payload = {'first': 'true' if j == 1 else 'false',
                   'pn': j, 'kd': 'python数据分析'}
        try:
            response = requests.post(url_final, data=payload, headers=headers,
                                     cookies=s.cookies, timeout=5)
            if response.status_code == 200:
                results.append(response.json())
                j += 1
                time.sleep(1)  # brief pause between pages
            else:
                print("bad status: " + str(response.status_code))
                break
        except Exception as e:
            print("error" + str(e))
            break
    return results

def main():
    url_final = "https://www.lagou.com/jobs/positionAjax.json?"
    para = {
        'xl': '本科', 'px': 'default', 'yx': '2k-5k', 'gx': '实习',
        'city': '北京', 'district': '朝阳区',
        'needAddtionalResult': 'false', 'isSchoolJob': '1'}
    url_final = url_final + urlencode(para)  # final URL of the JSON (XHR) request
    client = pymongo.MongoClient(host='localhost', port=27017)
    db = client['JobInfo']
    collection = db.SimpleInfo
    for content in getpage(url_final, 5):
        result = content["content"]["positionResult"]["result"]
        for i in result:
            collection.insert_one(i)

main()