from scrapy_redis.dupefilter import RFPDupeFilter


class CustomFilter(RFPDupeFilter):

    def request_seen(self, request):
        """Returns True if request was already seen.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        bool

        """
        # Never treat the anti-robot page as a duplicate.
        if 'https://segmentfault.com/stop-robot' in request.url:
            return False
        fp = self.request_fingerprint(request)
        # This returns the number of values added, zero if already exists.
        added = self.server.sadd(self.key, fp)
        return added == 0
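Here I wrote a custom dupefilter that inherits from the one in scrapy-redis. I need it because requests to one URL, https://segmentfault.com/stop-robot, must never be filtered out as duplicates.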
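To check the behaviour without a running Redis server, here is a minimal sketch of the same logic. FakeRedis, FakeRequest, and SketchFilter are hypothetical stand-ins made up for this illustration (they are not part of scrapy-redis), and the SHA-1 of the URL is only an assumed placeholder for the real request fingerprint.

```python
import hashlib

ALLOWED_URL = 'https://segmentfault.com/stop-robot'


class FakeRedis:
    """Hypothetical in-memory stand-in for a Redis client; only sadd is emulated."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, value):
        # Like Redis SADD: return 1 if the value was added, 0 if it already existed.
        members = self._sets.setdefault(key, set())
        if value in members:
            return 0
        members.add(value)
        return 1


class FakeRequest:
    """Hypothetical stand-in for scrapy.http.Request; only .url is used."""

    def __init__(self, url):
        self.url = url


class SketchFilter:
    """Same decision logic as CustomFilter, without the scrapy-redis base class."""

    def __init__(self, server, key):
        self.server = server
        self.key = key

    def request_fingerprint(self, request):
        # Assumption: a plain hash of the URL, not the real Scrapy fingerprint.
        return hashlib.sha1(request.url.encode('utf-8')).hexdigest()

    def request_seen(self, request):
        if ALLOWED_URL in request.url:
            return False  # never recorded, so never reported as seen
        fp = self.request_fingerprint(request)
        added = self.server.sadd(self.key, fp)
        return added == 0


dupefilter = SketchFilter(FakeRedis(), 'sketch:dupefilter')
for url in ['https://example.com/a', 'https://example.com/a', ALLOWED_URL, ALLOWED_URL]:
    print(url, dupefilter.request_seen(FakeRequest(url)))
```

An ordinary URL is reported as seen on its second visit, while the stop-robot URL is let through every time because its fingerprint is never stored.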
settings.py
DUPEFILTER_CLASS = 'tutorial.CustomFilter.CustomFilter'
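Note that my project is named tutorial, so the path above points to the CustomFilter class in the CustomFilter module of the tutorial package.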
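The setting value is a dotted path of the form package.module.ClassName. As a rough sketch of how such a path can be turned into a class (load_by_path is a hypothetical helper written for this illustration, not Scrapy's own loader, and a standard-library path is used so the snippet runs anywhere):

```python
from importlib import import_module


def load_by_path(path):
    """Resolve 'package.module.Name' to the object called Name in that module."""
    module_path, _, name = path.rpartition('.')
    module = import_module(module_path)
    return getattr(module, name)


# 'tutorial.CustomFilter.CustomFilter' would split the same way:
# module 'tutorial.CustomFilter', attribute 'CustomFilter'.
cls = load_by_path('collections.OrderedDict')
print(cls.__name__)
```

If the project or the module file has a different name, the first two parts of the path need to change to match.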