In the previous post we implemented data storage: saving scraped data to MongoDB, MySQL, and local files. This post covers distributed crawling.
What we have so far is a single-machine crawler, i.e. it only runs on one machine. Now imagine running the same crawler on several machines at once, all writing into the same database: the crawl speed improves dramatically.
With the scrapy-redis extension, turning the project into a distributed crawler only requires appropriate configuration in settings.py.
Let's first look at how scrapy-redis is used. The official documentation describes it as follows:
Use the following settings in your project:
# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# python versions.
# Caveat: In python 3.x, the serializer must return strings keys and support
# bytes as values. Because of this reason the json or msgpack module will not
# work by default. In python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

# Don't cleanup redis queues, allows to pause/resume crawls.
#SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Store scraped item in redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}

# The item pipeline serializes and stores the items in this redis key.
#REDIS_ITEMS_KEY = '%(spider)s:items'

# The items serializer is by default ScrapyJSONEncoder. You can use any
# importable path to a callable object.
#REDIS_ITEMS_SERIALIZER = 'json.dumps'

# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379

# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
#REDIS_URL = 'redis://user:pass@hostname:9001'

# Custom redis client parameters (i.e.: socket timeout, etc.)
#REDIS_PARAMS = {}
# Use custom redis client class.
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

# If True, it uses redis' ``spop`` operation. This could be useful if you
# want to avoid duplicates in your start urls list. In this cases, urls must
# be added via ``sadd`` command or you will get a type error from redis.
#REDIS_START_URLS_AS_SET = False

# Default start urls key for RedisSpider and RedisCrawlSpider.
#REDIS_START_URLS_KEY = '%(name)s:start_urls'

# Use other encoding than utf-8 for redis.
#REDIS_ENCODING = 'latin1'
Core settings:

SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # replace the default scheduler
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # replace the default duplicates filter
Request queue:

# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'  # priority queue

# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'  # first in, first out
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'  # last in, first out (a stack)
Redis connection settings:

#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379
#REDIS_PASSWORD = xxx
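In a distributed setup, every machine's settings.py must point at the same Redis server. A hedged sketch, assuming Redis runs at 192.168.1.100 and requires a password (host, port, and password here are placeholders):

# All nodes connect to the same Redis instance.
REDIS_HOST = '192.168.1.100'
REDIS_PORT = 6379
# The password can be supplied through the client parameters...
REDIS_PARAMS = {'password': 'your_password'}
# ...or through the full connection URL, which takes precedence over host/port:
# REDIS_URL = 'redis://:your_password@192.168.1.100:6379/0'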
Queue and fingerprint handling:

# Don't cleanup redis queues, allows to pause/resume crawls.
#SCHEDULER_PERSIST = True  # keep the queue and fingerprints instead of clearing them on close; by default nothing is kept
Re-crawling configuration:

SCHEDULER_FLUSH_ON_START = True  # if enabled, the Redis queue and dupefilter fingerprints are flushed when the spider starts, so every run crawls from scratch
Pipeline configuration:

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300  # if enabled, scraped items are stored in Redis
}
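Items pushed by RedisPipeline are serialized to JSON and stored in a Redis list (by default '<spidername>:items', see REDIS_ITEMS_KEY above), so a separate process can pull them out and write them to MongoDB, MySQL, or anywhere else. A minimal consumer sketch, assuming a spider named 'example' and Redis running locally:

import json
import redis

r = redis.Redis(host='localhost', port=6379)

while True:
    # blpop blocks until an item appears on the list, then removes and returns it.
    _key, raw = r.blpop('example:items')
    item = json.loads(raw)
    print(item)  # replace with the MongoDB/MySQL storage code from the previous post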
There are of course many other settings; choose them according to your actual needs.
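One change that usually goes along with these settings is on the spider side: instead of hard-coding start_urls, the spider can inherit from scrapy-redis's RedisSpider and read its start URLs from a Redis list, so new URLs can be fed to all nodes at once. A minimal sketch; the spider name, redis_key, and parsing logic are placeholders:

from scrapy_redis.spiders import RedisSpider


class ExampleSpider(RedisSpider):
    name = 'example'
    # The Redis list the spider waits on for its start URLs
    # (by default '<name>:start_urls', see REDIS_START_URLS_KEY above).
    redis_key = 'example:start_urls'

    def parse(self, response):
        # Replace with real extraction logic.
        yield {'url': response.url, 'title': response.css('title::text').get()}

A crawl is then kicked off by pushing a URL into that list, for example with redis-cli: lpush example:start_urls https://example.com — every idle node picks up work from the shared queue.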
Modifying settings.py
……