In the previous post we implemented data storage: saving scraped data to MongoDB, MySQL, and local files. This post covers distributed crawling.
What we have so far is a single-machine crawler, i.e. it only runs on one machine. Now imagine running the same crawler on several machines at once, all writing into the same database: the crawl speed improves dramatically.
With the scrapy-redis extension, turning the project into a distributed crawler only requires appropriate configuration in settings.py.
Let's first look at how scrapy-redis is used. The official documentation describes it as follows:
Use the following settings in your project:
# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# python versions.
# Caveat: In python 3.x, the serializer must return strings keys and support
# bytes as values. Because of this reason the json or msgpack module will not
# work by default. In python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

# Don't cleanup redis queues, allows to pause/resume crawls.
#SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Store scraped item in redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}

# The item pipeline serializes and stores the items in this redis key.
#REDIS_ITEMS_KEY = '%(spider)s:items'

# The items serializer is by default ScrapyJSONEncoder. You can use any
# importable path to a callable object.
#REDIS_ITEMS_SERIALIZER = 'json.dumps'

# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379

# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
#REDIS_URL = 'redis://user:pass@hostname:9001'

# Custom redis client parameters (i.e.: socket timeout, etc.)
#REDIS_PARAMS = {}
# Use custom redis client class.
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

# If True, it uses redis' ``spop`` operation. This could be useful if you
# want to avoid duplicates in your start urls list. In this cases, urls must
# be added via ``sadd`` command or you will get a type error from redis.
#REDIS_START_URLS_AS_SET = False

# Default start urls key for RedisSpider and RedisCrawlSpider.
#REDIS_START_URLS_KEY = '%(name)s:start_urls'

# Use other encoding than utf-8 for redis.
#REDIS_ENCODING = 'latin1'
Core settings:

SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # replace the default scheduler
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # replace the default duplicates filter
Request queue:

# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'  # priority queue

# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'  # first in, first out
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'  # last in, first out (a stack)
Redis connection settings:

#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379
#REDIS_PASSWORD = xxx
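In a distributed setup, every machine's settings.py must point at the same Redis server. A hedged sketch, assuming Redis runs at 192.168.1.100 and requires a password (host, port, and password here are placeholders):

# All nodes connect to the same Redis instance.
REDIS_HOST = '192.168.1.100'
REDIS_PORT = 6379
# The password can be supplied through the client parameters...
REDIS_PARAMS = {'password': 'your_password'}
# ...or through the full connection URL, which takes precedence over host/port:
# REDIS_URL = 'redis://:your_password@192.168.1.100:6379/0'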
Queue and fingerprint handling:

# Don't cleanup redis queues, allows to pause/resume crawls.
#SCHEDULER_PERSIST = True  # keep the queue and fingerprints instead of clearing them on close; by default nothing is kept
Re-crawling configuration:

SCHEDULER_FLUSH_ON_START = True  # if enabled, the Redis queue and dupefilter fingerprints are flushed when the spider starts, so every run crawls from scratch
Pipeline configuration:

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300  # if enabled, scraped items are stored in Redis
}
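Items pushed by RedisPipeline are serialized to JSON and stored in a Redis list (by default '<spidername>:items', see REDIS_ITEMS_KEY above), so a separate process can pull them out and write them to MongoDB, MySQL, or anywhere else. A minimal consumer sketch, assuming a spider named 'example' and Redis running locally:

import json
import redis

r = redis.Redis(host='localhost', port=6379)

while True:
    # blpop blocks until an item appears on the list, then removes and returns it.
    _key, raw = r.blpop('example:items')
    item = json.loads(raw)
    print(item)  # replace with the MongoDB/MySQL storage code from the previous post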
There are of course many other settings; choose them according to your actual needs.
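One change that usually goes along with these settings is on the spider side: instead of hard-coding start_urls, the spider can inherit from scrapy-redis's RedisSpider and read its start URLs from a Redis list, so new URLs can be fed to all nodes at once. A minimal sketch; the spider name, redis_key, and parsing logic are placeholders:

from scrapy_redis.spiders import RedisSpider


class ExampleSpider(RedisSpider):
    name = 'example'
    # The Redis list the spider waits on for its start URLs
    # (by default '<name>:start_urls', see REDIS_START_URLS_KEY above).
    redis_key = 'example:start_urls'

    def parse(self, response):
        # Replace with real extraction logic.
        yield {'url': response.url, 'title': response.css('title::text').get()}

A crawl is then kicked off by pushing a URL into that list, for example with redis-cli: lpush example:start_urls https://example.com — every idle node picks up work from the shared queue.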
Modifying settings.py
……