Distributed crawling with Scrapy

Preparing Redis:
    1. Edit the Redis configuration file:
        - Comment out the line bind 127.0.0.1, so that other IPs can connect to Redis
        - Change yes to no: protected-mode no, so that other IPs can operate on Redis
    2. Start Redis:
        mac/linux: redis-server redis.conf
        windows: redis-server.exe redis.windows.conf
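
The two configuration changes above can be checked with a small script before starting the server. This is only a sketch: the function name check_redis_conf is made up for illustration, it assumes that protected-mode defaults to yes when the directive is absent, and it only looks at the two directives discussed here.

```python
def check_redis_conf(conf_text):
    """Return a list of settings that would block access from other hosts."""
    problems = []
    protected_mode = "yes"  # assumed default when the directive is missing
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and commented-out directives are ignored
        parts = line.split()
        directive, args = parts[0].lower(), parts[1:]
        if directive == "bind" and args and all(
            a in ("127.0.0.1", "::1", "-::1") for a in args
        ):
            problems.append("bind restricts connections to localhost")
        elif directive == "protected-mode" and args:
            protected_mode = args[0].lower()
    if protected_mode != "no":
        problems.append("protected-mode is not set to no")
    return problems
```

To use it, read the file and print the result, for example check_redis_conf(open("redis.conf").read()). An empty list means both changes are in place.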
Steps for building the distributed crawler:
    1. Edit the Redis configuration file as described above: set protected-mode no and comment out bind 127.0.0.1
    2. Install scrapy-redis:
    pip3 install scrapy-redis
    3. Create the project:
    scrapy startproject <project_name>
    4. Create a spider file based on CrawlSpider:
    cd <project_name>
    scrapy genspider -t crawl <spider_name> <domain>
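    5. Import the RedisCrawlSpider class:
    from scrapy_redis.spiders import RedisCrawlSpider
    6. Building on the generated code, make the spider inherit from RedisCrawlSpider, set redis_key, and implement link extraction and parsing:
    class RidesdemoSpider(RedisCrawlSpider):
        redis_key = "redisQueue"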
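    7. Wrap the parsed values in an item and submit the item to the pipeline provided by the scrapy-redis component. The project's own pipeline is no longer needed and can be deleted; the ready-made scrapy_redis.pipelines.RedisPipeline is used instead:
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 400,
    }
    8. The pipeline writes the data to the specified Redis database (the IP of the Redis host is set in the settings file):
    REDIS_HOST = '192.168.137.76'
    REDIS_PORT = 6379
    REDIS_ENCODING = 'utf-8'
    # REDIS_PARAMS = {'password': '123456'}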
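    9. Use the scheduler provided by scrapy-redis in the current project (configured in the settings file):
    # Use the scrapy-redis duplicate filter (request deduplication)
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # Use the scrapy-redis scheduler (the shared scheduler is the core of the distributed setup)
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # Whether to allow pausing: the queues are kept in Redis so a crawl can be paused and resumed
    SCHEDULER_PERSIST = True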
    10. Start the Redis server:
    windows: redis-server redis.windows.conf
    mac: redis-server redis.conf
    11. Start redis-cli:
    redis-cli
    12. Run the spider file:
    scrapy runspider <spider_file>.py
    13. Push a start URL onto the queue by running the following in redis-cli:
    lpush <value of redis_key> <start URL>
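
The last step can also be done from Python instead of redis-cli. The sketch below assumes the redis-py package is installed and reuses the host, port and redis_key from this tutorial; the helper name push_start_url is made up, and the start URL is an assumption (any page the spider should begin from will do).

```python
def push_start_url(client, redis_key, start_url):
    """LPUSH start_url onto the list named redis_key; returns the new list length."""
    return client.lpush(redis_key, start_url)


def main():
    # redis-py is imported here so the helper above works with any client object.
    import redis

    client = redis.Redis(host="192.168.137.76", port=6379)
    length = push_start_url(
        client,
        "redisQueue",                               # must equal the spider's redis_key
        "https://dig.chouti.com/all/hot/recent/1",  # assumed start page
    )
    print("queue length:", length)
```

Calling main() while the spiders are running and idle should wake them up, since they all wait on the same Redis list.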

spider.py

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider
from redisScrapyPro.items import RedisscrapyproItem

class RidesdemoSpider(RedisCrawlSpider):
    name = 'redisDemo'

    # Name of the scrapy_redis scheduler queue; a start URL is later pushed onto the queue with this name
    redis_key = "redisQueue"

    link = LinkExtractor(allow=r'https://dig.chouti.com/.*?/.*?/.*?/\d+')
    link1 = LinkExtractor(allow=r'https://dig.chouti.com/all/hot/recent/1')
    rules = (
        Rule(link, callback='parse_item', follow=True),
        Rule(link1, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        div_list = response.xpath('//*[@id="content-list"]/div')
        for div in div_list:
            content = div.xpath('string(./div[@class="news-content"]/div[1]/a[1])').extract_first().strip().replace("\t","")
            print(content)
            item = RedisscrapyproItem()
            item['content'] = content
            yield item
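
To see which links the two rules in the spider would follow, the allow patterns can be tried out with the standard re module. This is a rough stand-in, assuming the patterns are applied as regular-expression searches on absolute URLs; the helper name is_followed and the sample URLs in the usage note are made up.

```python
import re

# The same patterns as link and link1 in the spider above.
ALLOW_PATTERNS = [
    r'https://dig.chouti.com/.*?/.*?/.*?/\d+',
    r'https://dig.chouti.com/all/hot/recent/1',
]


def is_followed(url):
    """Return True if any allow pattern matches somewhere in the URL."""
    return any(re.search(pattern, url) for pattern in ALLOW_PATTERNS)
```

For example, is_followed("https://dig.chouti.com/all/hot/recent/2") is true, while a URL with fewer path segments such as "https://dig.chouti.com/link/123" is not matched. The first pattern already matches the page that link1 targets, so the second rule appears to be redundant.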

settings.py

BOT_NAME = 'redisScrapyPro'

SPIDER_MODULES = ['redisScrapyPro.spiders']
NEWSPIDER_MODULE = 'redisScrapyPro.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
REDIS_HOST = '192.168.137.76'
REDIS_PORT = 6379
REDIS_ENCODING = 'utf-8'
# REDIS_PARAMS = {'password': '123456'}


# Use the scrapy-redis duplicate filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scrapy-redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Whether to allow pausing (keep the queues in Redis between runs)
SCHEDULER_PERSIST = True
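
The idea behind the duplicate filter set by DUPEFILTER_CLASS can be illustrated with a simplified, in-memory stand-in. This is not the scrapy-redis implementation: the class name InMemoryDupeFilter is made up, the fingerprint here hashes only the method and URL, and the real component keeps its fingerprints in Redis so that every machine shares the same set.

```python
import hashlib


class InMemoryDupeFilter:
    """Toy duplicate filter: remembers a fingerprint for every request seen."""

    def __init__(self):
        self.seen = set()  # a shared Redis set would play this role across machines

    @staticmethod
    def fingerprint(method, url):
        # Simplified fingerprint: SHA-1 over the HTTP method and the URL.
        return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()

    def request_seen(self, method, url):
        """Return True if this request was seen before, otherwise record it."""
        fp = self.fingerprint(method, url)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False
```

The first call to request_seen for a given URL returns False and records it; repeating the call returns True, which is the signal a scheduler would use to drop the request.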
