Test environment
# Environment 1
Python 3.6.5
Scrapy==1.5.0

# Environment 2
Python 2.7.5
Scrapy==1.1.2
I. Running a spider from the command line
1. Write the spider file baidu.py
# -*- coding: utf-8 -*-
from scrapy import Spider


class BaiduSpider(Spider):
    name = 'baidu'
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        self.log("run baidu")
2. Run the spider (two ways)
# Run the spider from inside a project
$ scrapy crawl baidu

# Run a spider without creating a project
$ scrapy runspider baidu.py
II. Running a spider from a Python file
1. Run the spider with cmdline
# -*- coding: utf-8 -*-
from scrapy import cmdline, Spider


class BaiduSpider(Spider):
    name = 'baidu'
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        self.log("run baidu")


if __name__ == '__main__':
    # cmdline.execute() runs the command and then exits the process
    cmdline.execute("scrapy crawl baidu".split())
2. Run the spider with CrawlerProcess
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class BaiduSpider(Spider):
    name = 'baidu'
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        self.log("run baidu")


if __name__ == '__main__':
    # get_project_settings() loads the project's settings
    process = CrawlerProcess(get_project_settings())
    process.crawl(BaiduSpider)
    process.start()
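As a side note, when such a script runs outside a Scrapy project, get_project_settings() only returns the defaults. CrawlerProcess also accepts a plain settings dict, so a standalone variant might look like this (a minimal sketch; the LOG_LEVEL value is just an example):

# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.crawler import CrawlerProcess


class BaiduSpider(Spider):
    name = 'baidu'
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        self.log("run baidu")


if __name__ == '__main__':
    # Pass a settings dict directly instead of reading a project's settings
    process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
    process.crawl(BaiduSpider)
    process.start()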
3. Run the spider with CrawlerRunner
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor


class BaiduSpider(Spider):
    name = 'baidu'
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        self.log("run baidu")


if __name__ == '__main__':
    # CrawlerRunner does not set up logging by itself; without this call
    # nothing is printed to the console
    configure_logging({'LOG_FORMAT': '%(message)s'})
    runner = CrawlerRunner()
    d = runner.crawl(BaiduSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()
III. Running multiple spiders from a Python file
Create a second spider in the project, SinaSpider:
# -*- coding: utf-8 -*-
from scrapy import Spider


class SinaSpider(Spider):
    name = 'sina'
    start_urls = ['https://www.sina.com.cn/']

    def parse(self, response):
        self.log("run sina")
1. cmdline cannot run multiple spiders
If you put the two statements together, the program exits as soon as the first one finishes; the second is never reached:
# -*- coding: utf-8 -*-
from scrapy import cmdline

cmdline.execute("scrapy crawl baidu".split())
# Never reached: the line above exits the process
cmdline.execute("scrapy crawl sina".split())
I remember writing a script before that used cmdline to run multiple spiders.
Article: Python爬虫:scrapy定时运行的脚本
But with the two methods below as replacements, it is more elegant.
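For completeness, one workaround in the same spirit (a sketch under my own assumptions, not the script from that article) is to launch each crawl as a separate scrapy process, so the process exit triggered inside the scrapy command never touches our own script:

# -*- coding: utf-8 -*-
import subprocess

# Run the spiders one after another, each in its own child process
for name in ('baidu', 'sina'):
    subprocess.call(['scrapy', 'crawl', name])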
2. Run multiple spiders with CrawlerProcess
Note: the spider files in the project are:
scrapy_demo/spiders/baidu.py
scrapy_demo/spiders/sina.py
# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerProcess

from scrapy_demo.spiders.baidu import BaiduSpider
from scrapy_demo.spiders.sina import SinaSpider

process = CrawlerProcess()
process.crawl(BaiduSpider)
process.crawl(SinaSpider)
process.start()
Running it this way, the log shows the middleware stack being initialized only once, and the two spiders send their requests at almost the same time. The spiders do not run independently and may interfere with each other.
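Incidentally, when the project's settings are passed in, process.crawl() also accepts a spider's name instead of the class and looks it up through the project's spider loader; a sketch assuming the scrapy_demo project layout above:

# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# With project settings available, spiders can be referenced by name
process = CrawlerProcess(get_project_settings())
process.crawl('baidu')
process.crawl('sina')
process.start()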
3. Run multiple spiders with CrawlerRunner
# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

from scrapy_demo.spiders.baidu import BaiduSpider
from scrapy_demo.spiders.sina import SinaSpider

configure_logging()
runner = CrawlerRunner()
runner.crawl(BaiduSpider)
runner.crawl(SinaSpider)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()
This approach also initializes logging only once. Note that, as written, the two crawls still run concurrently: runner.join() simply waits for both to finish. The official documentation recommends this internal API for running multiple spiders in the same process, and it also shows how to chain the deferreds when you want the spiders to run strictly one after another, as sketched below.
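If you do want them to run one after another, the official documentation chains the deferreds with inlineCallbacks; a minimal sketch reusing the two spiders above:

# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import defer, reactor

from scrapy_demo.spiders.baidu import BaiduSpider
from scrapy_demo.spiders.sina import SinaSpider

configure_logging()
runner = CrawlerRunner()


@defer.inlineCallbacks
def crawl():
    # Each yield waits for the previous crawl to finish before
    # starting the next one, so the spiders run sequentially
    yield runner.crawl(BaiduSpider)
    yield runner.crawl(SinaSpider)
    reactor.stop()


crawl()
reactor.run()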