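On passing custom arguments to a Scrapy spider: I searched Baidu for a long time, and nearly every result describes the same approach.
When starting the spider from the command line with crawl, add the -a option, for example: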
scrapy crawl myspider -a category=electronics
Then write the spider like this:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
In other words, just add the incoming argument to the spider's constructor.
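To make the mechanism concrete, here is a minimal stdlib-only sketch of what happens to `-a NAME=VALUE` pairs. `parse_spargs` and `SketchSpider` are hypothetical stand-ins, not Scrapy classes. The assumption is that each pair becomes a keyword argument for the spider's constructor, and that arguments the constructor does not name are kept as instance attributes.

```python
def parse_spargs(pairs):
    """Turn ['category=electronics', ...] into {'category': 'electronics', ...}."""
    kwargs = {}
    for pair in pairs:
        if "=" not in pair:
            raise ValueError("expected NAME=VALUE, got %r" % pair)
        name, value = pair.split("=", 1)  # split once so values may contain '='
        kwargs[name] = value
    return kwargs


class SketchSpider:
    """Hypothetical stand-in for a spider class (not scrapy.Spider)."""

    def __init__(self, category=None, **kwargs):
        # Extra NAME=VALUE pairs are kept as plain attributes.
        self.__dict__.update(kwargs)
        self.start_urls = ["http://www.example.com/categories/%s" % category]


spider = SketchSpider(**parse_spargs(["category=electronics", "lang=en"]))
print(spider.start_urls)
print(spider.lang)
```

Note that every value arrives as a string, so a numeric argument such as a page number has to be converted inside the spider.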
==================================================================================
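This approach did not feel very elegant to me, though. Later, while looking on GitHub for a Baidu Tieba crawler, I finally found a much cleaner way to pass custom arguments. First, here is how that author invokes the spider: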
scrapy run 仙剑五外传 -gs -p 5 12 -f thread_filter
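This crawls pages 5 through 12 of the featured (good) threads in the 仙剑五外传 tieba in "only see the original poster" mode; threads that pass the thread_filter function in filter.py, along with their content, are stored in the database.
The repository is at: https://github.com/Aqua-Dream/Tieba_Spider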
============================================================================
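Let me briefly walk through how the arguments are passed in.
It starts with run.py in the commands folder (at the same level as the spiders folder), which serves as the entry point: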
import scrapy.commands.crawl as crawl
from scrapy.exceptions import UsageError
from scrapy.commands import ScrapyCommand
import config
import filter

class Command(crawl.Command):
    def syntax(self):
        return "<tieba_name> <database_name>"

    def short_desc(self):
        return "Crawl tieba"

    def long_desc(self):
        return "Crawl baidu tieba data to a MySQL database."

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                          help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE",
                          help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT",
                          help="format to use for dumping items with -o")
        parser.add_option("-p", "--pages", nargs = 2, type="int", dest="pages", default=[],
                          help="set the range of pages you want to crawl")
        parser.add_option("-g", "--good", action="store_true", dest="good_only", default=False,
                          help="only crawl good threads and their posts and comments")
        parser.add_option("-f", "--filter", type="str", dest="filter", default="",
                          help='set function name in "filter.py" to filter threads')
        parser.add_option("-s", "--see_lz", action="store_true", dest="see_lz", default=False,
                          help='enable "only see lz" mode')

    def set_pages(self, pages):
        if len(pages) == 0:
            begin_page = 1
            end_page = 999999
        else:
            begin_page = pages[0]
            end_page = pages[1]
        if begin_page <= 0:
            raise UsageError("The number of begin page must not be less than 1!")
        if begin_page > end_page:
            raise UsageError("The number of end page must not be less than that of begin page!")
        self.settings.set('BEGIN_PAGE', begin_page, priority='cmdline')
        self.settings.set('END_PAGE', end_page, priority='cmdline')

    def run(self, args, opts):
        self.set_pages(opts.pages)
        self.settings.set('GOOD_ONLY', opts.good_only)
        self.settings.set('SEE_LZ', opts.see_lz)
        if opts.filter:
            try:
                opts.filter = eval('filter.' + opts.filter)
            except:
                raise UsageError("Invalid filter function name!")
        self.settings.set("FILTER", opts.filter)
        cfg = config.config()
        if len(args) >= 3:
            raise UsageError("Too many arguments!")

        for i in range(len(args)):
            if isinstance(args[i], bytes):
                args[i] = args[i].decode("utf8")

        self.settings.set('MYSQL_HOST', cfg.config['MYSQL_HOST'])
        self.settings.set('MYSQL_USER', cfg.config['MYSQL_USER'])
        self.settings.set('MYSQL_PASSWD', cfg.config['MYSQL_PASSWD'])

        tbname = cfg.config['DEFAULT_TIEBA']
        if len(args) >= 1:
            tbname = args[0]

        dbname = None
        if tbname in cfg.config['MYSQL_DBNAME'].keys():
            dbname = cfg.config['MYSQL_DBNAME'][tbname]
        if len(args) >= 2:
            dbname = args[1]
            cfg.config['MYSQL_DBNAME'][tbname] = dbname
        if not dbname:
            raise UsageError("Please input database name!")

        self.settings.set('TIEBA_NAME', tbname, priority='cmdline')
        self.settings.set('MYSQL_DBNAME', dbname, priority='cmdline')

        config.init_database(cfg.config['MYSQL_HOST'], cfg.config['MYSQL_USER'], cfg.config['MYSQL_PASSWD'], dbname)

        log = config.log(tbname, dbname, self.settings['BEGIN_PAGE'], opts.good_only, opts.see_lz)
        self.settings.set('SIMPLE_LOG', log)
        self.crawler_process.crawl('tieba', **opts.spargs)
        self.crawler_process.start()

        cfg.save()
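At first glance this looks complicated, but it is really just a modified version of crawl.py from the commands directory in Scrapy's own source.
Here are the key points:
a. Define the parser options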
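One detail in run worth a second look is eval('filter.' + opts.filter), which evaluates whatever text was typed after -f. A possible alternative, sketched here with a hypothetical `filters` namespace standing in for the project's filter.py, is to look the name up with `getattr` and check that the result is callable. This is only a suggestion under those assumptions, not what the repository does.

```python
from types import SimpleNamespace


def thread_filter(thread):
    """Hypothetical filter: keep threads with at least 10 replies."""
    return thread.get("reply_num", 0) >= 10


# Stand-in for `import filter`; the real module lives in the project.
filters = SimpleNamespace(thread_filter=thread_filter)


def resolve_filter(module, name):
    """Return the public callable called `name` in `module`, or raise ValueError."""
    func = None if name.startswith("_") else getattr(module, name, None)
    if not callable(func):
        raise ValueError("Invalid filter function name: %r" % name)
    return func


chosen = resolve_filter(filters, "thread_filter")
print(chosen({"reply_num": 25}))
```

Unlike eval, the lookup can only return an attribute that already exists on the module, so arbitrary expressions typed on the command line are rejected.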
parser.add_option("-p", "--pages", nargs = 2, type="int", dest="pages", default=[],
                  help="set the range of pages you want to crawl")
parser.add_option("-g", "--good", action="store_true", dest="good_only", default=False,
                  help="only crawl good threads and their posts and comments")
parser.add_option("-f", "--filter", type="str", dest="filter", default="",
                  help='set function name in "filter.py" to filter threads')
parser.add_option("-s", "--see_lz", action="store_true", dest="see_lz", default=False,
                  help='enable "only see lz" mode')
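This part defines the command-line options, that is, which arguments a command such as
scrapy run 仙剑五外传 -gs -p 5 12 -f thread_filter
is allowed to take.
b. Store the parsed arguments in settings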
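The option definitions can be tried on their own with the standard-library `optparse` module, which is what `add_option` comes from. The sketch below copies the four custom options (help texts dropped) and feeds them the example command line minus the leading `scrapy run`; the `argv` list is an assumption about what reaches the parser after the command name has been consumed.

```python
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-p", "--pages", nargs=2, type="int", dest="pages", default=[])
parser.add_option("-g", "--good", action="store_true", dest="good_only", default=False)
parser.add_option("-f", "--filter", type="string", dest="filter", default="")
parser.add_option("-s", "--see_lz", action="store_true", dest="see_lz", default=False)

# Everything after "scrapy run" in the example command.
argv = ["仙剑五外传", "-gs", "-p", "5", "12", "-f", "thread_filter"]
opts, args = parser.parse_args(argv)

print(args)                           # positional arguments: the tieba name
print(opts.pages)                     # two ints collected by nargs=2
print(opts.good_only, opts.see_lz)    # both set by the clustered "-gs"
print(opts.filter)
```

As far as I know, newer Scrapy releases build their commands on argparse instead, where the equivalent call is `parser.add_argument`, so the excerpt matches older Scrapy versions.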
self.settings.set('BEGIN_PAGE', begin_page, priority='cmdline')
self.settings.set('END_PAGE', end_page, priority='cmdline')
self.settings.set('GOOD_ONLY', opts.good_only)
self.settings.set('SEE_LZ', opts.see_lz)
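priority is the precedence level of a setting. I have not dug into it, but 'cmdline' is the highest of Scrapy's built-in levels, so a value set this way wins over one from settings.py.
c. Start the tieba spider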
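A simplified model of how such priorities behave may help. `SketchSettings` below is a hypothetical stand-in, not Scrapy's settings class, and the numeric ranks are only meant to show the ordering. The assumed rule is that a new value replaces the stored one only when its priority is at least as high.

```python
# Illustrative ranks; only their relative order matters for this sketch.
PRIORITIES = {"default": 0, "command": 10, "project": 20, "spider": 30, "cmdline": 40}


class SketchSettings:
    """Hypothetical priority-aware key/value store."""

    def __init__(self):
        self._values = {}  # name -> (value, numeric priority)

    def set(self, name, value, priority="project"):
        rank = PRIORITIES[priority]
        current = self._values.get(name)
        # Keep the stored value when it was set at a strictly higher priority.
        if current is None or rank >= current[1]:
            self._values[name] = (value, rank)

    def __getitem__(self, name):
        return self._values[name][0]


settings = SketchSettings()
settings.set("BEGIN_PAGE", 5, priority="cmdline")
settings.set("BEGIN_PAGE", 1, priority="project")  # ignored: lower priority
settings.set("GOOD_ONLY", False)                    # falls back to "project"
settings.set("GOOD_ONLY", True)                     # same priority: replaces it
print(settings["BEGIN_PAGE"], settings["GOOD_ONLY"])
```

Under this model, BEGIN_PAGE and END_PAGE set with priority='cmdline' cannot be overridden later by project-level values, while GOOD_ONLY and SEE_LZ, set without an explicit priority, could be.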
self.crawler_process.crawl('tieba', **opts.spargs)
self.crawler_process.start()
But that is not all! The startup and configuration parameters still have to be set up in two more files, settings.py and pipeline.py.
pipeline.py
# Imports this excerpt relies on (Python 2 style, as implied by the use of `unicode`)
from urllib import quote  # on Python 3 this would be urllib.parse.quote
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi

class TiebaPipeline(object):
    @classmethod  # Scrapy calls this factory method to build the pipeline
    def from_settings(cls, settings):  # cls is this class, so cls(settings) is TiebaPipeline(settings)
        return cls(settings)

    def __init__(self, settings):  # read the values out of settings
        dbname = settings['MYSQL_DBNAME']
        tbname = settings['TIEBA_NAME']
        if not dbname.strip():
            raise ValueError("No database name!")
        if not tbname.strip():
            raise ValueError("No tieba name!")
        if isinstance(tbname, unicode):
            settings['TIEBA_NAME'] = tbname.encode('utf8')

        self.settings = settings

        self.dbpool = adbapi.ConnectionPool('MySQLdb',
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset='utf8mb4',
            cursorclass = MySQLdb.cursors.DictCursor,
            init_command = 'set foreign_key_checks=0'  # asynchronous inserts can easily conflict
        )

    def open_spider(self, spider):  # sets several attributes on the spider
        spider.cur_page = begin_page = self.settings['BEGIN_PAGE']
        spider.end_page = self.settings['END_PAGE']
        spider.filter = self.settings['FILTER']
        spider.see_lz = self.settings['SEE_LZ']
        start_url = "http://tieba.baidu.com/f?kw=%s&pn=%d" \
                %(quote(self.settings['TIEBA_NAME']), 50 * (begin_page - 1))
        if self.settings['GOOD_ONLY']:
            start_url += '&tab=good'

        spider.start_urls = [start_url]

    def close_spider(self, spider):
        self.settings['SIMPLE_LOG'].log(spider.cur_page - 1)  # calls the log method of the object returned by config.log
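A few key points:
1.
    @classmethod  # Scrapy calls this factory method to build the pipeline
    def from_settings(cls, settings):  # cls is this class, so cls(settings) is TiebaPipeline(settings)
        return cls(settings)
This hands the crawler's settings to the pipeline, including the parameters set earlier in run.py.
2. Passing the parameters into the spider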
    def open_spider(self, spider):  # sets several attributes on the spider
        spider.cur_page = begin_page = self.settings['BEGIN_PAGE']
        spider.end_page = self.settings['END_PAGE']
        spider.filter = self.settings['FILTER']
        spider.see_lz = self.settings['SEE_LZ']
        start_url = "http://tieba.baidu.com/f?kw=%s&pn=%d" \
                %(quote(self.settings['TIEBA_NAME']), 50 * (begin_page - 1))
        if self.settings['GOOD_ONLY']:
            start_url += '&tab=good'

        spider.start_urls = [start_url]

    def close_spider(self, spider):
        self.settings['SIMPLE_LOG'].log(spider.cur_page - 1)  # calls the log method of the object returned by config.log
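In the pipeline class, open_spider and close_spider are callbacks that run when the spider starts and when it finishes, respectively.
This is where the parameters are handed to the spider itself.
settings.py
Finally, fill in the configuration file and you are done!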
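The URL construction inside open_spider can be checked in isolation. The sketch below pulls it into a hypothetical helper, `build_start_url`, written for Python 3 (`urllib.parse.quote`); the offset of 50 per page is taken directly from the excerpt.

```python
from urllib.parse import quote


def build_start_url(tieba_name, begin_page, good_only=False):
    """Build the first list-page URL the way open_spider does in the excerpt."""
    # The excerpt advances pn by 50 per page, so page N starts at 50 * (N - 1).
    url = "http://tieba.baidu.com/f?kw=%s&pn=%d" % (quote(tieba_name), 50 * (begin_page - 1))
    if good_only:
        url += "&tab=good"
    return url


print(build_start_url("python", 5, good_only=True))
# -> http://tieba.baidu.com/f?kw=python&pn=200&tab=good
```

A non-ASCII tieba name is percent-encoded by `quote`, which is why the pipeline calls it before building the URL.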
ITEM_PIPELINES = {
    'ds1.pipelines.Ds1Pipeline': 4,
}
COMMANDS_MODULE = 'ds1.commands'
--------------------------------------------------------------------------------------------------------
Now the spider can be started with arguments like this:
scrapy run 仙剑五外传 -gs -p 5 12
*********************************************************************************************************************
A few things I noticed during my own testing:
1. There appear to be two ways to get at the settings in pipeline.py:
    @classmethod
    def from_settings(cls, settings):
        return cls(settings)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)
I tried both and found that, for the purpose of reading settings, they are equivalent!
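Why the two come out the same can be seen in a small stand-alone sketch. `SketchPipeline` and the `crawler` object are hypothetical stand-ins (a `SimpleNamespace` plays the crawler); the only assumption is that the crawler exposes the same settings object as its `settings` attribute.

```python
from types import SimpleNamespace


class SketchPipeline:
    """Hypothetical pipeline with both factory methods."""

    def __init__(self, settings):
        self.settings = settings

    @classmethod
    def from_settings(cls, settings):
        return cls(settings)

    @classmethod
    def from_crawler(cls, crawler):
        # Same settings, reached through the crawler object.
        return cls(crawler.settings)


settings = {"TIEBA_NAME": "python", "BEGIN_PAGE": 5}
crawler = SimpleNamespace(settings=settings)  # stand-in for a crawler

a = SketchPipeline.from_settings(settings)
b = SketchPipeline.from_crawler(crawler)
print(a.settings is b.settings)
```

The practical difference is that from_crawler also gives access to whatever else the crawler object carries, while from_settings sees only the settings.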
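2. In the pipeline class, open_spider and close_spider are callbacks that run when the spider starts and when it finishes, respectively.
But a project usually has several pipeline classes. What happens if each of them defines these callbacks?
I debugged this and the result is interesting: at startup, the spider.start_urls that takes effect is the one assigned last (in the order configured in settings.py), so later pipelines overwrite earlier ones.
At shutdown, however, the close_spider that runs last belongs to the pipeline that was opened first, which is the interesting part!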
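The observed ordering can be reproduced with a small simulation. Everything here is hypothetical (`SketchPipeline`, the labels, the priority numbers); it assumes pipelines are opened in ascending order of their ITEM_PIPELINES value and closed in the opposite order, which matches what the debugging above showed.

```python
class SketchSpider:
    start_urls = ["from-class"]


class SketchPipeline:
    """Hypothetical pipeline that records when its callbacks run."""

    def __init__(self, label, events):
        self.label = label
        self.events = events

    def open_spider(self, spider):
        spider.start_urls = ["from-" + self.label]
        self.events.append("open " + self.label)

    def close_spider(self, spider):
        self.events.append("close " + self.label)


# Hypothetical ITEM_PIPELINES: lower numbers run first.
ITEM_PIPELINES = {"second": 200, "first": 100}

events = []
ordered = [SketchPipeline(label, events)
           for label, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]

spider = SketchSpider()
for pipeline in ordered:            # opened in ascending order
    pipeline.open_spider(spider)
for pipeline in reversed(ordered):  # closed in the opposite order (assumption)
    pipeline.close_spider(spider)

print(spider.start_urls)
print(events)
```

The last pipeline to open decides start_urls, and the first pipeline to open is the last to close.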
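3. In what order are the arguments actually applied?
The spider class has its own attributes, open_spider in pipeline.py can set them, and the spider can also define start_requests, so which comes first?
Using start_urls as the test case, debugging showed the following sequence:
After the crawl starts, start_urls is first read from the spider class in the spider file; then, if a pipeline defines open_spider and assigns spider.start_urls there, that value overwrites it; only then is the value handed to start_requests and the crawl proper begins.
In other words, the order of precedence is: start_requests > open_spider > class attributes of the spider.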
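One plausible explanation for this ordering is generator laziness: if start_requests is a generator, its body does not run until the first item is requested, so an attribute changed in open_spider in the meantime is the one it reads. The sketch below is a stand-alone illustration with hypothetical names, assuming the generator is created before open_spider runs but consumed only afterwards; it does not reproduce Scrapy's engine.

```python
class SketchSpider:
    start_urls = ["http://example.com/from-class"]

    def start_requests(self):
        # Generator body: nothing here runs until the first next() call.
        for url in self.start_urls:
            yield url


def open_spider(spider):
    """Stand-in for a pipeline's open_spider callback."""
    spider.start_urls = ["http://example.com/from-open_spider"]


spider = SketchSpider()
requests = spider.start_requests()  # created early, body not yet executed
open_spider(spider)                 # the hook replaces start_urls
first = next(requests)              # only now is self.start_urls read
print(first)
```

A start_requests that builds its own URLs without looking at start_urls would ignore both earlier values, which is consistent with it ranking highest.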