Notes: Scrapy pipeline
1. Introduction
After Scrapy scrapes data, the spider uses yield to send item objects to the pipeline, and each pipeline component processes the items in sequence.
Pipelines are typically used for:
cleaning, validating, and checking data (a small sketch follows below);
storing data;
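A minimal sketch of such a cleaning/validation pipeline. The 'price' field and the drop rule are illustrative assumptions, not part of this note:

from scrapy.exceptions import DropItem

class PriceValidationPipeline(object):

    def process_item(self, item, spider):
        if item.get('price'):
            # simple cleaning step: normalize the price to a float
            item['price'] = float(item['price'])
            return item
        # items without a price are dropped and skipped by later pipelines
        raise DropItem("Missing price in %s" % item)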
2. Usage
Example: saving scraped items to a JSON Lines file (items.jl):
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # called once when the spider starts: open the output file
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        # called once when the spider finishes: close the file
        self.file.close()

    def process_item(self, item, spider):
        # serialize each item as one JSON line and pass the item on
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
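For context, the spider side only needs to yield items; Scrapy then hands each one to process_item() of every enabled pipeline in order. A minimal sketch, where the spider name, URL, and CSS selectors are illustrative assumptions:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            # each yielded dict goes through the enabled item pipelines
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }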
3. Classes and methods
process_item(self, item, spider)
This method is called for every item pipeline component. process_item() must either: return a dict with data, return an Item (or any descendant class) object, return a Twisted Deferred, or raise a DropItem exception. Dropped items are no longer processed by further pipeline components.
Parameters:
item (Item object or a dict) – the item scraped
spider (Spider object) – the spider which scraped the item
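For example, process_item() can filter out duplicates by raising DropItem, in the style of the duplicates-filter example from the Scrapy docs; the 'id' field below is an assumption about the item structure:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            # dropped items are not passed to the remaining pipeline components
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item['id'])
        return item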
The following methods are also used occasionally:
open_spider(self, spider)
This method is called when the spider is opened.
Parameters: spider (Spider object) – the spider which was opened
close_spider(self, spider)
This method is called when the spider is closed.
Parameters: spider (Spider object) – the spider which was closed
from_crawler(cls, crawler)
If present, this classmethod is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. The Crawler object provides access to all Scrapy core components like settings and signals; it is a way for the pipeline to access them and hook its functionality into Scrapy.
Parameters: crawler (Crawler object) – crawler that uses this pipeline
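A sketch of from_crawler() pulling configuration from the project settings; JSONWRITER_FILE is a made-up setting name used only for illustration:

import json

class ConfigurableJsonWriterPipeline(object):

    def __init__(self, path):
        self.path = path

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes the project settings
        return cls(path=crawler.settings.get('JSONWRITER_FILE', 'items.jl'))

    def open_spider(self, spider):
        self.file = open(self.path, 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item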
4. Further usage
Activating a pipeline
To use a pipeline, it must be enabled in the settings file as follows:
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
The integer value determines the order in which pipelines run: lower values run first. The values are defined in the 0-1000 range.