Notes: Scrapy pipeline
1. Introduction
After Scrapy scrapes data, the spider uses yield to send item objects to the pipeline, and each pipeline component processes the items in sequence.
Pipelines are typically used for:
cleaning, validating, and checking data (a small sketch follows below);
storing data;
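A minimal sketch of such a cleaning/validation pipeline. The 'price' field and the drop rule are illustrative assumptions, not part of this note:

from scrapy.exceptions import DropItem

class PriceValidationPipeline(object):

    def process_item(self, item, spider):
        if item.get('price'):
            # simple cleaning step: normalize the price to a float
            item['price'] = float(item['price'])
            return item
        # items without a price are dropped and skipped by later pipelines
        raise DropItem("Missing price in %s" % item)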
2. Usage
Example: saving scraped items to a JSON Lines file (items.jl):
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # called once when the spider starts: open the output file
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        # called once when the spider finishes: close the file
        self.file.close()

    def process_item(self, item, spider):
        # serialize each item as one JSON line and pass the item on
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
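For context, the spider side only needs to yield items; Scrapy then hands each one to process_item() of every enabled pipeline in order. A minimal sketch, where the spider name, URL, and CSS selectors are illustrative assumptions:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            # each yielded dict goes through the enabled item pipelines
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }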
3. Classes and methods
process_item(self, item, spider)
This method is called for every item pipeline component. process_item() must either: return a dict with data, return an Item (or any descendant class) object, return a Twisted Deferred, or raise a DropItem exception. Dropped items are no longer processed by further pipeline components.
Parameters:
item (Item object or a dict) – the item scraped
spider (Spider object) – the spider which scraped the item
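For example, process_item() can filter out duplicates by raising DropItem, in the style of the duplicates-filter example from the Scrapy docs; the 'id' field below is an assumption about the item structure:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            # dropped items are not passed to the remaining pipeline components
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item['id'])
        return item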
The following methods are also used occasionally:
open_spider(self, spider)
This method is called when the spider is opened.
Parameters: spider (Spider object) – the spider which was opened
close_spider(self, spider)
This method is called when the spider is closed.
Parameters: spider (Spider object) – the spider which was closed
from_crawler(cls, crawler)
If present, this classmethod is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. The Crawler object provides access to all Scrapy core components like settings and signals; it is a way for the pipeline to access them and hook its functionality into Scrapy.
Parameters: crawler (Crawler object) – crawler that uses this pipeline
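A sketch of from_crawler() pulling configuration from the project settings; JSONWRITER_FILE is a made-up setting name used only for illustration:

import json

class ConfigurableJsonWriterPipeline(object):

    def __init__(self, path):
        self.path = path

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes the project settings
        return cls(path=crawler.settings.get('JSONWRITER_FILE', 'items.jl'))

    def open_spider(self, spider):
        self.file = open(self.path, 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item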
4. Further usage
Activating a pipeline
To use a pipeline, it must be enabled in the settings file as follows:
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
The integer value determines the order in which pipelines run: lower values run first. The values are defined in the 0-1000 range.