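Persistent storage operations:
a. Disk files
a) Using a terminal command
i. Make sure the parse method returns an iterable object (holding the parsed page content)
ii. Use a terminal command to write the data to a specified disk file
1. scrapy crawl <spider name> -o <file>.<extension> (e.g. test.csv); the spider name is the value of the spider's name attribute
b) Using pipelines
i. items: stores the parsed page data
ii. pipelines: handles the persistent-storage operations
iii. Implementation steps:
1. Store the parsed page data in an item object
2. Use the yield keyword to submit the item to the pipeline file for processing
3. Write the storage code in the pipeline file (pipelines.py)
4. Enable the pipeline in the configuration file (settings.py)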
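Step 4 comes down to registering the pipeline class in settings.py. A minimal sketch, assuming the project package is named qiubai and the pipeline class is QiubaiPipeline in qiubai/pipelines.py (adjust the dotted path if your project differs). The number is the priority: Scrapy runs pipelines with lower values first, and values are conventionally kept in the 0-1000 range.

```python
# settings.py (fragment): enable the item pipeline.
# Assumption: the project package is `qiubai` and the class is defined
# in qiubai/pipelines.py.
ITEM_PIPELINES = {
    # "dotted.path.to.PipelineClass": priority (lower numbers run first)
    "qiubai.pipelines.QiubaiPipeline": 300,
}
```

If this setting is left commented out, items yielded by the spider never reach process_item.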
b. The code for the pipeline-based approach is as follows:
spiders/qiushibai.py
# -*- coding: utf-8 -*-
import scrapy
from qiubai.items import QiubaiItem


class QiushibaiSpider(scrapy.Spider):
    name = 'qiushibai'
    # allowed_domains = ['www.qiushibaike.com/text/']
    start_urls = ['http://www.qiushibaike.com/text//']

    def parse(self, response):
        # XPath is recommended for extracting content (Scrapy has a built-in XPath interface)
        # Extract each joke's content and author
        div_list = response.xpath('//div[@id="content-left"]/div')
        # data_list = []
        for div in div_list:
            # xpath() returns the matched content wrapped in Selector objects
            # extract() pulls the data values out of those Selector objects
            # author = div.xpath("./div/a[2]/h2/text()").extract()[0]
            # extract_first() is like extract()[0], but returns None
            # instead of raising IndexError when nothing matches
            author = div.xpath("./div/a[2]/h2/text()").extract_first()
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first()
            # print(author, '---------------')
            # print(content)
            # data = {
            #     "author": author,
            #     "content": content
            # }
            # Store the parsed values in an item object
            item = QiubaiItem()
            item["author"] = author
            item["content"] = content
            # Submit the item to the pipeline
            yield item
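For the terminal-command approach described at the top, parse only needs to yield plain dicts (the commented-out data dict in the spider hints at this); Scrapy's own feed exporter then does the writing when -o is passed. The sketch below is a stand-in that runs without Scrapy, just to show the shape of the records and roughly what a CSV built from them looks like. The names parse_records, to_csv, and sample are hypothetical, and the csv module is used here in place of Scrapy's exporter.

```python
import csv
import io


def parse_records(rows):
    """Stand-in for a spider's parse(): yields one plain dict per record.

    `rows` is a hypothetical list of (author, content) pairs that replaces
    the XPath extraction done in the real spider.
    """
    for author, content in rows:
        yield {"author": author, "content": content}


def to_csv(records):
    """Write the dicts as CSV text using only the standard library."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["author", "content"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)  # accepts any iterable, including a generator
    return buf.getvalue()


sample = [("alice", "first joke"), ("bob", "second joke")]
print(to_csv(parse_records(sample)))
```

With the real spider, the equivalent is to yield the dict from parse and run scrapy crawl qiushibai -o test.csv.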
qiubai/items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class QiubaiItem(scrapy.Item):
    # define the fields for your item here like:
    author = scrapy.Field()
    content = scrapy.Field()
qiubai/pipelines.py # First enable the pipeline in settings.py (search for "pipeline" there); the number 300 is its priority
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class QiubaiPipeline(object):
    fp = None

    # Called only once, when the spider starts running
    def open_spider(self, spider):
        print("Spider started")
        self.fp = open("./data_record.txt", "w", encoding="utf-8")

    # Receives each item submitted by the spider file and persists the page data stored in it
    # The item parameter is the received item object
    # Runs once for every item the spider submits to the pipeline
    def process_item(self, item, spider):
        # Read the stored data out of the item; extract_first() may have
        # returned None, so fall back to an empty string before concatenating
        author = item["author"] or ""
        content = item["content"] or ""
        # Persist it
        self.fp.write(author + ":" + content + "\n\n\n")
        return item

    # Called only once, when the spider finishes
    def close_spider(self, spider):
        print("Spider finished")
        self.fp.close()
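To see the call order of the three hooks without running a crawl, here is a small sketch that drives a pipeline by hand. It is not how Scrapy is implemented internally; run is a hypothetical driver that only imitates the order (open_spider once, process_item per item, close_spider once), and FilePipeline is an illustrative class with the same shape as QiubaiPipeline, except that the output path is passed in so the example can write to a temporary directory.

```python
import os
import tempfile


class FilePipeline:
    """Illustrative pipeline with the same three hooks as QiubaiPipeline."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Open the output file once, before any item arrives
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called once per item; returning the item passes it on
        self.fp.write(item["author"] + ":" + item["content"] + "\n")
        return item

    def close_spider(self, spider):
        # Close the file once, after the last item
        self.fp.close()


def run(pipeline, items):
    """Hypothetical driver imitating the order in which the hooks are called."""
    pipeline.open_spider(spider=None)
    try:
        for item in items:
            pipeline.process_item(item, spider=None)
    finally:
        pipeline.close_spider(spider=None)


path = os.path.join(tempfile.mkdtemp(), "data_record.txt")
run(FilePipeline(path), [
    {"author": "alice", "content": "first joke"},
    {"author": "bob", "content": "second joke"},
])
with open(path, encoding="utf-8") as f:
    text = f.read()
print(text)
```

Opening the file in open_spider rather than in process_item is the point of the design: the file is opened and closed once per crawl instead of once per item.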