Create a project
scrapy startproject mySpider
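For reference, this creates a project skeleton roughly like the following (the exact set of files can differ slightly between Scrapy versions):

mySpider/
    scrapy.cfg            # deploy/config file
    mySpider/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where the spiders live
            __init__.py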
Generate a spider
scrapy genspider itcast "itcast.cn"
name = 'itcast'  # spider name
allowed_domains = ['itcast.cn']  # domains the spider is allowed to crawl
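The genspider command writes mySpider/spiders/itcast.py with name and allowed_domains already filled in; the skeleton looks roughly like the code below (the default start_urls value varies a little between Scrapy versions):

# -*- coding: utf-8 -*-
import scrapy


class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://itcast.cn/']

    def parse(self, response):
        pass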
Extract data
Complete the spider, extracting the fields with XPath and similar selector methods.
- In settings.py, add
LOG_LEVEL = "WARNING"
to cut down the log output
- Pipelines: register them in settings.py
ITEM_PIPELINES = {
    'mySpider.pipelines.MyspiderPipeline': 300,
    'mySpider.pipelines.MyspiderPipeline1': 301,
}
Save data
Save the data in a pipeline (a minimal save-to-file sketch appears at the end of this section).
itcast.py
# -*- coding: utf-8 -*-
import scrapy


class ItcastSpider(scrapy.Spider):
    name = 'itcast'  # spider name
    allowed_domains = ['itcast.cn']  # domains the spider is allowed to crawl
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']  # the URL requested first

    def parse(self, response):
        # handle the response for the start_urls request
        # ret = response.xpath("//div[@class='tea_con']//h3/text()").extract()
        # print(ret)

        # group the data: one <li> per teacher
        li_list = response.xpath("//div[@class='tea_con']//li")
        for li in li_list:
            item = {}  # build a fresh dict for every teacher
            item["name"] = li.xpath(".//h3/text()").extract_first()
            item["title"] = li.xpath(".//h4/text()").extract_first()
            # print(item)
            yield item
Extract the data with XPath.
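With the spider in place, it can be run from the project root; every dict it yields is handed to the enabled item pipelines:

scrapy crawl itcast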
Process the data in the pipelines (pipelines.py)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class MyspiderPipeline(object):
    def process_item(self, item, spider):
        item["hello"] = "world"
        return item


class MyspiderPipeline1(object):
    def process_item(self, item, spider):
        print(item)
        return item
Because the settings register pipeline 1 with priority 300 and pipeline 2 with 301, an item goes through pipeline 1 first and then pipeline 2 (lower numbers run earlier). Pipeline 1 adds a "hello" key to the dict, so pipeline 2 receives the item with that field already set.
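The two pipelines above only illustrate how items flow through the pipeline chain; to actually persist the data, a pipeline can write each item out itself. Below is a minimal sketch, assuming an output file teachers.jsonl in JSON Lines format (both are illustrative choices, not part of the original project); it would also need to be registered in ITEM_PIPELINES with its own priority, e.g. 302.

import json


class SaveTeacherPipeline(object):
    # hypothetical pipeline: appends every item to a JSON Lines file

    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open("teachers.jsonl", "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()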