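Crawling multi-level URLs with a Scrapy spider

Quotes to Scrape (a quotes website). Level: beginner

Goal: scrape the information for every quote (quote text, author, tags, author's date of birth, author's place of birth, and the author's short description).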

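Approach:

1. Initial URL (the site address): http://quotes.toscrape.com/

2. Fetch the response for the initial URL and pass it to the parse1 function, which parses the first-level page.

3. For each quote, extract the URL of its second-level page and queue a request for it with parse2 (which parses the second-level page) as the callback.

4. parse2 parses the response of each second-level URL and produces the final data.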

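Easily overlooked points:

1. The second-level URL of a quote is built from the author's name, so many second-level URLs are duplicates. Duplicate filtering must be turned off for these requests, otherwise quotes that share an author are dropped.

2. When locating elements, study the structure of the page carefully. During this crawl, an imprecise selector for the "next page" link kept causing many records to go missing. At first I blamed duplicate filtering, but turning it off only sent the spider into an infinite loop.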

next_url = response.xpath('//ul[@class="pager"]//a/@href').extract()[0]
if next_url is not None:
    yield scrapy.Request(url=self.base_url + next_url, callback=self.parse)

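Written this way, next_url picks up the URL of the previous page on every page after the first, because the "Previous" link is the first <a> inside the pager. The if condition is also always satisfied: extract()[0] either returns a string or raises IndexError, so it never yields None. The mistake came from overlooking the previous-page element when analyzing the element tree. The correct next_url code appears in the spider further down.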
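
The difference between the two selectors can be reproduced offline. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, so it runs without Scrapy installed. The pager markup is a simplified assumption about the page structure, and the helper names loose_href and next_href are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified pager markup for a middle page (assumed structure)
MIDDLE_PAGE = """<ul class="pager">
  <li class="previous"><a href="/page/1/">Previous</a></li>
  <li class="next"><a href="/page/3/">Next</a></li>
</ul>"""

# On the last page only the "previous" link remains
LAST_PAGE = """<ul class="pager">
  <li class="previous"><a href="/page/9/">Previous</a></li>
</ul>"""


def loose_href(markup):
    """First <a> anywhere inside the pager (the imprecise selector)."""
    root = ET.fromstring(markup)
    links = root.findall(".//a")
    return links[0].get("href") if links else None


def next_href(markup):
    """Only the <a> inside <li class="next"> (the precise selector)."""
    root = ET.fromstring(markup)
    link = root.find("./li[@class='next']/a")
    return link.get("href") if link is not None else None


print(loose_href(MIDDLE_PAGE))  # /page/1/  (goes backwards)
print(next_href(MIDDLE_PAGE))   # /page/3/
print(loose_href(LAST_PAGE))    # /page/9/  (never stops)
print(next_href(LAST_PAGE))     # None      (crawl ends here)
```

With the loose selector the spider keeps bouncing back to earlier pages, which explains both the missing records (when duplicates are filtered) and the infinite loop (when they are not).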

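Preparation: create the project (scrapy startproject quotes_toscrape) and the spider (scrapy genspider quotes quotes.toscrape.com)

Define the target: items.py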

import scrapy


class QuotesToscrapeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Quote text
    quote = scrapy.Field()
    # Author
    author = scrapy.Field()
    # Tags
    tags = scrapy.Field()

    # Date of birth
    born_date = scrapy.Field()
    # Place of birth
    born_location = scrapy.Field()
    # Description
    description = scrapy.Field()

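Define the spider: quotes.py

[Duplicate filtering: the dont_filter parameter of Request defaults to False, which means duplicates are filtered. Each time a Request is yielded, the scheduler checks it against the requests it has already seen (Scrapy compares request fingerprints, which for simple GET requests amounts to comparing URLs). A request that has been seen before is not queued; a new one is. This check happens for every request before it enters the queue. To disable filtering for a request, set dont_filter=True.]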
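
The behaviour described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Scrapy's actual scheduler: it compares raw URL strings instead of request fingerprints, and the class name SimpleScheduler and its enqueue method are hypothetical.

```python
class SimpleScheduler:
    """Toy model of duplicate filtering; not Scrapy's real implementation."""

    def __init__(self):
        self.seen = set()   # URLs that have passed through the filter
        self.queue = []     # requests waiting to be downloaded

    def enqueue(self, url, dont_filter=False):
        """Queue a URL and report whether it was accepted."""
        if not dont_filter:
            if url in self.seen:
                return False        # duplicate: silently dropped
            self.seen.add(url)
        self.queue.append(url)      # dont_filter=True bypasses the check
        return True


scheduler = SimpleScheduler()
author_url = 'http://quotes.toscrape.com/author/Albert-Einstein'

print(scheduler.enqueue(author_url))                    # True
print(scheduler.enqueue(author_url))                    # False
print(scheduler.enqueue(author_url, dont_filter=True))  # True
```

In this spider every quote yields its own request to the author page, so without dont_filter=True only the first quote of each author would reach the second callback.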

# -*- coding: utf-8 -*-
import scrapy
from quotes_toscrape.items import QuotesToscrapeItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    # start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com/'

    page = 1
    first_url = 'http://quotes.toscrape.com/page/{}/'
    start_urls = [first_url.format(page)]

    def parse(self, response):
        node_list = response.xpath('//div[@class="quote"]')
        for node in node_list:
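            # Strip the surrounding curly quotation marks from the quote text
            quote = node.xpath('.//span[@class="text"]/text()').extract()[0][1:-1]
            author = node.xpath('.//small/text()').extract()[0]
            # Query the current node (not node_list) and keep every tag of this quote
            tags = node.xpath('.//div[@class="tags"]//a/text()').extract()
            # urljoin resolves the relative href against the current page URL
            href = response.urljoin(node.xpath('.//small/following-sibling::a/@href').extract()[0])
            # dont_filter=True: author pages repeat, but each quote needs its own request
            yield scrapy.Request(url=href, meta={'quote': quote, 'author': author, 'tags': tags},
                                 callback=self.parse_author, dont_filter=True)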

        # extract_first() returns None on the last page, where there is no "next" link
        next_url = response.xpath('//ul[@class="pager"]/li[@class="next"]/a/@href').extract_first()
        if next_url is not None:
            yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse)
        # if self.page<10:
        #     self.page += 1
        #     yield scrapy.Request(url=self.first_url.format(self.page), callback=self.parse)

    def parse_author(self, response):
        item = QuotesToscrapeItem()
        # Combine the fields passed in from the first-level page
        item['quote'] = response.meta['quote']
        item['author'] = response.meta['author']
        item['tags'] = response.meta['tags']
        item['born_date'] = response.xpath('//span[@class="author-born-date"]/text()').extract()[0]
        # Drop the leading "in " from the location text
        item['born_location'] = response.xpath('//span[@class="author-born-location"]/text()').extract()[0][3:]
        # Strip leading and trailing whitespace
        item['description'] = response.xpath('//div[@class="author-description"]/text()').extract()[0].strip()
        yield item
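
The hrefs on this site are relative paths that start with a slash (for example /page/2/), so how they are combined with the base URL matters. The sketch below uses only the standard library's urllib.parse.urljoin to compare plain string concatenation with proper URL joining. The variable names are illustrative.

```python
from urllib.parse import urljoin

base_url = 'http://quotes.toscrape.com/'
href = '/page/2/'

# Plain concatenation keeps both slashes
concatenated = base_url + href
# urljoin resolves the href against the base URL
joined = urljoin(base_url, href)

print(concatenated)  # http://quotes.toscrape.com//page/2/
print(joined)        # http://quotes.toscrape.com/page/2/

# Joining also works from any page of the site
author_link = urljoin('http://quotes.toscrape.com/page/2/', '/author/Albert-Einstein')
print(author_link)   # http://quotes.toscrape.com/author/Albert-Einstein
```

Inside a Scrapy callback the same result is available through response.urljoin(href), which joins against the URL of the current response.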

Define the pipeline: pipelines.py

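import json


class QuotesToscrapePipeline(object):
    def __init__(self):
        # Text mode with an explicit encoding, so no manual encode() is needed
        self.file = open('quotes.json', 'w', encoding='utf-8')
        self.first = True
        self.file.write('[\n')

    def process_item(self, item, spider):
        # Separate items with commas so the file ends up as a valid JSON array
        if not self.first:
            self.file.write(',\n')
        self.first = False
        self.file.write(json.dumps(dict(item), ensure_ascii=False, indent=4))
        return item

    def close_spider(self, spider):
        # Close the JSON array before closing the file
        self.file.write('\n]')
        self.file.close()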
