Python Web Scraping (13): Hello-World-Level Scrapy (the Scrapy Spider Component)

The Scrapy spider component:

The three main objects this component works with:

Response object

Item object

Request object

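Steps:

First, initialize the list of request URLs and specify the callback function that handles each response after it has been downloaded.

In the parse callback, parse the response and return dicts, Item objects, Request objects, or an iterable of these.

Inside the callback, use selectors to parse the page content and generate the resulting Items.

Finally, the returned Items are usually persisted to a database (using an Item Pipeline) or saved to a file using Feed exports.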
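
The last step, persisting Items through an Item Pipeline, could look roughly like the sketch below. It follows the `open_spider` / `process_item` / `close_spider` method names from the Scrapy documentation, but the class name `SQLitePipeline`, the table layout, and the in-memory database are assumptions chosen so that the example runs on its own.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical pipeline that stores each scraped quote in a SQLite table."""

    def __init__(self, db_path=":memory:"):
        # ":memory:" keeps the sketch self-contained; a real project would use a file path.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection and create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT)")

    def process_item(self, item, spider):
        # Called for every item the spider yields; must return the item (or raise DropItem).
        self.conn.execute(
            "INSERT INTO quotes (text, author) VALUES (?, ?)",
            (item.get("text"), item.get("author")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()


if __name__ == "__main__":
    # Drive the pipeline by hand with a made-up item, without starting a crawl.
    pipeline = SQLitePipeline()
    pipeline.open_spider(spider=None)
    pipeline.process_item({"text": "An example quote", "author": "Someone"}, spider=None)
    print(pipeline.conn.execute("SELECT text, author FROM quotes").fetchall())
    pipeline.close_spider(spider=None)
```

In a Scrapy project, a pipeline like this is enabled through the `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.SQLitePipeline": 300}`, where the module path is hypothetical. If you only need a file, the `-o` option (Feed exports) is enough and no pipeline is required.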

import scrapy
# Run with: scrapy runspider G:\360MoveData\Users\ASUS\Desktop\pythontest01.py -o fistscrapy02.json

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']
    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            # Yielding a dict lets `-o fistscrapy02.json` save the results to fistscrapy02.json.
            # Supported -o output formats: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'
            yield {
                'text': quote.xpath('span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('span/small[@class="author"]/text()').extract_first(),
            }
        # The "next" link's href attribute lives on the <a> element inside <li class="next">.
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page is not None:
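            # Resolve the relative link against the current page URL to get an absolute URL.
            next_page = response.urljoin(next_page)
            # Add a Request for the next page to the queue; the new page is also parsed by self.parse.
            yield scrapy.Request(next_page, callback=self.parse)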
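
A note on `response.urljoin`: it resolves a relative link against the page URL rather than simply gluing two strings together. The standard-library sketch below shows the difference. Treat it as an approximation of what Scrapy does (Scrapy also takes the page's base URL into account), and note that the `relative_href` value is an assumed example of what the "next" link might contain.

```python
from urllib.parse import urljoin

page_url = "http://quotes.toscrape.com/tag/humor/"
relative_href = "/tag/humor/page/2/"  # assumed example value of the "next" link's href

# Proper URL resolution: a path starting with "/" replaces the path of the page URL.
absolute = urljoin(page_url, relative_href)

# Naive string concatenation, shown only for contrast; it duplicates the path.
concatenated = page_url + relative_href

print(absolute)
print(concatenated)
```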
