Scrapy's officially recommended starter site: Quotes to Scrape

Crawling a website with Scrapy: Quotes to Scrape

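Preface

This is the practice site used in the official Scrapy tutorial: http://quotes.toscrape.com/ . The pages are simple, but small as the site is, it has everything a crawler needs to deal with. I will use it to walk through basic Scrapy usage in detail. My skills are limited, but I will do my best. All code from this tutorial has been packaged and uploaded: https://download.csdn.net/download/qq_42776455/10727591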

Creating the Scrapy project

Create the project:

scrapy startproject quotes

Create the spider:

This fails with an error, because a spider cannot have the same name as the project.

>scrapy genspider quotes quotes.toscrape.com
Cannot create a spider with the same name as your project

After changing the spider name, creation succeeds.

>scrapy genspider quote quotes.toscrape.com
Created spider 'quote' using template 'basic' in module:
  quotes.spiders.quote

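Crawling the data

First, decide what data to scrape
The site makes this easy: every entry consists of a quote, its author, and a set of tags. These are referred to below as quote, author, and tags.

Locate the data
XPath is used for parsing here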

root = '//div[@class="quote"]'
quote = './span[@class="text"]/text()'
author = './/small[@class="author"]/text()'
tags = './/a[@class="tag"]/text()'
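
To make the relative paths above concrete, here is a minimal sketch that applies the same element paths to a hand-written fragment. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors; ElementTree supports only a subset of XPath and has no text() step, so the text is read from the .text attribute instead. The markup in SAMPLE is a simplified assumption rather than the site's real HTML, and extract_quotes is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for one quote block (assumed markup,
# not copied from the real page).
SAMPLE = """
<body>
  <div class="quote">
    <span class="text">A sample quote.</span>
    <span>by <small class="author">Sample Author</small></span>
    <div class="tags">
      <a class="tag" href="/tag/one/">one</a>
      <a class="tag" href="/tag/two/">two</a>
    </div>
  </div>
</body>
"""


def extract_quotes(markup):
    """Return one dict per quote block with its quote, author and tags."""
    root = ET.fromstring(markup.strip())
    results = []
    # Same idea as root = '//div[@class="quote"]': find every quote block.
    for block in root.findall(".//div[@class='quote']"):
        results.append({
            # Paths are relative to the current block, as in the spider.
            "quote": block.find("./span[@class='text']").text,
            "author": block.find(".//small[@class='author']").text,
            "tags": [a.text for a in block.findall(".//a[@class='tag']")],
        })
    return results


if __name__ == "__main__":
    print(extract_quotes(SAMPLE))
```

The point of the leading "./" and ".//" is that each lookup starts from the current quote block, so the author and tags of one quote are never mixed with those of another.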

Get the URL of the next page

# inside parse(): follow the "next" link if the page has one
next_href = response.xpath('//li[@class="next"]/a/@href').extract_first()
if next_href:
    next_page = response.urljoin(next_href)
    yield scrapy.Request(url=next_page, callback=self.parse)

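The href extracted from the page is a relative path, "/page/2/". The urljoin method resolves it against the URL of the current response to produce an absolute URL, which is then requested with scrapy.Request(); the callback is still parse(). The last page has no "next" link, so the crawl stops there.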
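
The URL resolution step can be tried outside Scrapy. As far as I know, response.urljoin builds on the standard library's urllib.parse.urljoin, with the response's own URL as the base, so the sketch below calls that function directly. The helper name next_page_url is hypothetical and exists only for this illustration.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve a relative "next" link against the page it was found on."""
    return urljoin(current_url, href)


if __name__ == "__main__":
    # From the start page, "/page/2/" resolves to the second page.
    print(next_page_url("http://quotes.toscrape.com/", "/page/2/"))
    # A leading slash makes the href root-relative, so it replaces the
    # current path instead of being appended to it.
    print(next_page_url("http://quotes.toscrape.com/page/2/", "/page/3/"))
```

Because the href starts with a slash, it is resolved against the site root rather than appended to the current path, which is why the same code works on every page.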

Complete spider code

import scrapy

# QuotesItem must declare the fields quote, author and tags in quotes/items.py
from quotes.items import QuotesItem


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        for quote_ in quotes:
            # Create a fresh item per quote so yielded items do not share state
            item = QuotesItem()
            item['quote'] = quote_.xpath('./span[@class="text"]/text()').extract_first()
            item['author'] = quote_.xpath('.//small[@class="author"]/text()').extract_first()
            item['tags'] = quote_.xpath('.//a[@class="tag"]/text()').extract()
            yield item

        # Follow the "next" link; the last page has none, which ends the crawl
        next_href = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_href:
            next_page = response.urljoin(next_href)
            yield scrapy.Request(url=next_page, callback=self.parse)

Saving the data

Save as JSON

scrapy crawl quote -o quotes.json
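
To check the export, the file can be read back with the standard library. This is a minimal sketch assuming quotes.json holds a single JSON array of items with the quote, author, and tags keys used in this tutorial; load_quotes and quotes_per_author are hypothetical helper names.

```python
import json
from collections import Counter


def load_quotes(path):
    """Load the list of scraped items from a JSON feed file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def quotes_per_author(items):
    """Count how many quotes each author has."""
    return Counter(item["author"] for item in items)


if __name__ == "__main__":
    items = load_quotes("quotes.json")
    print(len(items), "items")
    print(quotes_per_author(items).most_common(3))
```

If loading fails with a JSON decode error, one possible cause is running the crawl twice with -o against the same file, since the second run appends to it.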


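Save to MongoDB

Processing the items requires pipelines.py, which has to be enabled in settings.py.

Configure settings.py
This entry already exists in the generated file; just uncomment it.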

ITEM_PIPELINES = {
'quotes.pipelines.QuotesPipeline': 300,
}

The MongoDB-related settings have to be added by hand:

# mongodb config
mongo_host = '127.0.0.1'
mongo_port = 27017
mongo_db_name = 'quotes'
mongo_db_collection = 'quotes_infor'

Edit pipelines.py
First, remember to import the values from settings.py into this file.

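import pymongo

from quotes.settings import (
    mongo_host,
    mongo_port,
    mongo_db_name,
    mongo_db_collection,
)


class QuotesPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient(host=mongo_host, port=mongo_port)
        mydb = client[mongo_db_name]
        self.post = mydb[mongo_db_collection]

    def process_item(self, item, spider):
        '''
        :param item: the item yielded by the spider and passed to the pipeline. It is a dict-like Item object, so it is converted to a plain dict before saving.
        :param spider: the spider that produced the item
        :return: the item, so that any later pipeline can still process it
        '''
        data = dict(item)
        # insert_one() replaces the deprecated insert(), which PyMongo 4 removed
        self.post.insert_one(data)
        return item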

Use a MongoDB GUI tool to view the result.

OK, task complete. Today was a rather sloppy day, though, and this took longer than it should have.