Scraping Classical Chinese Poetry with Scrapy

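Scrapy is an application framework for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data. I recently needed a large amount of data for work, so I tried learning to write Scrapy spiders. There are already plenty of guides online on installing and understanding Scrapy, so I won't go over that here. Let's get straight to work and scrape some classical Chinese poetry.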

1. Create the Scrapy project

[john@localhost scrapy]$ scrapy startproject ShiCi

Then, as the command output suggests, generate the spider file:

[john@localhost scrapy]$ cd ShiCi
[john@localhost ShiCi]$ scrapy genspider gushiwen gushiwen.org

A correctly generated project has the following layout:

[john@localhost scrapy]$ tree ShiCi
ShiCi
├── scrapy.cfg      // project configuration file
└── ShiCi
    ├── __init__.py
    ├── items.py       // item definitions
    ├── middlewares.py
    ├── pipelines.py   // pipelines
    ├── settings.py    // settings
    └── spiders
        ├── gushiwen.py // the spider
        ├── __init__.py
        └── text.pyc

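2. Define the item

To define common output data, Scrapy provides the Item class. An Item object is a simple container that holds the scraped data. It offers a dictionary-like API and a simple syntax for declaring the available fields.
To declare an Item, subclass scrapy.Item and define class attributes of type scrapy.Field. Modeling the data we need as an item gives the scraped results a fixed structure. Here we mainly want the poem's title, its author, and its text.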

[john@localhost ShiCi]$ vim items.py


import scrapy

class ShiciItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name    = scrapy.Field()    # poem title
    author  = scrapy.Field()    # author
    content = scrapy.Field()    # poem text

3. Configure the spider

In Scrapy, configuration generally lives in settings.py. This file holds much of the runtime configuration, such as the default headers, the User-Agent, and user-defined pipelines. We only need to uncomment the options we want. USER_AGENT takes a single string, so pick any common browser string.

[john@localhost ShiCi]$ vim settings.py


USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:59.0) Gecko/20100101 Firefox/59.0'

ROBOTSTXT_OBEY = False  # do not obey robots.txt rules
DOWNLOAD_DELAY = 1  # delay between requests, in seconds

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en, zh-CN',
}

ITEM_PIPELINES = {
    'ShiCi.pipelines.ShiciPipeline': 300,
}
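
Scrapy's built-in USER_AGENT setting is read as a single string, so rotating among several browser strings takes a small downloader middleware. The following is a minimal sketch and not part of the original project: the class name `RandomUserAgentMiddleware` and the list `USER_AGENT_POOL` are hypothetical names chosen for illustration. Scrapy calls `process_request` on each enabled downloader middleware before a request is sent.

```python
import random

# Example browser strings (hypothetical pool; any real ones will do).
USER_AGENT_POOL = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:59.0) Gecko/20100101 Firefox/59.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0',
]


class RandomUserAgentMiddleware(object):
    """Downloader middleware that sets a random User-Agent on each request."""

    def process_request(self, request, spider):
        # Overwrite the header so every request gets a fresh choice.
        request.headers['User-Agent'] = random.choice(USER_AGENT_POOL)
        # Returning None tells Scrapy to keep processing the request.
        return None
```

To try it, the class would go in middlewares.py and be enabled through DOWNLOADER_MIDDLEWARES in settings.py, for example with the entry 'ShiCi.middlewares.RandomUserAgentMiddleware': 543.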

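4. Write the spider

Inspect the page to find where the poem data sits in the HTML, then extract it in the parse method with the selector's XPath support. The figure below shows the data and its XPath:
(Figure: poem information in the page HTML)
With these paths in hand, we can write parse: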

[john@localhost spiders]$ vim gushiwen.py

import scrapy
from ShiCi.items import ShiciItem

import re

class TextSpider(scrapy.Spider):
    name = 'gushiwen'
    allowed_domains = ['gushiwen.org']
    start_urls = ['https://www.gushiwen.org']

    def parse(self, response):
        re_br = re.compile(r'<br\s*?/?>')    # turn <br> into a newline
        re_p  = re.compile(r'<p\s*?/?>')     # drop <p>
        re_h = re.compile(r'</?\w+[^>]*>')   # drop any remaining HTML tag

        informations = response.xpath("//div[@class='left']/div[@class='sons']")
        for information in informations:
            item = ShiciItem()

            # author and content are extracted as raw HTML and cleaned below;
            # default='' keeps re.sub from failing when a node is missing
            item['name'] = information.xpath('./div[@class="cont"]/p/a/b/text()').extract_first(default='')
            item['author'] = information.xpath('./div[@class="cont"]/p[@class="source"]').extract_first(default='')
            item['content'] = information.xpath('./div[@class="cont"]/div[@class="contson"]').extract_first(default='')

            item['author'] = re_h.sub('', item['author'])

            item['content'] = re_br.sub('\n', item['content'])
            item['content'] = re_p.sub('', item['content'])
            item['content'] = re_h.sub('', item['content'])

            print(item['name'])
            print(item['author'])
            print(item['content'])

            yield item
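
To see what the three regular expressions in parse do, they can be run on a small HTML fragment outside Scrapy. This is a minimal sketch: the helper name `clean_html` and the sample fragment are made up for illustration, and the real page markup may differ.

```python
import re

RE_BR = re.compile(r'<br\s*?/?>')     # <br>, <br/> or <br />
RE_P = re.compile(r'<p\s*?/?>')       # opening <p> without attributes
RE_TAG = re.compile(r'</?\w+[^>]*>')  # any remaining opening or closing tag


def clean_html(fragment):
    """Turn <br> into newlines, then strip every other tag."""
    text = RE_BR.sub('\n', fragment)
    text = RE_P.sub('', text)
    text = RE_TAG.sub('', text)
    return text


# Hypothetical fragment shaped like a "contson" block.
sample = '<div class="contson"><p>床前明月光，<br />疑是地上霜。</p></div>'
print(clean_html(sample))
```

The order of the substitutions matters: the generic tag pattern would also match <br>, so line breaks have to be converted to newlines before the remaining tags are removed.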

5. Save the data

With the spider written, we can now get items. Next, we write ShiciPipeline in pipelines.py to save the item data. Here we save it to a txt file.

[john@localhost ShiCi]$ vim pipelines.py

class ShiciPipeline(object):
    def process_item(self, item, spider):
        # Python 3: open in text mode as UTF-8 and write str directly
        with open('古诗词.txt', 'a', encoding='utf-8') as f:
            f.write(item['name'] + '\n')
            f.write(item['author'] + '\n')
            f.write(item['content'] + '\n')
            f.write('-----------------------------------------------------------------\n')
        return item
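
Reopening the file for every item works, but Scrapy pipelines can also define open_spider and close_spider hooks, so the file can be opened once per crawl instead. The following is a sketch of that variant; the class name `ShiciFilePipeline` and the `path` argument are hypothetical and not part of the original project.

```python
import io


class ShiciFilePipeline(object):
    """Write each poem to one text file that stays open for the whole crawl."""

    def __init__(self, path='古诗词.txt'):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = io.open(self.path, 'a', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        for field in ('name', 'author', 'content'):
            self.file.write(item[field] + '\n')
        self.file.write('-' * 65 + '\n')  # separator between poems
        return item
```

To use it, the ITEM_PIPELINES entry in settings.py would point at 'ShiCi.pipelines.ShiciFilePipeline' instead.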

That essentially completes the spider. Let's test it:

[john@localhost ShiCi]$ scrapy crawl gushiwen

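(Figure: poem output in the terminal)
(Figure: the saved poem file)
Output like the above means the program has scraped the data and saved it to the txt file. However, the site has 100 pages of classical poems, so next we crawl them page by page.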

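6. Pagination

Clicking through to the next page shows the URL pattern: https://www.gushiwen.org/ followed by default_2.aspx, where only the page number changes from page to page.
As before, find the corresponding HTML on the page, then update the spider to follow the next-page URL: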

[john@localhost spiders]$ vim gushiwen.py

        # at the end of parse()
        page_url = response.xpath('//div[@class="left"]/form[@id="FromPage"]/div[@class="pagesright"]/a[@class="amore"]/@href').get()
        if page_url:
            # urljoin handles both relative and absolute hrefs
            yield scrapy.Request(response.urljoin(page_url), callback=self.parse)
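
The href of the next-page link may be relative or absolute, which is why resolving it against a base URL is safer than plain string concatenation. The sketch below uses the standard library's urljoin to show how such links resolve; the helper name `next_page_url` and the sample href values are assumptions based on the default_2.aspx pattern above, not values captured from the site.

```python
from urllib.parse import urljoin

BASE = 'https://www.gushiwen.org/'


def next_page_url(base, href):
    """Resolve a next-page href against the page it was found on."""
    return urljoin(base, href)


# Hypothetical href values a next-page link might carry.
for href in ('/default_2.aspx',
             'default_2.aspx',
             'https://www.gushiwen.org/default_2.aspx'):
    print(next_page_url(BASE, href))
```

All three forms resolve to the same absolute URL. Inside a spider, response.urljoin(href) performs the same resolution against the URL of the current response.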

Test again:

[john@localhost ShiCi]$ scrapy crawl gushiwen

This time the crawl collected five to six thousand lines of poetry, so pagination works.

7. Source code on GitHub

古诗词 (classical poetry)