Preface

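While working on a web-scraping project today I ran into an XPath parsing problem and struggled with it for more than ten minutes without solving it. What bothers me is that this topic is not hard and I have already studied it several times; retaining so little forces me to rethink what my notes are for. Clearly, writing up study notes as blog posts and regularly revisiting what I have learned has become urgent, to the point where I can no longer put it off.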

Installing Scrapy

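Installation is tedious. There are plenty of tutorials on CSDN; download the required files one step at a time and be patient. When pip cannot download a package, the first thing to try is switching to a mirror index, and only after that consider installing from a .whl file.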

Scrapy basic theory

Kept in a paper notebook: skimmed and jotted down alongside a printed book. Practice comes first.

Scrapy example notes

Downloading a novel's chapter names and their links

1. Create the project and the start.py file

Project directory
Contents of start.py

from scrapy import cmdline
cmdline.execute(['scrapy', 'crawl', 'biquge'])

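2. Outline the workflow

[1] settings.py: basic settings (request headers, robots.txt protocol, pipeline)
[2] biquge.py: fetch and parse the page, build items, and yield them
[3] items.py: declare each scraped field as a Field
[4] pipelines.py: XiaoshuoPipeline, which stores items to a file through open_spider(self, spider), process_item(self, item, spider), and close_spider(self, spider)
[5] This example scrapes a single page only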
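
To remind myself of the order in which the three pipeline hooks from step [4] run, here is a rough plain-Python simulation. It does not use Scrapy at all: `ListPipeline` and `run_pipeline` are made-up names, and the driver only imitates the call order (open once, process each item, close once) rather than Scrapy's real engine.

```python
class ListPipeline:
    """Stand-in for XiaoshuoPipeline that records calls instead of writing a file."""

    def __init__(self):
        self.events = []

    def open_spider(self, spider):
        # Called once, before any item arrives
        self.events.append("open")

    def process_item(self, item, spider):
        # Called once per yielded item; returns the item so later pipelines can see it
        self.events.append(f"item:{item['name']}")
        return item

    def close_spider(self, spider):
        # Called once, after the last item
        self.events.append("close")


def run_pipeline(pipeline, items, spider=None):
    """Hypothetical driver that imitates the order in which the hooks are invoked."""
    pipeline.open_spider(spider)
    processed = [pipeline.process_item(item, spider) for item in items]
    pipeline.close_spider(spider)
    return processed


pipeline = ListPipeline()
result = run_pipeline(pipeline, [{"name": "chapter 1"}, {"name": "chapter 2"}])
print(pipeline.events)
```

The point of the sketch is simply that the file handle opened in open_spider stays available for every process_item call and is released once in close_spider.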

3. Contents of each file

settings.py

ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
}

ITEM_PIPELINES = {
    'xiaoshuo.pipelines.XiaoshuoPipeline': 300,
}

biquge.py

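[ ] Only the chapter names and chapter links are scraped. Going one step further to scrape the chapter text returned 503 Service Unavailable, which is still unresolved. Scraping the novel text with Scrapy is not the focus here, so I am leaving it aside.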
import scrapy
from ..items import XiaoshuoItem  # relative import of items.py from the project package


class BiqugeSpider(scrapy.Spider):
    name = 'biquge'
    allowed_domains = ['paoshuzw.com']
    start_urls = ['http://www.paoshuzw.com/10/10489/']

    def parse(self, response):
        # Get the chapter names
        name_list = response.xpath("//dd//text()").getall()
        for name in name_list:
            print(name)
            item = XiaoshuoItem(name=name)
            yield item
        # Get the chapter links
        href_list = response.xpath("//dd//@href").getall()
        for href in href_list:
            print(href)
            # Only href is set here: `name` left over from the loop above
            # would always be the last chapter name
            item = XiaoshuoItem(href=href)
            yield item
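
Collecting names and links in two separate loops makes it hard to keep each name next to its own link. One alternative I want to try is selecting every `<a>` node once and reading both values from it. The sketch below practices that idea with the standard library only; the HTML snippet is made up (I am assuming each chapter is a `<dd><a href="...">` entry), and `xml.etree.ElementTree` supports only a subset of XPath, so it is a stand-in for `response.xpath`, not the same API.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the chapter list markup
SAMPLE = """
<dl>
  <dd><a href="/10/10489/1.html">Chapter 1</a></dd>
  <dd><a href="/10/10489/2.html">Chapter 2</a></dd>
</dl>
""".strip()

root = ET.fromstring(SAMPLE)

chapters = []
# ".//dd/a" selects every <a> that is a direct child of a <dd> anywhere below the root
for link in root.findall(".//dd/a"):
    # Both values come from the same node, so they cannot get out of step
    chapters.append({"name": link.text, "href": link.get("href")})

for chapter in chapters:
    print(chapter)
```

In the spider itself the equivalent would presumably be a loop over `response.xpath("//dd//a")` with relative XPath expressions on each node, but I have not run that against the real site.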

items.py

import scrapy


class XiaoshuoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    href = scrapy.Field()

pipelines.py

import json

class XiaoshuoPipeline:
    def open_spider(self, spider):
        self.fp = open("小说.txt", "w", encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + "\n")  # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
        print(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
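
Since process_item writes one JSON object per line, the output can be read back line by line. A minimal sketch of that round trip, using a temporary file and made-up items rather than the real 小说.txt, and also showing what `ensure_ascii` changes:

```python
import json
import os
import tempfile

# Made-up items standing in for what the spider would yield
items = [
    {"name": "第一章"},
    {"href": "/10/10489/1.html"},
]

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Write: same call as in process_item, one JSON object per line
with open(path, "w", encoding="utf-8") as fp:
    for item in items:
        fp.write(json.dumps(item, ensure_ascii=False) + "\n")

# Read back: parse every non-empty line into a dict
with open(path, encoding="utf-8") as fp:
    loaded = [json.loads(line) for line in fp if line.strip()]

# With ensure_ascii=True (the default) non-ASCII characters become \uXXXX escapes
escaped = json.dumps({"name": "第一章"}, ensure_ascii=True)
readable = json.dumps({"name": "第一章"}, ensure_ascii=False)

print(loaded == items)
print(escaped)
```

Either form loads back to the same data; `ensure_ascii=False` only matters for how readable the file is when opened in an editor.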

Final result

Excerpt from a separate multi-page scraping example

Useful when several pages contain the same kind of content to scrape.

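        # Excerpt from inside parse(self, response)
        next_href = response.xpath("//a[@id='amore']/@href").get()
        print(next_href)
        # At this point we only have a relative URL, i.e. half of the full address
        if next_href:
            # Check that a next link exists; otherwise the crawl can loop endlessly
            next_url = response.urljoin(next_href)  # prepend the domain automatically
            request = scrapy.Request(next_url)  # create the Request object
            # A yielded item goes to the pipeline; a yielded Request goes to the
            # scheduler, which sends the request again
            yield request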
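
As far as I understand it, for a page without a `<base>` tag `response.urljoin` resolves the link against the page's own URL the same way the standard library's `urllib.parse.urljoin` does. A small sketch to see what "adding the domain" actually produces; the example.com URLs are placeholders, not from the real project:

```python
from urllib.parse import urljoin

# Hypothetical page URL and links, for illustration only
page_url = "http://example.com/list/page1.html"

root_relative = urljoin(page_url, "/list/page2.html")        # starts from the site root
dir_relative = urljoin(page_url, "page3.html")               # relative to the current directory
absolute = urljoin(page_url, "http://example.com/other/")    # already absolute, kept as-is

print(root_relative)  # http://example.com/list/page2.html
print(dir_relative)   # http://example.com/list/page3.html
print(absolute)       # http://example.com/other/
```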

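Summary

The above is only a first use of Scrapy: it can scrape text from a website and store it in a chosen file, and in my runs the crawl was very fast.

How do I go one level deeper, that is, follow a link on the page and scrape the content behind it?
Can the way results are stored be made more flexible?
biquge.py is overly simple and handles only a single page. How would multi-page scraping be done? Repeatedly yielding a request only re-scrapes content that follows the same rules, so what about other pages whose content follows different rules?
Scrapy's real power is still to come.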

Scrapy notes 1 (scraping text with scrapy.Spider and saving it)
