Learning Scrapy (Part 2): Scraping Dynamic JS Pages with scrapy + splash (JD Product Information)

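In the previous post we learned the basics of the Scrapy framework. Ordinary static pages are fairly easy to scrape, but to speed up page loading, many parts of modern pages are generated by JS, and that is a real problem for a Scrapy crawler. The most obvious examples are shopping sites such as JD and Taobao. So how do we scrape pages whose content is loaded dynamically by JS?
There are several ways to scrape dynamic JS pages: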

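selenium+webdriver+scrapy
This approach requires a matching browser installed on the system, and the browser has to stay open for the whole run. In other words, whatever you can see in the browser, you can scrape. It is worth having when you run into especially complex CAPTCHAs, but with a browser open all the time, you can imagine how slow the crawler gets.
selenium+phantomjs
PhantomJS is a headless WebKit-based browser. It is driven the same way as webdriver, but it does not need a browser window, so it can run directly on a Linux server without a GUI.
Scrapy-splash:
Splash is a JS rendering service: a lightweight browser engine built on Twisted and QT that exposes an HTTP API directly. Being fast and lightweight makes it easy to use in distributed setups. Splash and Scrapy integrate well with each other, and scraping throughput is good.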

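Most importantly, the first two approaches are very strict about the versions of selenium, webdriver, and the browser, which all have to match each other. Installing them tends to be a hassle, so I don't recommend them. Here we will focus on the third approach.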

Installing Splash

I'm on Linux here. scrapy-splash can be installed either with pip or from source.

Install with pip

[john@localhost Tools]$ pip install scrapy-splash

Install from source
https://pypi.org/project/scrapy-splash/#files

[john@localhost Tools]$ tar -xvf scrapy-splash-0.7.2.tar.gz 
[john@localhost Tools]$ cd scrapy-splash-0.7.2
[john@localhost scrapy-splash-0.7.2]$ sudo python setup.py install

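Installing Docker

Install and start
scrapy-splash talks to the Splash HTTP API, so it needs a running Splash instance. Splash is usually run in Docker, so Docker has to be installed first. Installing Docker is fairly simple, and the runoob (菜鸟教程) tutorial explains it clearly, so I won't go into detail here. If you are unsure, work through the tutorial linked below.
Docker installation tutorial
Pull the image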

[root@localhost Tools]# docker pull scrapinghub/splash

latest: Pulling from scrapinghub/splash
b81fcfb5a085: Already exists 
ab7c210f9795: Already exists 
c2cf02f41303: Already exists 
0a8cd4ae9871: Already exists 
c98893619a46: Already exists 
8d6194e9fee4: Already exists 
5b524c6a6b81: Already exists 
798fca4cf1a6: Already exists 
8ab204d6c874: Already exists 
d8037c666db8: Already exists 
8050692155c7: Already exists 
0b2a126cacec: Already exists 
ace330c5a49e: Already exists 
4f4c3665c0f8: Already exists 
b1ebd117d6e3: Already exists 
49bfaba724a3: Already exists 
ab38ca7fb700: Already exists 
Digest: sha256:5b3c838935a5bb0533b270c71eaeedc3c9cc57c41708c9a89017494ca308413f
Status: Image is up to date for scrapinghub/splash:latest

When you see the log above, the image has been pulled successfully.

Run the Splash service

[root@localhost Tools]# docker run -p 8050:8050 scrapinghub/splash
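
Once the container is up, Splash listens on port 8050 and can be reached over plain HTTP. As a quick sanity check, here is a minimal sketch that builds a request URL for Splash's `render.html` endpoint. The helper name `render_html_url` is my own, and the default address assumes Splash runs locally as started above. Opening the resulting URL in a browser or with curl should return the rendered HTML.

```python
from urllib.parse import urlencode


def render_html_url(page_url, splash_url="http://127.0.0.1:8050", wait=0.5):
    """Build a URL for Splash's render.html endpoint.

    page_url is the page to render, wait is the number of seconds
    Splash waits after loading before it returns the HTML.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return "{}/render.html?{}".format(splash_url.rstrip("/"), query)


if __name__ == "__main__":
    # Prints the URL only; it does not contact the Splash server.
    print(render_html_url("https://www.jd.com"))
```

If the URL returns HTML, the Splash instance is ready to be used from Scrapy.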

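Scraping JD product information

With all that preparation done, let's get to the main topic. I have a bit of an interest in liquor, so here we will scrape the liquor listings on JD.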

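Create the spider
The steps and configuration for creating a spider were covered in detail in the previous post, so I won't repeat them here.
Configure the Splash service
The Splash configuration is also documented in detail on the official site, but I'll paste it here anyway. All of the following changes go in settings.py.
Splash service configuration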

[john@localhost JD]$ vim settings.py

# Address of the Splash server
SPLASH_URL = 'http://127.0.0.1:8050'

#Enable SplashDeduplicateArgsMiddleware
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Add the Splash middlewares to DOWNLOADER_MIDDLEWARES
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

#Set a custom DUPEFILTER_CLASS
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

#a custom cache storage backend
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Getting the product information

Now write wine.py the same way as before:

[john@localhost spiders]$ vim wine.py

# -*- coding: utf-8 -*-
import scrapy
from JD.items import JdItem

class WineSpider(scrapy.Spider):
    name = 'wine'
    allowed_domains = ['jd.com']
    start_urls = ['https://search.jd.com/Search?keyword=%E9%85%92%E7%B1%BB&enc=utf-8']

    def parse(self, response):
        wines = response.xpath('//ul[@class="gl-warp clearfix"]/li/div[@class="gl-i-wrap"]')
        for wine in wines:        
            item = JdItem()
            # default='' avoids calling .strip() on None when no name is found
            item['name'] = wine.xpath('./div[@class="p-name p-name-type-2"]/a[@target="_blank"]//em/text()').extract_first(default='')

            print(item['name'].strip())

            yield item
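
A side note on the name field in the spider above: `extract_first()` returns only the first text node under `em`. Assuming a product name can be split across several text nodes (for example, if the site wraps the highlighted search keyword in its own tag, which I have not verified for every listing), one option is to collect all fragments with `extract()` and join them. The helper `clean_name` below is a hypothetical sketch of that join step, not part of Scrapy:

```python
def clean_name(fragments):
    """Join text fragments of a product name and normalize whitespace.

    fragments is a list of strings, e.g. the result of calling
    .extract() on an XPath that ends in //text().
    """
    joined = ''.join(fragments)
    # split() with no arguments drops leading/trailing whitespace
    # and collapses runs of spaces and newlines.
    return ' '.join(joined.split())


if __name__ == "__main__":
    print(clean_name(['\n  茅台', '酒 ', ' 500ml\n']))
```

In the spider, this would be used as `item['name'] = clean_name(wine.xpath('...//em//text()').extract())`, with the same XPath prefix as above.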

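Scraping this way, as we did before, yields only 30 items per page. After going back and forth and looking more closely, I noticed that the URL behaves slightly differently: the page is loaded dynamically, so the page number cannot simply be incremented the way we did for the classical poetry site in the previous post. So let's try Splash.
[Image: URL]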

# -*- coding: utf-8 -*-
import scrapy
from JD.items import JdItem
from scrapy_splash import SplashRequest

class WineSpider(scrapy.Spider):
    name = 'wine'
    allowed_domains = ['jd.com']

    def start_requests(self):
        for i in range(1, 2):
            url = 'https://search.jd.com/Search?keyword=%E9%85%92%E7%B1%BB&enc=utf-8&page=' + str(i*2 -1)
            # wait is given in seconds; pass it as a number rather than a string
            yield SplashRequest(url, callback=self.parse, args={'wait': 0.5})

    def parse(self, response):
        wines = response.xpath('//ul[@class="gl-warp clearfix"]/li/div[@class="gl-i-wrap"]')
        for wine in wines:        
            item = JdItem()
            # default='' avoids calling .strip() on None when no name is found
            item['name'] = wine.xpath('./div[@class="p-name p-name-type-2"]/a[@target="_blank"]//em/text()').extract_first(default='')

            print(item['name'].strip())

            yield item

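Run it again, and it still captures only 30 items. Looking more carefully, I noticed that the HTML initially shows only 30 products, and that the list grows to 60 as you scroll down. That is the root cause. Let's modify the code once more and use a Splash Lua script to perform the scrolling in JS: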

# -*- coding: utf-8 -*-
import scrapy
from JD.items import JdItem
from scrapy_splash import SplashRequest

class WineSpider(scrapy.Spider):
    name = 'wine'
    allowed_domains = ['jd.com']

    def start_requests(self):
        script = '''
            function main(splash)
                splash:set_viewport_size(1028, 10000)
                splash:go(splash.args.url)
                local scroll_to = splash:jsfunc("window.scrollTo")
                scroll_to(0, 2000)
                splash:wait(3)

                return { 
                    html = splash:html() 
                }
            end
          '''

        for i in range(1, 2):
            url = 'https://search.jd.com/Search?keyword=%E9%85%92%E7%B1%BB&enc=utf-8&page=' + str(i*2 -1)
            yield SplashRequest(
                url,
                callback=self.parse,
                endpoint='execute',  # the execute endpoint runs the Lua script
                args={'lua_source': script, 'images': 0},  # images=0 skips image downloads
                meta={'dont_redirect': True},
            )


    def parse(self, response):
        wines = response.xpath('//ul[@class="gl-warp clearfix"]/li/div[@class="gl-i-wrap"]') 
        for wine in wines:
            item = JdItem()
            # default='' avoids calling .strip() on None when no name is found
            item['name'] = wine.xpath('./div[@class="p-name p-name-type-2"]/a[@target="_blank"]//em/text()').extract_first(default='')

            print(item['name'].strip())

            yield item

Run it again and, hey, it now captures 60 items, which means we have successfully fetched a complete page of product information.
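
The scroll distance (2000) and wait time (3 seconds) in the Lua script are hard-coded, and the right values probably depend on the page and on network speed. As a minimal sketch, the same script can be generated from Python so the values are easy to tune. The names `LUA_TEMPLATE` and `build_scroll_script` are my own, and the Lua body uses only the Splash calls already shown in the spider:

```python
# Same Lua script as in the spider, with the scroll offset and wait time
# turned into placeholders. Literal Lua braces are doubled for str.format.
LUA_TEMPLATE = '''
function main(splash)
    splash:set_viewport_size(1028, 10000)
    splash:go(splash.args.url)
    local scroll_to = splash:jsfunc("window.scrollTo")
    scroll_to(0, {scroll_y})
    splash:wait({wait})
    return {{ html = splash:html() }}
end
'''


def build_scroll_script(scroll_y=2000, wait=3):
    """Return the Lua source with the given scroll offset and wait time."""
    return LUA_TEMPLATE.format(scroll_y=int(scroll_y), wait=float(wait))


if __name__ == "__main__":
    print(build_scroll_script(scroll_y=4000, wait=5))
```

The returned string can be passed as `lua_source` in place of the `script` variable in `start_requests`.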

Then raise the page count to 100 to scrape all the liquor listed on JD. The result looks like this:
[Image: crawl results]
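
For reference, `range(1, 2)` in the spider produces only `i = 1`, so scraping 100 pages means changing it to `range(1, 101)`. The sketch below shows the resulting URLs, using the same `i*2 - 1` mapping from result page to `page` parameter as the spider. The helper name `page_urls` is my own:

```python
BASE_URL = 'https://search.jd.com/Search?keyword=%E9%85%92%E7%B1%BB&enc=utf-8&page='


def page_urls(num_pages):
    """Return the search URLs for result pages 1..num_pages.

    Result page i maps to the URL parameter page = i*2 - 1,
    i.e. 1, 3, 5, ... as in the spider above.
    """
    return [BASE_URL + str(i * 2 - 1) for i in range(1, num_pages + 1)]


if __name__ == "__main__":
    urls = page_urls(100)
    print(len(urls))
    print(urls[0])
    print(urls[-1])
```

In `start_requests`, looping over `page_urls(100)` and yielding one `SplashRequest` per URL would cover all 100 pages.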

Source code

JD product information