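Scraping JD.com products with Python (Part 2)
Scrapy parses static page content, so values that the page fills in dynamically with JavaScript, such as prices on JD.com, cannot be scraped directly.
This is where Splash comes in handy.
1. Environment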
scrapy_splash + docker
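- Install scrapy_splash
pip install scrapy-splash
- Install docker
Docker depends on a Linux environment, so on Windows you need to install a virtual machine; on a Mac you can download the installer from the official site (or use brew cask install docker; newer Homebrew versions spell this brew install --cask docker).
Once it is installed, check the docker version: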
docker --version
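- Install the splash image in docker
Find daemon.json under the Docker installation directory and set the registry mirror to the address below; image downloads will be faster.
{
  "registry-mirrors": ["http://hub-mirror.c.163.com"]
}
Then run: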
docker pull scrapinghub/splash
- Run splash in docker
docker run -p 8050:8050 scrapinghub/splash
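- Check in the browser
Open http://127.0.0.1:8050 in a browser; the Splash page should load.
2. Implementation
Splash acts as a proxy, or wrapper layer, in front of the site: it renders the page and runs its JavaScript, so the dynamically loaded data is already in the HTML we receive.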
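To make the "proxy" idea concrete: Splash exposes an HTTP API, and its render.html endpoint returns a page's HTML after JavaScript has run. The sketch below only builds the URL such a request would use. It assumes Splash is listening on http://127.0.0.1:8050 (the address used in the settings of this post); the helper name splash_render_url and the sample list URL are made up for illustration.

```python
from urllib.parse import urlencode

SPLASH_URL = "http://127.0.0.1:8050"  # assumed Splash address


def splash_render_url(target_url, wait=0.5, splash_url=SPLASH_URL):
    """Build the render.html URL that asks Splash to load target_url,
    wait `wait` seconds for scripts to run, and return the final HTML."""
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


if __name__ == "__main__":
    # Hypothetical list page, used only for illustration.
    print(splash_render_url("https://list.jd.com/list.html?cat=9987,653,655"))
```

Inside a spider there is no need to build this URL by hand: SplashRequest sends the request to Splash for you.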
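- Write product_spider.py
Each request has to be wrapped in a SplashRequest.
Core code:
from scrapy_splash import SplashRequest

def start_requests(self):
    for url in self.start_urls:
        # wait: seconds Splash waits after loading so the page's scripts can finish
        yield SplashRequest(url, self.parse, args={'wait': 0.5})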
- Crawl the next page automatically
Core code:
next_page = response.xpath('//a[@class="pn-next"]/@href').extract_first()
if next_page is not None:
    # urljoin resolves a relative href against the current page URL
    next_page = response.urljoin(next_page)
    print(next_page)
    # yield scrapy.Request(next_page, callback=self.parse)
    yield SplashRequest(next_page, self.parse, args={'wait': 0.5})
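As a rough illustration of how the next-page href becomes a full URL: response.urljoin works much like the standard library's urllib.parse.urljoin with the current page URL as the base. The sketch below uses only the standard library; the function name next_page_url and the sample URLs are hypothetical.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve the href of the 'next page' link against the current page URL.
    Returns None when the page has no next-page link."""
    if href is None:
        return None
    return urljoin(current_url, href)


if __name__ == "__main__":
    # Hypothetical list page and relative href, for illustration only.
    print(next_page_url("https://list.jd.com/list.html?cat=1&page=1",
                        "/list.html?cat=1&page=2"))
```

Because urljoin also handles absolute and protocol-relative hrefs, the same call works whichever form the link takes.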
- Write ProductItem
Core code:
import scrapy

class ProductItem(scrapy.Item):
    id = scrapy.Field()
    name = scrapy.Field()
    brandId = scrapy.Field()
    shopId = scrapy.Field()
    imgUrl = scrapy.Field()
    price = scrapy.Field()
    detailPage = scrapy.Field()
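- Write JdProductPipeline
Core code:
import pymongo

class JdProductPipeline(object):
    def __init__(self):
        # Connect to the local MongoDB; items go into the 'product' collection of the 'jd' database
        client = pymongo.MongoClient('127.0.0.1', 27017)
        db = client['jd']
        self.post = db['product']

    def process_item(self, item, spider):
        if isinstance(item, ProductItem):
            postItem = dict(item)
            # insert_one replaces Collection.insert, which was removed in PyMongo 4
            self.post.insert_one(postItem)
            print('product')
        # Always return the item so that later pipelines still receive it
        return item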
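To try the pipeline flow without a running MongoDB, here is a minimal sketch in which the collection is passed in rather than created in __init__. FakeCollection and ProductPipelineSketch are hypothetical names for illustration, not part of the project, and a plain dict stands in for ProductItem.

```python
class FakeCollection:
    """In-memory stand-in for a pymongo collection (illustration only)."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


class ProductPipelineSketch:
    """Same flow as JdProductPipeline, but the collection is injected."""

    def __init__(self, collection):
        self.collection = collection

    def process_item(self, item, spider):
        # Convert the item to a plain dict before storing it.
        self.collection.insert_one(dict(item))
        # Return the item so that later pipelines still receive it.
        return item


if __name__ == "__main__":
    fake = FakeCollection()
    pipeline = ProductPipelineSketch(fake)
    pipeline.process_item({"id": "1", "name": "demo", "price": "9.90"}, spider=None)
    print(fake.docs)
```

Passing the collection in keeps the storage step easy to swap out while checking that items are converted and handed on correctly.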
- Configure settings.py
# Splash listening address:
SPLASH_URL = 'http://127.0.0.1:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
# Downloader middlewares:
DOWNLOADER_MIDDLEWARES = {
    'jdproject.middlewares.JdprojectDownloaderMiddleware': 543,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Spider middlewares:
SPIDER_MIDDLEWARES = {
    'jdproject.middlewares.JdprojectSpiderMiddleware': 543,
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Item pipelines:
ITEM_PIPELINES = {
    'jdproject.pipelines.JdProductPipeline': 300,
    #'jdproject.pipelines.JdprojectPipeline': 300,
}
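The numbers in the middleware settings are priorities: Scrapy calls process_request on downloader middlewares in increasing order of these numbers, and process_response in the reverse order. The sketch below only sorts the dictionary from this post to show that order. It ignores the built-in middlewares that Scrapy merges in from its defaults, and request_order is a made-up helper name.

```python
DOWNLOADER_MIDDLEWARES = {
    'jdproject.middlewares.JdprojectDownloaderMiddleware': 543,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}


def request_order(middlewares):
    """Return middleware paths sorted by priority, lowest number first.
    Entries set to None are disabled and therefore skipped."""
    enabled = {path: prio for path, prio in middlewares.items() if prio is not None}
    return sorted(enabled, key=enabled.get)


if __name__ == "__main__":
    for path in request_order(DOWNLOADER_MIDDLEWARES):
        print(DOWNLOADER_MIDDLEWARES[path], path)
```

With these values, the Splash cookie middleware (723) handles a request just before SplashMiddleware (725), and both run before HttpCompressionMiddleware (810).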
3. Source code
https://github.com/jieYW/jdproject
4. Official documentation
https://splash.readthedocs.io/en/stable/
5. Results