An Introduction to Scrapy and Using It with PyCharm

Background

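I recently needed to crawl some reviews from JD.com (京东) to use as a text corpus, which meant picking a web crawling tool.
With so many crawling tools available, which one should I choose?
After some research I decided on the Scrapy framework; the reasons are given below.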

Introduction to Scrapy

GitHub source: https://github.com/scrapy/scrapy

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

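In short, as the project's GitHub description puts it, Scrapy is a fast, high-level framework for web crawling and web scraping. It is commonly used to crawl websites and extract structured data from their pages, and it serves many purposes, from data mining to monitoring and automated testing.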

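In my view, the main appeal is the framework's flexibility: it combines well with other tools, for example Django for building a public-opinion monitoring site, or Redis.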

Installing Scrapy

# I manage Python with Anaconda
conda install scrapy
# Or install directly with pip
pip install scrapy

I work on Windows and use Anaconda to manage my Python environment; the Python version is 3.6 and the IDE is PyCharm.
Everything below was done in this environment.
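
To confirm that the installation worked in the environment PyCharm will use, a quick check from Python can help. This is only a minimal sketch: the helper name installed_scrapy_version is my own and not part of Scrapy, and it returns None instead of raising when the package is missing.

```python
def installed_scrapy_version():
    """Return the installed Scrapy version string, or None if Scrapy is missing."""
    try:
        import scrapy  # imported lazily so the check itself never crashes
    except ImportError:
        return None
    return scrapy.__version__


if __name__ == '__main__':
    version = installed_scrapy_version()
    if version is None:
        print('Scrapy is not installed in this environment')
    else:
        print('Scrapy version:', version)
```

Running the command scrapy version in a terminal should report the same information.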

Demo

Create a Scrapy project

scrapy startproject tutorial

Then open the project in PyCharm. The project directory structure looks like this:

tutorial/
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
    scrapy.cfg            # deploy configuration file
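
If you want to double-check that startproject generated everything in the listing above, a small script can compare the folder against it. This is a sketch under assumptions: the file list is copied from the listing (other Scrapy versions may generate a slightly different set), and missing_project_files is a hypothetical helper name, not a Scrapy function.

```python
from pathlib import Path

# Relative paths copied from the directory listing above.
EXPECTED_FILES = [
    'scrapy.cfg',
    'tutorial/__init__.py',
    'tutorial/items.py',
    'tutorial/middlewares.py',
    'tutorial/pipelines.py',
    'tutorial/settings.py',
    'tutorial/spiders/__init__.py',
]


def missing_project_files(project_root):
    """Return the expected files that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == '__main__':
    # Assumes the script is run from the folder that contains the outer tutorial/ directory.
    missing = missing_project_files('tutorial')
    print('missing files:', missing if missing else 'none')
```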

In the tutorial/tutorial/spiders/ directory, create a file named quotes_spider.py:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split('/')[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
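
The parse method names each saved file after the page number it pulls out of the URL. The snippet below isolates that one step so it can be tried without running a crawl; page_filename is a name I made up for illustration, while the logic is copied from the spider above.

```python
def page_filename(url):
    """Mirror the file-naming logic of QuotesSpider.parse for a page URL."""
    # 'http://quotes.toscrape.com/page/1/'.split('/') ends with
    # [..., 'page', '1', ''], so index -2 is the page number
    # (the trailing slash produces the final empty string).
    page = url.split('/')[-2]
    return 'quotes-%s.html' % page


if __name__ == '__main__':
    for url in ('http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/'):
        print(url, '->', page_filename(url))
```

Note that this depends on the trailing slash: for a URL ending in /page/1 with no slash, index -2 would be the word page rather than the number.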

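Running the spider from PyCharm requires a little configuration.
1. First, create a file named begin.py. Put it in the project root, next to scrapy.cfg, so that the scrapy crawl command can find the project.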
2. Set up the run configuration.
Click Edit Configurations.

Click + and select Python.

Edit the parameters.
Name: change it to spider; Script: select the begin.py file you just created.

Put the following code in begin.py:

from scrapy import cmdline
cmdline.execute("scrapy crawl quotes".split())
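
cmdline.execute takes an argv-style list, which is why the command string is split into words before being passed in. If you later want to add command-line options, a small helper can build that list. This is a sketch, not part of Scrapy: build_crawl_args is a hypothetical name, and -s LOG_LEVEL=INFO is used only as an example of an extra option.

```python
import shlex


def build_crawl_args(spider_name, extra_options=''):
    """Build the argv-style list that scrapy.cmdline.execute expects."""
    # shlex.split honours shell-style quoting, unlike a plain str.split().
    return ['scrapy', 'crawl', spider_name] + shlex.split(extra_options)


if __name__ == '__main__':
    print(build_crawl_args('quotes'))
    print(build_crawl_args('quotes', '-s LOG_LEVEL=INFO'))
    # In begin.py the result would be passed on like this:
    #   from scrapy import cmdline
    #   cmdline.execute(build_crawl_args('quotes', '-s LOG_LEVEL=INFO'))
```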

Finally, click Run and the spider will start. You can also click Debug to step through it.
