1. Install Python and pip.
2. Run the following on the command line; downloading from the Douban mirror is noticeably faster:
pip install -i https://pypi.doubanio.com/simple/ virtualenv
3. Create a virtual environment named scrapytest: cd into the target directory on the command line, then run:
D:\Test>virtualenv scrapytest
4. Run the activate.bat script to activate the environment:
D:\Test\scrapytest\Scripts>activate.bat
5. Run python:
(scrapytest) D:\Test\scrapytest\Scripts>python
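To confirm that the python started in step 5 really comes from the virtual environment, a quick stdlib check can help (a sketch; the exact prefix paths depend on where the environment was created):

```python
import sys

# Inside an activated virtualenv, sys.prefix points at the environment
# directory (e.g. D:\Test\scrapytest); the base interpreter's path is kept
# in sys.base_prefix (venv, Python 3.3+) or sys.real_prefix (virtualenv).
in_venv = (
    getattr(sys, "real_prefix", None) is not None
    or sys.prefix != getattr(sys, "base_prefix", sys.prefix)
)
print("virtualenv active:", in_venv)
```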
6. Exit the virtual environment:
(scrapytest) D:\Test\scrapytest\Scripts>deactivate.bat
7. (Supplement) To pin the Python version when creating a virtual environment, point the -p option at a specific python.exe:
D:\Test>virtualenv -p ___\python.exe scrapytest
8. Install virtualenvwrapper-win (a helper for managing virtual environments):
D:\Test\scrapytest\Scripts>pip install virtualenvwrapper-win
9. Run workon (lists the available virtual environments):
D:\Test\scrapytest\Scripts>workon
10. Configure the WORKON_HOME environment variable (workon looks for environments there), e.g.:
D:\Envs
11. Create a virtualenv with mkvirtualenv:
D:\Envs>mkvirtualenv py3scrapy
12. Exit the virtual environment:
(py3scrapy) C:\Users\Administrator>deactivate
13. Re-enter a virtual environment with workon; no path needs to be specified this time:
C:\Users\Administrator>workon py3scrapy
14. Install packages, e.g. requests:
C:\Users\Administrator>pip install requests
15. If an install fails, download a prebuilt package from http://www.lfd.uci.edu/~gohlke/pythonlibs/ (mostly Windows builds) and install it manually:
pip install -i https://pypi.doubanio.com/simple/ scrapy
If the install errors out, download the wheel manually from the site above; then cd into the download directory, enter the virtual environment with workon py3scrapy, and install the file with pip:
C:\Users\Administrator\Downloads>workon py3scrapy
(py3scrapy) C:\Users\Administrator\Downloads>pip install lxml-4.0.0-cp34-cp34m-win32.whl
If Twisted fails to install, fetch it from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted and install it the same way.
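When picking a wheel from that site, the cpXY and platform tags in the filename (e.g. lxml-4.0.0-cp34-cp34m-win32.whl) must match your interpreter. A small stdlib sketch prints the tags to look for:

```python
import sys
import sysconfig

# cpXY tag: CPython major/minor version, e.g. cp34 for Python 3.4
cp_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)

# platform tag, e.g. win32 or win_amd64 on Windows
platform_tag = sysconfig.get_platform().replace("-", "_").replace(".", "_")

print(cp_tag, platform_tag)
```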
16.创建scrapy项目
如项目创建在 D:\workspace\PycharmProjs 中,需在cmd命令中进入该目录,输入命令workon py3scrapy进入虚拟环境
创建scrapy项目,并将项目导入到pycharm中
(py3scrapy) D:\workspace\PycharmProjs>scrapy startproject ArticleSpider
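startproject generates a standard layout; it looks roughly like this (depending on the Scrapy version, a middlewares.py may also be generated):

```
ArticleSpider/
    scrapy.cfg            # deployment configuration
    ArticleSpider/
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider code lives here
            __init__.py
```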
17. Generate the default spider code:
(py3scrapy) D:\workspace\PycharmProjs\ArticleSpider>scrapy genspider jobbole blog.jobbole.com
18. Running the crawler on the command line may fail:
(py3scrapy) D:\workspace\PycharmProjs\ArticleSpider>scrapy crawl jobbole
ImportError: No module named 'win32api'
Installing pypiwin32 fixes it. Also change ROBOTSTXT_OBEY in settings.py to False, i.e. do not obey the robots.txt protocol:
pip install -i https://pypi.douban.com/simple pypiwin32
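The relevant settings.py fragment (Scrapy writes this flag into new projects with the value True; only the value changes):

```python
# settings.py -- generated projects default to True;
# set False so requests are not filtered by the site's robots.txt
ROBOTSTXT_OBEY = False
```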
19. Debug a page through the scrapy shell:
(py3scrapy) D:\workspace\PycharmProjs\ArticleSpider>scrapy shell http://blog.jobbole.com/112517/
# Extract with XPath
>>> title = response.xpath("//*[@id='post-112517']/div[1]/h1/text()")
>>> title
[<Selector xpath="//*[@id='post-112517']/div[1]/h1/text()" data='Linux 大爆炸:一个内核,无数发行版'>]
>>> title.extract()
['Linux 大爆炸:一个内核,无数发行版']
>>> title.extract()[0]
'Linux 大爆炸:一个内核,无数发行版'
>>> create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']")
>>> create_date
[<Selector xpath="//p[@class='entry-meta-hide-on-mobile']" data='<p class="entry-meta-hide-on-mobile">\r\n\r'>]
>>> response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract()
['\r\n\r\n            2017/09/19 ·  ', '\r\n    \r\n    \r\n\r\n            \r\n            · ', '\r\n    \r\n']
>>> response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract()[0].strip().replace("·","").strip()
'2017/09/19'
>>> response.xpath("//span[contains(@class, 'vote-post-up')]")
[<Selector xpath="//span[contains(@class, 'vote-post-up')]" data='<span data-post-id="112517" class=" btn-'>]
>>> int(response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()[0])
1
>>> response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()[0]
' 收藏'
>>> response.xpath("//a[@href='#article-comment']/span").extract()[0]
'<span class="btn-bluet-bigger href-style hide-on-480"><i class="fa fa-comments-o"></i> 评论</span>'
>>> response.xpath("//div[@class='entry']").extract()[0]   # returns the article body
# Extract with CSS selectors
>>> response.css(".entry-header h1")
[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' entry-header ')]/descendant-or-self::*/h1" data='<h1>Linux 大爆炸:一个内核,无数发行版</h1>'>]
>>> response.css(".entry-header h1").extract()
['<h1>Linux 大爆炸:一个内核,无数发行版</h1>']
>>> response.css(".entry-header h1::text").extract()
['Linux 大爆炸:一个内核,无数发行版']
>>> response.css("p.entry-meta-hide-on-mobile::text").extract()[0].strip().replace("·","").strip()
'2017/09/19'
>>> response.css(".vote-post-up h10::text").extract()[0]
'1'
>>> response.css("span.bookmark-btn::text").extract()[0]
' 收藏'
>>> response.css("a[href='#article-comment'] span::text").extract()[0]
' 1 评论'
>>> response.css("div.entry").extract()[0]
'内容'
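The .strip().replace().strip() chain used on the date above also works outside the shell; a minimal stdlib sketch with the raw text node captured in the session:

```python
# Raw text node as extracted by the XPath/CSS queries in the shell session
raw = '\r\n\r\n            2017/09/19 ·  '

# Strip surrounding whitespace, drop the '·' separator, strip again
create_date = raw.strip().replace("·", "").strip()
print(create_date)  # → 2017/09/19
```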
20. Install the pillow package, which is needed for downloading images:
(py3scrapy) C:\Users\Administrator>pip install -i https://pypi.douban.com/simple pillow
21. Install mysqlclient, the driver for connecting to MySQL:
(py3scrapy) C:\Users\Administrator>pip install -i https://pypi.douban.com/simple mysqlclient