I have written crawlers with BeautifulSoup before. Recently my job requires me to publish some employment information, so I decided to try the Scrapy framework. After spending an evening reading up on it, I now have a basic grasp of Scrapy: I can crawl the data, save it as JSON and CSV, and write it into MySQL without problems, though many details still need further study. Using a framework really does save effort. The process is laid out below.
I. Installing Scrapy
Doing this on Linux would have been easier, but my Ubuntu machine already carries a lot of Hadoop-related software and I did not want to clutter it further, so I set things up on Win7 instead. I already had the Python 3.6 version of Anaconda installed, so I simply ran pip install scrapy from the command line. It pulled in a number of dependencies and then complained that Microsoft Visual C++ 2015 was missing. Even after downloading and installing that, Twisted still failed to build, so rather than dig deeper I downloaded a prebuilt Twisted .whl from the web, installed it with pip, and then the Scrapy installation succeeded.
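For reference, the command sequence looks roughly like this; the Twisted wheel filename is only a placeholder for whatever CPython 3.6, 64-bit Windows build you download:

cmd> pip install scrapy
(if Twisted fails to build, install a prebuilt wheel first, then retry)
cmd> pip install Twisted-17.9.0-cp36-cp36m-win_amd64.whl
cmd> pip install scrapy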
II. Writing the Code to Crawl Job Data from yingjiesheng.com
The main reference here is the official Scrapy documentation (Chinese translation): http://scrapy-chs.readthedocs.io/zh_CN/latest/intro/overview.html, along with a number of blog posts.
1. Create the project:
cmd> scrapy startproject jobYJS
cmd> cd ./jobYJS
cmd> scrapy genspider yinjiesheng_spider s.yingjiesheng.com
The generated source directory is shown below.
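It matches the standard layout Scrapy generates (the exact files can vary slightly by Scrapy version):

jobYJS/
    scrapy.cfg
    jobYJS/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            yinjiesheng_spider.py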
2. Writing items.py
# -*- coding: utf-8 -*-
import scrapy


class JobyjsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    companyUrl = scrapy.Field()
    pubDate = scrapy.Field()
3. Writing yinjiesheng_spider.py
# -*- coding: utf-8 -*-
import scrapy
from jobYJS.items import JobyjsItem


class YinjieshengSpiderSpider(scrapy.Spider):
    name = 'yinjiesheng_spider'
    allowed_domains = ['s.yingjiesheng.com']
    base_url = 'http://s.yingjiesheng.com/search.php?word=计算机&area=2124&sort=score&start='
    pageToCrawl = 10
    offset = 0
    start_urls = [base_url + str(offset)]

    def parse(self, response):
        li_list = response.xpath("//ul[contains(@class,'searchResult')]/li")
        for li in li_list:
            item = JobyjsItem()
            item['title'] = li.xpath("./div/h3[contains(@class,'title')]/a/text()").extract()[0]
            item['companyUrl'] = li.xpath("./div/h3[contains(@class,'title')]/a/@href").extract()[0]
            item['pubDate'] = li.xpath("./div/p/span[contains(@class,'r date')]/text()").extract()
            yield item
        if self.offset <= self.pageToCrawl * 10:
            self.offset += 10
            url = self.base_url + str(self.offset)
            yield scrapy.Request(url, callback=self.parse)
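Since the page structure of s.yingjiesheng.com may well have changed since this was written, the XPath expressions above are easiest to verify interactively in the Scrapy shell before running the full spider, roughly like this:

cmd> scrapy shell "http://s.yingjiesheng.com/search.php?word=计算机&area=2124&sort=score&start=0"
>>> response.xpath("//ul[contains(@class,'searchResult')]/li").extract_first()
>>> response.xpath("//ul[contains(@class,'searchResult')]/li/div/h3/a/text()").extract()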
4. pipelines.py: saving results as JSON
# -*- coding: utf-8 -*-
import json
import time


class JobyjsPipeline(object):
    def __init__(self):
        today = time.strftime('%Y-%m-%d', time.localtime(time.time()))
        # utf-8 so Chinese text is written correctly regardless of the Windows default encoding
        self.f = open('yinjiesheng_job' + today + '.json', 'a+', encoding='utf-8')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False)
        # one JSON object per line, otherwise the objects run together in the file
        self.f.write(content + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()
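As an aside, for plain JSON or CSV dumps Scrapy's built-in feed exports can do the same job without a custom pipeline, e.g.:

cmd> scrapy crawl yinjiesheng_spider -o yinjiesheng_job.json
cmd> scrapy crawl yinjiesheng_spider -o yinjiesheng_job.csv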
5. pipeline2CSV.py: saving results as CSV
import csv
import time


class JobyjsPipeline(object):
    def process_item(self, item, spider):
        today = time.strftime('%Y-%m-%d', time.localtime(time.time()))
        filename = today + '.csv'
        # newline='' stops the csv module from writing blank rows on Windows
        with open(filename, 'a+', newline='', encoding='utf-8') as csvfile:
            writer = csv.writer(csvfile)
            writer.writerow((item['title'], item['companyUrl'], item['pubDate']))
        return item
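Reopening the file for every item works, but as a sketch, Scrapy's built-in CsvItemExporter can keep the file open for the whole crawl instead (the CsvExportPipeline class name and the filename below are only examples):

from scrapy.exporters import CsvItemExporter


class CsvExportPipeline(object):
    def open_spider(self, spider):
        # the exporter expects a file opened in binary mode
        self.file = open('yinjiesheng_job.csv', 'wb')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()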
6. pipelines2MySQL.py: writing results to MySQL
import pymysql


def dbHandle():
    conn = pymysql.connect(
        host="localhost",
        user="root",
        passwd="123456",
        charset="utf8",
        use_unicode=False
    )
    return conn


class JobyjsPipeline(object):
    def process_item(self, item, spider):
        dbObject = dbHandle()
        cursor = dbObject.cursor()
        cursor.execute("USE crawl")
        sql = "INSERT INTO yinjiesheng(title,companyUrl,pubDate) VALUES(%s,%s,%s)"
        try:
            cursor.execute(sql, (item['title'], item['companyUrl'], item['pubDate']))
            cursor.connection.commit()
        except BaseException as e:
            print("Error here >>>>>>>>>>>>>", e, "<<<<<<<<<<<<< Error here")
            dbObject.rollback()
        return item
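This opens a new database connection for every item that passes through the pipeline. Below is a sketch of a variant that connects once per crawl instead; the MySQLPipeline name is just for illustration, it assumes the same crawl database and credentials, and it joins pubDate because the spider extracts that field as a list:

import pymysql


class MySQLPipeline(object):
    def open_spider(self, spider):
        # one connection for the whole crawl instead of one per item
        self.conn = pymysql.connect(host="localhost", user="root", passwd="123456",
                                    db="crawl", charset="utf8")
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        sql = "INSERT INTO yinjiesheng(title,companyUrl,pubDate) VALUES(%s,%s,%s)"
        try:
            # pubDate comes out of the spider as a list of strings
            pub_date = ''.join(item['pubDate'])
            self.cursor.execute(sql, (item['title'], item['companyUrl'], pub_date))
            self.conn.commit()
        except Exception as e:
            print("MySQL insert failed:", e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()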
7. settings.py
BOT_NAME = 'jobYJS'

SPIDER_MODULES = ['jobYJS.spiders']
NEWSPIDER_MODULE = 'jobYJS.spiders'

ITEM_PIPELINES = {
    'jobYJS.pipelines.JobyjsPipeline': 300,
    'jobYJS.pipeline2CSV.JobyjsPipeline': 400,
    'jobYJS.pipelines2MySQL.JobyjsPipeline': 500,
}
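Depending on the Scrapy version, the generated settings.py also contains a ROBOTSTXT_OBEY switch and throttling options that may be worth reviewing before a real run, for example:

# the generated template defaults to True; decide whether the crawl should obey robots.txt
ROBOTSTXT_OBEY = True
# wait between requests to avoid hammering the site
DOWNLOAD_DELAY = 1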
III. Test Run
I had already created a crawl database in MySQL and added a yinjiesheng table to it with the three fields above.
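The exact column definitions were not recorded; a minimal sketch of the table, assuming all three fields are stored as text, would be:

CREATE DATABASE IF NOT EXISTS crawl DEFAULT CHARSET utf8;
USE crawl;
CREATE TABLE yinjiesheng (
    title      VARCHAR(255),
    companyUrl VARCHAR(255),
    pubDate    VARCHAR(50)
) DEFAULT CHARSET=utf8;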
Running cmd> scrapy crawl yinjiesheng_spider produced the expected results. That's it for today; tomorrow I'll keep digging into this framework. Go to bed!