Copyright notice: reposting is welcome; please cite the source: https://blog.csdn.net/IT_arookie/article/details/82874541
A detailed introduction to scraping with the Scrapy framework:
Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python. It is simple, convenient, and easy to pick up.
I. How Scrapy works
1. The engine takes a URL from the scheduler for the next crawl.
2. The engine wraps the URL in a Request and hands it to the downloader; the downloader fetches the resource and wraps it in a Response.
3. The spider parses the Response.
4. If items are parsed out, they are passed to the item pipelines for further processing.
5. If URLs are parsed out, they are handed back to the scheduler to await crawling.
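The loop above can be sketched in plain Python. This is heavily simplified (Scrapy itself is asynchronous and event-driven); `download`, `parse`, and `pipeline` here are hypothetical stand-ins, only the control flow is illustrated:

```python
# A simplified, synchronous sketch of the engine loop described above.
# download/parse/pipeline are hypothetical stand-ins for Scrapy's components.
from collections import deque

def crawl(start_urls, download, parse, pipeline):
    scheduler = deque(start_urls)           # the scheduler holds pending URLs
    while scheduler:
        url = scheduler.popleft()           # 1. engine takes a URL from the scheduler
        response = download(url)            # 2. downloader fetches it into a response
        for result in parse(response):      # 3. spider parses the response
            if isinstance(result, dict):    # 4. items go to the item pipeline
                pipeline(result)
            else:                           # 5. new URLs go back to the scheduler
                scheduler.append(result)

items = []
crawl(["http://example.com/1"],
      download=lambda url: "<html>" + url + "</html>",
      parse=lambda resp: [{"page": resp}],
      pipeline=items.append)
print(items)
```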
II. The main Scrapy files and what they do
1. items.py: containers (fields) for the data to be scraped.
2. spiders: user-written classes that scrape information from a domain (or group of domains); the spider files.
3. pipelines.py: the item pipeline file, which takes the Items and stores them in different file formats.
4. settings.py: the project configuration file, where everything is enabled and tuned.
The following walks through Scrapy in detail by scraping 51job.
Project walkthrough 1: a simple scrape of 51job job listings
Requires the scrapy package (installing Anaconda is recommended).
In cmd, under the directory where you want the code to live, run:
scrapy startproject job51
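If the command succeeds, startproject generates a project skeleton roughly like this (exact contents may vary slightly between Scrapy versions):

```
job51/
    scrapy.cfg            # deploy/run configuration
    job51/
        __init__.py
        items.py          # field containers
        middlewares.py    # downloader/spider middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider files go here
            __init__.py
```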
1. First, edit items.py
Define the field containers you need:
import scrapy
#from scrapy import Item, Field

class Job51Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    jobName = scrapy.Field()
    companyName = scrapy.Field()
    address = scrapy.Field()
    money = scrapy.Field()
    ptime = scrapy.Field()
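An Item behaves much like a dict whose keys are restricted to the declared fields. A rough stdlib-only analogue (Job51ItemSketch is hypothetical; the real scrapy.Item does more):

```python
# Rough stdlib analogue of scrapy.Item's behavior: dict-style access,
# but only declared fields may be set. Not real Scrapy code.
class Job51ItemSketch(dict):
    fields = {"jobName", "companyName", "address", "money", "ptime"}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key} is not a declared field")
        super().__setitem__(key, value)

item = Job51ItemSketch()
item["jobName"] = "Python Engineer"
print(dict(item))
```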
2. In the spiders folder, create a new spider file, job51.py
Almost all of this is boilerplate; just fill in your own details:
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from ..items import Job51Item  # import the item class defined in items.py

class Job51(CrawlSpider):  # subclass CrawlSpider (boilerplate)
    name = 'job51'  # spider name, same as the file name (boilerplate)
    start_urls = [  # list of start URLs
        'https://search.51job.com/list/000000,000000,0000,00,9,99,%25E4%25BA%25BA%25E5%25B7%25A5%25E6%2599%25BA%25E8%2583%25BD,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=4&dibiaoid=0&address=&line=&specialarea=00&from=&welfare='
    ]
    def parse(self, response):  # parsing callback (boilerplate name)
        selector = Selector(response)
        divs = selector.xpath('//div[@id="resultList"]//div[@class="el"]')
        for each in divs:
            item = Job51Item()  # build a fresh item for each listing
            jobName = each.xpath('./p/span/a/@title').extract()
            companyName = each.xpath('./span[1]/a/text()').extract()
            address = each.xpath('./span[2]/text()').extract()
            money = each.xpath('./span[3]/text()').extract()
            ptime = each.xpath('./span[4]/text()').extract()
            print(jobName, address, companyName, money, ptime)
            item['jobName'] = jobName[0]
            item['companyName'] = companyName[0]
            item['address'] = address[0]
            if money:  # the salary field can be empty
                item['money'] = money[0]
            else:
                item['money'] = '面谈'  # "negotiable"
            item['ptime'] = ptime[0]
            yield item  # hand the item over to the pipelines
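To see what those XPath expressions pull out, here is a standalone sketch using only the standard library's ElementTree (not Scrapy's Selector, whose XPath support is richer) against a toy fragment shaped like 51job's result list. The HTML content is invented:

```python
# Standalone demo of the extraction logic above, on an invented HTML fragment
# shaped like 51job's result list. Uses ElementTree, not Scrapy's Selector.
import xml.etree.ElementTree as ET

html = """
<div id="resultList">
  <div class="el">
    <p><span><a title="Python Engineer">Python Engineer</a></span></p>
    <span><a>ACME Inc.</a></span>
    <span>Beijing</span>
    <span>15-20K</span>
    <span>09-28</span>
  </div>
</div>
"""

root = ET.fromstring(html)
for each in root.findall(".//div[@class='el']"):
    spans = each.findall("./span")         # the four direct <span> children
    job = {
        "jobName": each.find("./p/span/a").get("title"),
        "companyName": spans[0].find("a").text,
        "address": spans[1].text,
        "money": spans[2].text or "面谈",  # fall back when salary is missing
        "ptime": spans[3].text,
    }
    print(job)
```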
3. Edit pipelines.py to store the items in different formats (txt, Excel, CSV, JSON, MongoDB, MySQL)
A pipeline class always has this shape:
class SomePipeline(object):
    def __init__(self):
        pass
    def process_item(self, item, spider):
        # process the item here
        return item
    def close_spider(self, spider):
        pass
Saving to Excel:
from openpyxl import Workbook

class saveToExcel(object):
    def __init__(self):
        self.wb = Workbook()
        self.ws = self.wb.active
        # header row: job title, company, location, salary, date posted
        self.ws.append(['职位名', '公司名', '工作地点', '薪资', '发布日期'])
    def process_item(self, item, spider):
        # item behaves like a dict
        self.ws.append(list(dict(item).values()))
        return item
    def close_spider(self, spider):
        self.wb.save('人工智能.xlsx')
Saving to CSV (the default export): the field names are ordered automatically.
#keep the default pipeline as-is, but enabling it in settings.py needs the special FEED lines shown in section 5
class DoubanPipeline(object):
    def process_item(self, item, spider):
        return item
Another way to save CSV (opening the file with newline='' avoids the blank-line problem):
import csv

class saveToCsv(object):
    def __init__(self):
        # keep the file open for the whole crawl; newline='' suppresses blank rows
        self.csvfile = open('人工智能.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.writer(self.csvfile)
        # header row: job title, company, location, salary, posting time
        self.writer.writerow(['职位名', '公司名', '工作地点', '薪资', '发布时间'])
    def process_item(self, item, spider):
        # item behaves like a dict
        d = dict(item)
        self.writer.writerow([d['jobName'], d['companyName'], d['address'], d['money'], d['ptime']])
        return item
    def close_spider(self, spider):
        self.csvfile.close()
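The blank-line problem comes from how Windows handles line endings; in Python 3 the usual fix is opening the file with newline=''. A minimal self-contained demo (the file name and row values are made up):

```python
# Demo: csv.writer with newline='' produces no blank rows between lines.
import csv
import os
import tempfile

row = ["Python Engineer", "ACME Inc.", "Beijing", "15-20K", "09-28"]
path = os.path.join(tempfile.gettempdir(), "job51_demo.csv")

with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["职位名", "公司名", "工作地点", "薪资", "发布时间"])  # header
    writer.writerow(row)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)  # exactly two lines: header + one data row
```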
Saving as JSON:
import json

class saveToJson(object):
    def __init__(self):
        self.file = open('job51.json', 'w', encoding='utf-8')
    def process_item(self, item, spider):
        # dump each item as one line of JSON
        line = json.dumps(dict(item), ensure_ascii=False)  # keep non-ASCII readable
        self.file.write(line + '\n')
        return item
    def close_spider(self, spider):
        self.file.close()
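The ensure_ascii=False flag is what keeps Chinese text readable in the output file. A quick illustration (the item values are made up):

```python
# ensure_ascii=False keeps non-ASCII characters literal instead of \uXXXX escapes.
import json

item = {"jobName": "算法工程师", "money": "面谈"}
escaped = json.dumps(item)                       # default: ASCII-escaped
readable = json.dumps(item, ensure_ascii=False)  # literal Chinese characters
print(escaped)
print(readable)
```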
Saving to MongoDB:
from pymongo import MongoClient

class saveToMongodb(object):
    def __init__(self):
        conn = MongoClient('localhost')  # connect to the local server
        db = conn.newdb                  # open (or create) the database
        self.col = db.newjob51           # open (or create) the collection
        self.col.delete_many({})         # clear any documents left from earlier runs
    def process_item(self, item, spider):
        self.col.insert_one(dict(item))
        return item
    def close_spider(self, spider):
        print('finished storing')
4. Browser headers and proxies can be set in middlewares.py
First prepare a User-Agent list and a proxy IP list.
Both lists can live in this file or in settings.py; settings.py is recommended, but then they must be imported into this file.
user_agent = [  # User-Agent pool
"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
"Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
"Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
"Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
"UCWEB7.0.2.37/28/999",
"NOKIA5700/ UCWEB7.0.2.37/28/999",
"Openwave/ UCWEB7.0.2.37/28/999",
"Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
# iPhone 6:
"Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25",
]
ips = [
'HTTP://116.1.11.19:80',
'HTTPS://140.207.50.246:51426',
'HTTP://118.178.227.171:80',
'HTTP://118.190.95.43:9001',
'HTTP://61.135.217.7:80',
'HTTP://106.75.225.83:808',
'HTTPS://106.75.226.36:808',
'HTTP://118.190.95.35:9001',
'HTTPS://123.207.30.131:80',
'HTTPS://60.12.89.218:57299',
'HTTPS://124.235.135.74:80',
'HTTPS://59.45.16.10:59156',
'HTTP://218.23.124.52:59361',
'HTTP://124.234.157.228:80',
'HTTP://58.51.83.102:808',
'HTTP://110.73.42.11:8123',
'HTTPS://222.242.155.69:40919',
'HTTP://110.73.10.32:8123',
'HTTPS://220.172.40.190:80',
'HTTPS://60.211.192.54:40700',
'HTTP://182.88.135.132:8123',
'HTTPS://122.227.182.102:33174',
'HTTPS://219.139.35.70:42993',
'HTTPS://222.76.204.110:808',
'HTTPS://222.245.165.154:36522']
The proxy IPs above have most likely expired by now, so the proxy can be left out.
Then it only takes two lines inside process_request():
import random  # at the top of middlewares.py

def process_request(self, request, spider):  # method of the downloader middleware
    # set a random browser User-Agent
    request.headers['User-Agent'] = random.choice(user_agent)
    # set a random proxy (left commented out)
    #request.meta['proxy'] = random.choice(ips)
    return None
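What those two lines do can be exercised outside Scrapy with a stand-in request object (FakeRequest is hypothetical; Scrapy's real Request stores headers differently):

```python
# Standalone sketch of the middleware logic: attach a random User-Agent
# (and optionally a proxy) to each outgoing request. FakeRequest is a stand-in.
import random

user_agent = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]
ips = ["http://127.0.0.1:8080"]  # placeholder proxy list

class FakeRequest:
    def __init__(self):
        self.headers = {}
        self.meta = {}

def process_request(request):
    request.headers["User-Agent"] = random.choice(user_agent)
    # request.meta["proxy"] = random.choice(ips)  # enable if the proxies work
    return None

req = FakeRequest()
process_request(req)
print(req.headers["User-Agent"])
```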
5. The settings.py file
None of the configuration above takes effect until it is enabled in settings.py:
BOT_NAME = 'job51'
SPIDER_MODULES = ['job51.spiders']
NEWSPIDER_MODULE = 'job51.spiders'
#for the default CSV export (pipelines.py left untouched), add these two lines:
# FEED_URI = '51job.csv'
# FEED_FORMAT = 'csv'
#download delay; usually worth turning on
DOWNLOAD_DELAY = 2
#enable the item pipelines
ITEM_PIPELINES = {
    'job51.pipelines.saveToCsv': 300,
    'job51.pipelines.saveToExcel': 310,
    'job51.pipelines.saveToJson': 320,
    'job51.pipelines.saveToMongodb': 330,
}
#enable the downloader middleware so the User-Agent setting takes effect
DOWNLOADER_MIDDLEWARES = {
    'job51.middlewares.Job51DownloaderMiddleware': 543,
}
6. Finally, run the spider
Method 1:
Open cmd in the directory that contains scrapy.cfg and run:
scrapy crawl job51
Method 2:
Run it from PyCharm.
Method 3:
Create a main.py in the directory that contains scrapy.cfg:
from scrapy import cmdline
cmdline.execute('scrapy crawl job51'.split())
Running that file starts the crawl.