day07
1. response.xpath('xpath expression')
   - an xpath expression without text() yields a list of selector objects
   - an xpath expression with text() yields a list of text selector objects
   - extract() serializes every element of the list into a Unicode string
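The text()-vs-no-text() difference can be illustrated outside Scrapy with lxml, whose xpath() follows the same rules (assumes lxml is installed; the HTML fragment is made up for illustration):

```python
from lxml import etree

# a made-up HTML fragment
html = etree.HTML('<ul><li>tomb1</li><li>tomb2</li></ul>')

# without text(): the result is a list of element (selector-like) objects
elements = html.xpath('//li')
print(elements)

# with text(): the result is the text content of the matched nodes
texts = html.xpath('//li/text()')
print(texts)   # ['tomb1', 'tomb2']
```

In Scrapy, response.xpath('//li/text()').extract() would likewise return the list of Unicode strings.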
2. MongoDB persistent storage
   - settings.py: define the related variables
MONGODB_HOST = 'localhost'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'daomudb'
MONGODB_DOCNAME = "daomubiji"
   - pipelines.py: write the pipeline
import pymongo
from project_name import settings  # the project's settings module (placeholder name)

class DaomuPipeline(object):
    def __init__(self):
        host = settings.MONGODB_HOST
        port = settings.MONGODB_PORT
        dbName = settings.MONGODB_DBNAME
        docName = settings.MONGODB_DOCNAME
        conn = pymongo.MongoClient(host=host, port=port)
        db = conn[dbName]          # index access instead of exec()
        self.myset = db[docName]

    def process_item(self, item, spider):
        self.myset.insert_one(dict(item))  # write each item to MongoDB
        return item
   - settings.py: register the pipeline
ITEM_PIPELINES = {'project_name.pipelines.ClassName': 300}
4. MySQL persistent storage
   - settings.py: define the related variables
   - pipelines.py: define the pipeline class
   - settings.py: register the pipeline
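The MySQL pipeline follows the same three steps as the MongoDB one. A minimal sketch of the pipeline pattern, using the stdlib sqlite3 module as a stand-in for pymysql so it runs anywhere (for real use swap in pymysql.connect(...) driven by MYSQL_* settings; the table and item fields here are made up):

```python
import sqlite3

class NovelPipeline(object):
    """Same open/insert/close pattern a MySQL pipeline would use."""
    def __init__(self):
        # with pymysql: pymysql.connect(host=..., user=..., password=..., db=...)
        self.conn = sqlite3.connect(':memory:')
        self.cursor = self.conn.cursor()
        self.cursor.execute('CREATE TABLE novel (name TEXT, zh TEXT)')

    def process_item(self, item, spider):
        self.cursor.execute('INSERT INTO novel VALUES (?, ?)',
                            (item['name'], item['zh']))
        self.conn.commit()
        return item   # let later pipelines see the item too

    def close_spider(self, spider):
        self.conn.close()

# quick usage check with a fake item
pipe = NovelPipeline()
pipe.process_item({'name': 'daomubiji', 'zh': 'chapter 1'}, spider=None)
rows = pipe.cursor.execute('SELECT * FROM novel').fetchall()
print(rows)   # [('daomubiji', 'chapter 1')]
```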
5. Scrapy module methods
     yield scrapy.Request(url, callback=parse_method_name)
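yield scrapy.Request(url, callback=...) just schedules the URL and remembers which method should parse its response. The dispatch idea can be sketched without Scrapy; every name and URL below is hypothetical:

```python
from collections import deque

def parse_list(url):
    # pretend this parses a list page and discovers one detail page
    print('list page:', url)
    return [('http://example.com/item/1', parse_detail)]

def parse_detail(url):
    print('detail page:', url)
    return []   # no further requests

# a toy scheduler: (url, callback) pairs, like Scrapy's request queue
queue = deque([('http://example.com/list', parse_list)])
visited = []
while queue:
    url, callback = queue.popleft()
    visited.append(url)
    queue.extend(callback(url))
print(visited)
```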
day08
1. How to set a random User-Agent
   - settings.py (for switching among a few User-Agents only; not recommended)
     - define the USER_AGENT variable
     - DEFAULT_REQUEST_HEADERS = {"User-Agent": " ", }
   - via a downloader middleware
     - create user_agents.py in the project directory, holding a large list of agents
user_agents = [' ',' ',' ',' ',' ']
     - middlewares.py: write a RandomUserAgentMiddleware class
from project_name.user_agents import user_agents
import random

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(user_agents)
     - configure settings.py
DOWNLOADER_MIDDLEWARES = {'project_name.middlewares.RandomUserAgentMiddleware': 1}
   - alternatively, add the class directly in middlewares.py
import random

class RandomUserAgentMiddleware(object):
    def __init__(self):
        self.user_agents = [' ', ' ', ' ', ' ']

    def process_request(self, request, spider):
        # note: request.headers (plural), not request.header
        request.headers['User-Agent'] = random.choice(self.user_agents)
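Either variant can be sanity-checked without running Scrapy by calling process_request on a stand-in request object (DummyRequest and the UA strings below are made up for the demonstration):

```python
import random

user_agents = ['UA-1', 'UA-2', 'UA-3']   # stand-ins for real agent strings

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # same assignment the real middleware performs
        request.headers['User-Agent'] = random.choice(user_agents)

class DummyRequest(object):
    """Minimal object with the .headers dict the middleware touches."""
    def __init__(self):
        self.headers = {}

req = DummyRequest()
RandomUserAgentMiddleware().process_request(req, spider=None)
print(req.headers['User-Agent'])   # one of the three agents, at random
```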
2. Setting a proxy (DOWNLOADER_MIDDLEWARES)
   - middlewares.py: add a ProxyMiddleware class
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://180.167.162.166:8080"
   - settings.py: add
DOWNLOADER_MIDDLEWARES = {
    'Tengxun.middlewares.RandomUserAgentMiddleware': 543,
    'Tengxun.middlewares.ProxyMiddleware': 250,
}
3. Image pipeline: ImagesPipeline
   - Case study: scraping Douyu images (mobile app)
     - Scrape targets
       - image link
       - streamer name
       - room number
       - city
     - save all images to /home/tarena/day08/Douyu/Douyu/Images
   - Steps
     - prerequisite: the phone and the computer are on the same LAN
     - Fiddler packet-capture tool
         Connections: Allow remote computers to connect
     - Win+R -> cmd -> ipconfig -> note the Ethernet IP address
     - Configure the phone
         phone browser -> http://<IP address>:8888
         download the FiddlerRoot certificate
     - Install the certificate
         Settings -> More -> Install from storage device
     - Set the phone's proxy
         long-press the Wi-Fi network -> proxy
         IP: the computer's IP address
         port: the port number (8888 above)
4. How to use ImagesPipeline
   - work in pipelines.py
     - import the module
         from scrapy.pipelines.images import ImagesPipeline
     - define a custom class inheriting from ImagesPipeline
import scrapy  # needed for scrapy.Request

class DouyuImagePipeline(ImagesPipeline):
    # override get_media_requests
    def get_media_requests(self, item, info):
        # request the image URL; the pipeline saves the file locally
        yield scrapy.Request(url=item['link'])
   - settings.py: define the image storage path
IMAGES_STORE = '/home/tarena/day08/Douyu/Douyu/Images'
5. The dont_filter parameter
     scrapy.Request(url, callback=..., dont_filter=False)
     dont_filter=False -> the URL is deduplicated automatically (the default)
     dont_filter=True  -> the URL is not deduplicated
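Scrapy's default dupefilter essentially keeps a set of fingerprints of requests it has already scheduled; dont_filter=True bypasses that set. A minimal sketch of the idea, using the raw URL as the fingerprint:

```python
class DupeFilter(object):
    """Toy version of Scrapy's request dedup (fingerprint = the URL here)."""
    def __init__(self):
        self.seen = set()

    def should_schedule(self, url, dont_filter=False):
        if dont_filter:
            return True          # dont_filter=True: never deduplicated
        if url in self.seen:
            return False         # already seen -> request is dropped
        self.seen.add(url)
        return True

f = DupeFilter()
print(f.should_schedule('http://example.com/'))                    # True
print(f.should_schedule('http://example.com/'))                    # False
print(f.should_schedule('http://example.com/', dont_filter=True))  # True
```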
6. Integrating selenium + phantomjs with Scrapy
   - create the project: JD
   - add selenium in middlewares.py
     - import the module: from selenium import webdriver
     - define the middleware
class SeleniumMiddleware(object):
    ...
    def process_request(self, request, spider):
        # note: fetch the request's url
        self.driver.get(request.url)
   - settings.py
DOWNLOADER_MIDDLEWARES = {"Jd.middlewares.SeleniumMiddleware": 20}
7. Simulated login with Scrapy
   - create the project: Renren
   - create the spider file
8. Machine vision and tesseract
   - OCR (Optical Character Recognition)
       scanning characters: character shapes --> electronic text; OCR has many underlying recognition libraries
   - tesseract (an OCR engine maintained by Google; a command-line tool, not an importable module)
   - Installation
     - Windows: download the installer
         https://sourceforge.net/projects/tesseract-ocr-alt/files/tesseract-ocr-setup-3.02.02.exe/download
         after installing, add it to the PATH environment variable
     - Ubuntu: sudo apt-get install tesseract-ocr
     - Mac: brew install tesseract
   - Verification
       terminal: tesseract test1.jpg text1.txt
   - Install the pytesseract module
       python -m pip install pytesseract
       # it has few methods; in practice only one is used, image to string: image_to_string
   - Python's standard imaging library
       from PIL import Image
   - Example
     - write the captcha image to disk in wb mode
     - image = Image.open('captcha.jpg')
     - s = pytesseract.image_to_string(image)
import pytesseract
from PIL import Image

image = Image.open("test1.jpg")
string = pytesseract.image_to_string(image)
print(string)
   - tesseract case study: logging in to Douban (captcha entry)
'''02_tesseract Douban login example.py'''
import requests
from lxml import etree
import pytesseract
from PIL import Image
from selenium import webdriver

url = "https://www.douban.com/"
headers = {"User-Agent": "Mozilla/5.0"}
# first request the site to get the html
res = requests.get(url, headers=headers)
res.encoding = "utf-8"
html = res.text
# pull the captcha image link out with xpath
parseHtml = etree.HTML(html)
s = parseHtml.xpath('//img[@class="captcha_image"]/@src')[0]
# request the captcha image link to get the bytes
res = requests.get(s, headers=headers)
html = res.content
# save the image locally
with open("zhanshen.jpg", "wb") as f:
    f.write(html)
# image -> string (open the file just saved, not test1.jpg)
image = Image.open("zhanshen.jpg")
s = pytesseract.image_to_string(image)
print(s)
# type the string into the captcha box
driver = webdriver.Chrome()
driver.get(url)
driver.find_element_by_name("captcha-solution").send_keys(s)
driver.save_screenshot("captcha_entered.png")
driver.quit()
9. Introduction to distributed crawling
   - Requirements
     - multiple servers (data center, cloud servers)
     - network bandwidth
   - Distributed crawler patterns
     - master-slave
     - peer-to-peer
   - scrapy-redis
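In the master-slave pattern one node holds the shared URL queue (scrapy-redis uses Redis for this) and the workers pull from it. A stdlib sketch of the idea, with threads standing in for machines and made-up URLs:

```python
import queue
import threading

url_queue = queue.Queue()   # stands in for the shared Redis queue
results = []
lock = threading.Lock()

def worker():
    while True:
        url = url_queue.get()
        if url is None:          # poison pill: shut this worker down
            break
        with lock:
            results.append(url)  # a real worker would fetch and parse here
        url_queue.task_done()

# the "master" pushes work; two "slave" workers consume it
threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for i in range(5):
    url_queue.put('http://example.com/page/%d' % i)
url_queue.join()                 # wait until every URL is processed
for _ in threads:
    url_queue.put(None)
for t in threads:
    t.join()
print(sorted(results))
```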
Today's examples
nvj1