Preface:
1. Target site: Pear Video (梨视频)
2. Note: this is a commercial website; this example is for learning and testing only, not for any other purpose.
3. Tech stack: requests + re + os
4. Code:
'''
Crawl the Pear Video site and save the videos locally.
version: 01
author: 金鞍少年
date: 2020-03-19
'''
import os
import re

import requests


class PearVideo:
    def __init__(self):
        self.url = 'https://www.pearvideo.com/popular'
        self.path = r'.\video'
        self.headers = {
            "Referer": "https://www.pearvideo.com/",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
        }

    def clean_zh_text(self, text):
        # Strip characters that are unsafe in file names;
        # keep Chinese characters, English letters, and digits.
        comp = re.compile('[^\u4e00-\u9fa5A-Za-z0-9]')
        return comp.sub('', text)

    def get_html(self, url):
        # Fetch a page. Note: do not name the response `re` --
        # that would shadow the regex module inside the method.
        resp = requests.get(url, headers=self.headers)
        if resp.status_code == 200:
            return resp.text

    def get_page_urls(self, html):
        # Collect the detail-page URLs from the popular list.
        urls = []
        page_ids = re.findall('<a href="(.*?)" class="popularembd actplay">', html)
        for i in page_ids:
            urls.append('https://www.pearvideo.com/' + i)
        return urls

    def download_video(self, urls):
        # Save each video to the local directory.
        os.makedirs(self.path, exist_ok=True)  # create the folder if it does not exist
        for url in urls:
            html = self.get_html(url)
            video_url = re.findall(',sdUrl="",ldUrl="",srcUrl="(.*?)",', html)[0]
            video_name = re.findall('data-type="2" data-title="(.*?)" ', html)[0]
            name = self.clean_zh_text(video_name)  # strip special characters from the title
            path = os.path.join(self.path, name)   # build the save path
            res = requests.get(video_url, headers=self.headers)
            with open(path + '.mp4', 'wb') as f:   # extension fixed: .mp4, not .pm4
                f.write(res.content)
            print('Downloaded video: %s' % name)

    def run(self):
        # Business logic: list page -> detail URLs -> downloads.
        self.download_video(self.get_page_urls(self.get_html(self.url)))


if __name__ == '__main__':
    g = PearVideo()
    g.run()
    print('All downloads finished!')
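The core extraction step is a non-greedy `re.findall` capture. The fragment below isolates that step against a fabricated HTML snippet (the URL and snippet are made up for illustration; real Pear Video pages may differ):

```python
import re

# A fabricated fragment mimicking the markup the script matches.
html = ',sdUrl="",ldUrl="",srcUrl="https://video.pearvideo.com/mp4/demo.mp4",'

# (.*?) captures the shortest string between srcUrl=" and the next ",
src = re.findall(',sdUrl="",ldUrl="",srcUrl="(.*?)",', html)
print(src)  # -> ['https://video.pearvideo.com/mp4/demo.mp4']
```

The non-greedy `?` matters: a greedy `(.*)` would run past the first closing quote if the line contained more quoted fields after `srcUrl`.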
Summary:
1. Weakness: my Python crawling skills are still limited, so this script cannot handle dynamically loaded content yet. Keep at it!
2. Small highlight: a helper that pre-processes titles with a regular expression, stripping the special characters that would otherwise make the file names invalid when saving videos.
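The title-cleanup idea from point 2 can be sketched in isolation (the sample title below is made up):

```python
import re

def clean_title(text):
    # Keep Chinese characters, English letters, and digits;
    # drop everything else (spaces, punctuation, symbols).
    return re.sub(r'[^\u4e00-\u9fa5A-Za-z0-9]', '', text)

print(clean_title('梨视频: Top-10 精彩瞬间!!'))  # -> 梨视频Top10精彩瞬间
```

One character class does all the work: `\u4e00-\u9fa5` covers the common CJK range, and the leading `^` inverts the class so everything outside it is removed.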