Scraping wallpapers with beautifulsoup4
beautifulsoup4 is commonly used for scraping web pages. I didn't want to install a wallpaper app but still wanted my wallpaper to change automatically, so I wrote this crawler.
Analyzing the target site
On wallpaperup there is a search box, so we can obviously search for every image by submitting an empty query. Below the search box there are several filter options, which will help us analyze the URLs of the search results.
URL analysis
- Searching with no keyword and no filters gives the results URL https://www.wallpaperup.com/most/popular
- Adding filters one at a time: specifying only the resolution gives https://www.wallpaperup.com/resolution/2560/1440, and adding a 16:9 aspect ratio changes it to https://www.wallpaperup.com/search/results/ratio:1.78+resolution:2560x1440. From this we can see that results are sorted by popularity by default, and that a results URL with several filters should take the form https://www.wallpaperup.com/search/results/${key1}:${value1}+${key2}:${value2}.
- The second point is only a guess, so we still need to confirm that the same format also works with a single filter. https://www.wallpaperup.com/search/results/resolution:2560x1440 is indeed accessible.
- The crawler obviously should not stop at one page. Turning to the next page appends "/2" to the URL, so page x of the results should live at ${search_url}/x.
That concludes the first stage, the URL analysis.
Analyzing the content to scrape
- What we want are the images in the search results, so right-click any image and choose Inspect. The complete HTML element for one image is:
<div class="thumb-adv " data-ratio="1.7777777777778" style="width: 513px; height: 288.562px; top: 0px; left: 0px;"><figure class="black"><a href="/9024/Clouds_cityscapes_architecture_buildings_skyscrapers.html" title="View wallpaper" class="thumb-wrp" style="height:0;padding-bottom:56.325301204819%;"><img width="2560" height="1440" class="thumb black lazy " data-src="https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-187.jpg" data-srcset="https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-500.jpg 889w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-375.jpg 667w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-250.jpg 444w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-187.jpg 332w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-125.jpg 222w" alt="Clouds cityscapes architecture buildings skyscrapers wallpaper" data-wid="9024" data-group="gallery" sizes="513px" srcset="https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-500.jpg 889w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-375.jpg 667w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-250.jpg 444w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-187.jpg 332w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-125.jpg 222w" src="https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-187.jpg"></a><figcaption class="attached-bottom on-hover compact"><div class="sections center-y"><div class="section-left"><div class="button manage compact xsmall transparent" 
title="Manage"><i class="icon"></i></div></div><span class="section-center forced" title="Resolution">2560x1440</span><div class="section-right"><div class="favoriter subscriber no-remote-state no-label multiple joined no-separators center-x" data-state="null" data-state-batch-url="/favorite/get_states_batch/9024+42467+248989+15719+218610+12857+104187+52269+100670+667839+234960+172181+917952+19472+104183+156805+243384+54956+232856+9634+67838+189500+172180+675992" data-state-url="/favorite/get_state/9024" data-url="/favorite/do_toggle/9024"><div title="Add to favorites" class="toggle button bordered compact xsmall transparent" data-text-on="Favorite" data-text-off="Favorite" data-icon=""></div><div class="button bordered compact xsmall transparent remote-modal-trigger" data-url="/favorite/move_modal/9024" title="Move favorite"><i class="icon"></i></div></div></div></div></figcaption></figure></div>
and, on the wallpaper's detail page, a similar "thumb-wrp" element whose <img> carries a data-original attribute pointing to the full-resolution file (that second snippet is omitted here).
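Before writing the crawler itself, we can verify offline that bs4 pulls out the two attributes we care about. A minimal sketch, parsing a trimmed copy of the markup above (html.parser is used here so the snippet needs no lxml):

```python
from bs4 import BeautifulSoup

# trimmed copy of the thumbnail element shown above (most attributes omitted)
html = '''
<div class="thumb-adv">
  <figure class="black">
    <a href="/9024/Clouds_cityscapes.html" title="View wallpaper" class="thumb-wrp">
      <img class="thumb black lazy"
           data-src="https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-187.jpg"
           alt="Clouds cityscapes wallpaper">
    </a>
  </figure>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a', attrs={'title': 'View wallpaper'})
print(a['href'])          # detail-page path
print(a.img['data-src'])  # thumbnail URL
```

Both attributes the two strategies below rely on (the detail-page href and the thumbnail data-src) come out of this one element.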
Devising a scraping strategy
The site analysis above suggests two strategies: 1. extract every data-src on the results page and strip the trailing "-<number>" size suffix to obtain the image URL; 2. first record all the detail-page URLs, then read data-original from each detail page.
Code implementation
Generating the URL
Record the filter options and their values in a dict:
```python
# set the filter options
option = {
    'cats_ids': '',  # category; set to the ID of the category you want
    # 'cats_ids': '1',  # example
    'license': '',  # license; set to the ID of the license you want
    # 'license': '1',  # example
    'ratio': '',  # aspect ratio
    # 'ratio': str(round(16/9, 2)),  # example: 16:9
    # 'resolution_mode': '',  # resolution filter mode
    'resolution_mode': ':',  # "at least"
    # 'resolution_mode': ':=:',  # "exactly"
    # 'resolution': '',  # resolution
    'resolution': '2560x1440',  # example: 2560x1440
    'color': '',
    # 'color': '#80c0e0',  # color; set to the color code you want
    'order': '',  # sort order; change as needed
}
```
When generating the URL from the dict, two details deserve attention:
- The resolution filter has two modes, "exactly" and "at least"; in the URL they differ only in the operator between resolution and its value. Compare https://www.wallpaperup.com/search/results/resolution:2560x1440+order:date_added/1 with https://www.wallpaperup.com/search/results/resolution:=:2560x1440+order:date_added/1.
- The aspect ratio uses the computed quotient: for example, 16:10 becomes 1.6.
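The ratio values are easy to reproduce; a quick check of the rounding used here:

```python
# aspect ratios are passed as the width/height quotient, rounded to two decimals
print(str(round(16 / 9, 2)))   # 1.78, as seen in the results URL above
print(str(round(16 / 10, 2)))  # 1.6
```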
Generate the search-results URL from the dict:
```python
url = 'https://www.wallpaperup.com/search/results/'
for key in option.keys():
    if option[key] == '' or key == 'resolution_mode':
        continue  # skip empty filters; resolution_mode is folded into the resolution term
    if key == 'resolution':
        url += key + option['resolution_mode'] + option[key]
    else:
        url += key + ':' + option[key]
    url += '+'
url = url[:-1] + '/' + str(page_num)  # page_num: index of the results page to fetch
```
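As a sanity check, the same logic can be wrapped in a function (build_search_url is my name, not the site's) and run against a dict that sets only resolution and order; under the URL format deduced earlier it should reproduce the results URL from the comparison above:

```python
def build_search_url(option, page_num):
    """Join non-empty filters as key:value pairs and append the page number."""
    url = 'https://www.wallpaperup.com/search/results/'
    for key, value in option.items():
        if value == '' or key == 'resolution_mode':
            continue  # empty filters are dropped; resolution_mode is used below
        if key == 'resolution':
            url += key + option['resolution_mode'] + value
        else:
            url += key + ':' + value
        url += '+'
    return url[:-1] + '/' + str(page_num)

print(build_search_url(
    {'resolution_mode': ':', 'resolution': '2560x1440', 'order': 'date_added'}, 1))
# https://www.wallpaperup.com/search/results/resolution:2560x1440+order:date_added/1
```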
Scraping the images
For the first strategy, parse out each data-src and apply a regex substitution to remove the size suffix; nothing more to it.
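For illustration, the substitution could look like this; that the suffix-free URL always points at the original file is the assumption strategy 1 rests on:

```python
import re

data_src = ('https://www.wallpaperup.com/uploads/wallpapers/'
            '2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-187.jpg')
# drop the trailing "-<number>" size marker in front of the extension
full_url = re.sub(r'-\d+(?=\.jpg$)', '', data_src)
print(full_url)
```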
For the second strategy, here is the code that collects the detail-page URL of every search result:
```python
import re

import requests
from bs4 import BeautifulSoup

# headers: your request headers (e.g. a browser User-Agent), assumed defined
r_url = requests.get(url, headers=headers)
soup_a = BeautifulSoup(r_url.text, 'lxml')
r_url.close()  # close the connection so later requests are not refused

wallpaper_pages = []
for link in soup_a.find_all(attrs={'title': 'View wallpaper'}):
    # each match is the <a> wrapping a thumbnail; its href is the detail page
    wallpaper_pages.append('https://www.wallpaperup.com' + str(link.attrs['href']))
```
Get data-original from each detail page:
```python
wallpaper_images_link = []
for page in wallpaper_pages:
    r_page = requests.get(page, headers=headers)
    soup_page = BeautifulSoup(r_page.text, 'lxml')
    r_page.close()
    # the full-resolution URL sits in the data-original attribute of the
    # <img> inside the "thumb-wrp" element
    thumb = soup_page.find_all(attrs={'class': 'thumb-wrp'})[0]
    wallpaper_images_link.append(str(thumb.img.attrs['data-original']))
```
Download the images to disk:
```python
for image_link in wallpaper_images_link:
    r_image_link = requests.get(image_link, headers=headers)
    # build a filename such as "20120805_9024.jpg" from the date/ID part of the URL
    filename = str(re.compile(r"\d{4}/\d{2}/\d{2}/\d+").findall(
        image_link)[0]).replace("/", "", 2).replace("/", "_") + ".jpg"
    print("filename=", filename)
    with open(wallpapers_folder + filename, 'wb+') as f:  # wallpapers_folder: target directory
        f.write(r_image_link.content)
    r_image_link.close()
```
At this point we can fetch every image on one search-results page; wrapping the whole thing in one more loop scrapes as many pages as we like.
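That outer loop might be sketched as follows; scrape_page is a hypothetical stand-in for the per-page steps above (collect detail pages, resolve data-original, download):

```python
def page_url(search_url, page_num):
    # page x of the results lives at ${search_url}/x, per the URL analysis
    return f"{search_url}/{page_num}"

def scrape_all(search_url, pages, scrape_page):
    # scrape_page: callable implementing the per-page steps shown earlier
    for page_num in range(1, pages + 1):
        scrape_page(page_url(search_url, page_num))
```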
Environment
python version: python3
dependencies: bs4, requests, lxml