Scraping wallpapers with beautifulsoup4
beautifulsoup4 is commonly used for scraping web pages. I didn't want to install a wallpaper app but still wanted my wallpaper to change automatically, so I wrote this crawler.
Analyzing the target site
On wallpaperup there is a search box, so we can obviously search for every image by submitting an empty query. Below the search box there are several filter options, which will help us analyze the URLs of the search results.
URL analysis
- Searching with no keyword and no filters gives the results URL https://www.wallpaperup.com/most/popular
- Adding filters one at a time: specifying only the resolution gives https://www.wallpaperup.com/resolution/2560/1440, and adding a 16:9 aspect ratio changes it to https://www.wallpaperup.com/search/results/ratio:1.78+resolution:2560x1440. From this we can see that results are sorted by popularity by default, and that a results URL with several filters should take the form https://www.wallpaperup.com/search/results/${key1}:${value1}+${key2}:${value2}.
- The second point is only a guess, so we still need to confirm that the same format also works with a single filter. https://www.wallpaperup.com/search/results/resolution:2560x1440 is indeed accessible.
- The crawler obviously should not stop at one page. Turning to the next page appends "/2" to the URL, so page x of the results should live at ${search_url}/x.
That concludes the first stage, the URL analysis.
Analyzing the content to scrape
- What we want are the images in the search results, so right-click any image and choose Inspect. The complete HTML element for one image is:
<div class="thumb-adv " data-ratio="1.7777777777778" style="width: 513px; height: 288.562px; top: 0px; left: 0px;"><figure class="black"><a href="/9024/Clouds_cityscapes_architecture_buildings_skyscrapers.html" title="View wallpaper" class="thumb-wrp" style="height:0;padding-bottom:56.325301204819%;"><img width="2560" height="1440" class="thumb black lazy " data-src="https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-187.jpg" data-srcset="https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-500.jpg 889w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-375.jpg 667w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-250.jpg 444w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-187.jpg 332w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-125.jpg 222w" alt="Clouds cityscapes architecture buildings skyscrapers wallpaper" data-wid="9024" data-group="gallery" sizes="513px" srcset="https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-500.jpg 889w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-375.jpg 667w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-250.jpg 444w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-187.jpg 332w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-125.jpg 222w" src="https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-187.jpg"></a><figcaption class="attached-bottom on-hover compact"><div class="sections center-y"><div class="section-left"><div class="button manage compact xsmall transparent" 
title="Manage"><i class="icon"></i></div></div><span class="section-center forced" title="Resolution">2560x1440</span><div class="section-right"><div class="favoriter subscriber no-remote-state no-label multiple joined no-separators center-x" data-state="null" data-state-batch-url="/favorite/get_states_batch/9024+42467+248989+15719+218610+12857+104187+52269+100670+667839+234960+172181+917952+19472+104183+156805+243384+54956+232856+9634+67838+189500+172180+675992" data-state-url="/favorite/get_state/9024" data-url="/favorite/do_toggle/9024"><div title="Add to favorites" class="toggle button bordered compact xsmall transparent" data-text-on="Favorite" data-text-off="Favorite" data-icon=""></div><div class="button bordered compact xsmall transparent remote-modal-trigger" data-url="/favorite/move_modal/9024" title="Move favorite"><i class="icon"></i></div></div></div></div></figcaption></figure></div>
and, on the wallpaper's detail page, a similar "thumb-wrp" element whose <img> carries a data-original attribute pointing to the full-resolution file (that second snippet is omitted here).
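Before writing the crawler itself, we can verify offline that bs4 pulls out the two attributes we care about. A minimal sketch, parsing a trimmed copy of the markup above (html.parser is used here so the snippet needs no lxml):

```python
from bs4 import BeautifulSoup

# trimmed copy of the thumbnail element shown above (most attributes omitted)
html = '''
<div class="thumb-adv">
  <figure class="black">
    <a href="/9024/Clouds_cityscapes.html" title="View wallpaper" class="thumb-wrp">
      <img class="thumb black lazy"
           data-src="https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-187.jpg"
           alt="Clouds cityscapes wallpaper">
    </a>
  </figure>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a', attrs={'title': 'View wallpaper'})
print(a['href'])          # detail-page path
print(a.img['data-src'])  # thumbnail URL
```

Both attributes the two strategies below rely on (the detail-page href and the thumbnail data-src) come out of this one element.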
Devising a scraping strategy
The site analysis above suggests two strategies: 1. extract every data-src on the results page and strip the trailing "-<number>" size suffix to obtain the image URL; 2. first record all the detail-page URLs, then read data-original from each detail page.
Code implementation
Generating the URL
Record the filter options and their values in a dict:
```python
# set the filter options
option = {
    'cats_ids': '',  # category; set to the ID of the category you want
    # 'cats_ids': '1',  # example
    'license': '',  # license; set to the ID of the license you want
    # 'license': '1',  # example
    'ratio': '',  # aspect ratio
    # 'ratio': str(round(16/9, 2)),  # example: 16:9
    # 'resolution_mode': '',  # resolution filter mode
    'resolution_mode': ':',  # "at least"
    # 'resolution_mode': ':=:',  # "exactly"
    # 'resolution': '',  # resolution
    'resolution': '2560x1440',  # example: 2560x1440
    'color': '',
    # 'color': '#80c0e0',  # color; set to the color code you want
    'order': '',  # sort order; change as needed
}
```
When generating the URL from the dict, two details deserve attention:
- The resolution filter has two modes, "exactly" and "at least"; in the URL they differ only in the operator between resolution and its value. Compare https://www.wallpaperup.com/search/results/resolution:2560x1440+order:date_added/1 with https://www.wallpaperup.com/search/results/resolution:=:2560x1440+order:date_added/1.
- The aspect ratio uses the computed quotient: for example, 16:10 becomes 1.6.
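The ratio values are easy to reproduce; a quick check of the rounding used here:

```python
# aspect ratios are passed as the width/height quotient, rounded to two decimals
print(str(round(16 / 9, 2)))   # 1.78, as seen in the results URL above
print(str(round(16 / 10, 2)))  # 1.6
```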
Generate the search-results URL from the dict:
```python
url = 'https://www.wallpaperup.com/search/results/'
for key in option.keys():
    if option[key] == '' or key == 'resolution_mode':
        continue  # skip empty filters; resolution_mode is folded into the resolution term
    if key == 'resolution':
        url += key + option['resolution_mode'] + option[key]
    else:
        url += key + ':' + option[key]
    url += '+'
url = url[:-1] + '/' + str(page_num)  # page_num: index of the results page to fetch
```
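As a sanity check, the same logic can be wrapped in a function (build_search_url is my name, not the site's) and run against a dict that sets only resolution and order; under the URL format deduced earlier it should reproduce the results URL from the comparison above:

```python
def build_search_url(option, page_num):
    """Join non-empty filters as key:value pairs and append the page number."""
    url = 'https://www.wallpaperup.com/search/results/'
    for key, value in option.items():
        if value == '' or key == 'resolution_mode':
            continue  # empty filters are dropped; resolution_mode is used below
        if key == 'resolution':
            url += key + option['resolution_mode'] + value
        else:
            url += key + ':' + value
        url += '+'
    return url[:-1] + '/' + str(page_num)

print(build_search_url(
    {'resolution_mode': ':', 'resolution': '2560x1440', 'order': 'date_added'}, 1))
# https://www.wallpaperup.com/search/results/resolution:2560x1440+order:date_added/1
```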
Scraping the images
For the first strategy, parse out each data-src and apply a regex substitution to remove the size suffix; nothing more to it.
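For illustration, the substitution could look like this; that the suffix-free URL always points at the original file is the assumption strategy 1 rests on:

```python
import re

data_src = ('https://www.wallpaperup.com/uploads/wallpapers/'
            '2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-187.jpg')
# drop the trailing "-<number>" size marker in front of the extension
full_url = re.sub(r'-\d+(?=\.jpg$)', '', data_src)
print(full_url)
```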
For the second strategy, here is the code that collects the detail-page URL of every search result:
```python
import re

import requests
from bs4 import BeautifulSoup

# headers: your request headers (e.g. a browser User-Agent), assumed defined
r_url = requests.get(url, headers=headers)
soup_a = BeautifulSoup(r_url.text, 'lxml')
r_url.close()  # close the connection so later requests are not refused

wallpaper_pages = []
for link in soup_a.find_all(attrs={'title': 'View wallpaper'}):
    # each match is the <a> wrapping a thumbnail; its href is the detail page
    wallpaper_pages.append('https://www.wallpaperup.com' + str(link.attrs['href']))
```
Get data-original from each detail page:
```python
wallpaper_images_link = []
for page in wallpaper_pages:
    r_page = requests.get(page, headers=headers)
    soup_page = BeautifulSoup(r_page.text, 'lxml')
    r_page.close()
    # the full-resolution URL sits in the data-original attribute of the
    # <img> inside the "thumb-wrp" element
    thumb = soup_page.find_all(attrs={'class': 'thumb-wrp'})[0]
    wallpaper_images_link.append(str(thumb.img.attrs['data-original']))
```
Download the images to disk:
```python
for image_link in wallpaper_images_link:
    r_image_link = requests.get(image_link, headers=headers)
    # build a filename such as "20120805_9024.jpg" from the date/ID part of the URL
    filename = str(re.compile(r"\d{4}/\d{2}/\d{2}/\d+").findall(
        image_link)[0]).replace("/", "", 2).replace("/", "_") + ".jpg"
    print("filename=", filename)
    with open(wallpapers_folder + filename, 'wb+') as f:  # wallpapers_folder: target directory
        f.write(r_image_link.content)
    r_image_link.close()
```
At this point we can fetch every image on one search-results page; wrapping the whole thing in one more loop scrapes as many pages as we like.
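That outer loop might be sketched as follows; scrape_page is a hypothetical stand-in for the per-page steps above (collect detail pages, resolve data-original, download):

```python
def page_url(search_url, page_num):
    # page x of the results lives at ${search_url}/x, per the URL analysis
    return f"{search_url}/{page_num}"

def scrape_all(search_url, pages, scrape_page):
    # scrape_page: callable implementing the per-page steps shown earlier
    for page_num in range(1, pages + 1):
        scrape_page(page_url(search_url, page_num))
```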
Environment
python version: python3
dependencies: bs4, requests, lxml