Downloading Baidu Images with the requests Module

1. Introduction

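I found a script on GitHub that crawls Baidu Images given a keyword and the number of images to download. In testing, it did not support Chinese keywords and could download no more than 60 images. The modified version below supports Chinese keywords and can download more images.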

2. Code

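First, install the requests module (for example with pip install requests). The script relies on the web endpoint http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word='+word+'&pn='+str_pn+'&gsm='+str_gsm+'&ct=&ic=0&lm=-1&width=0&height=0, where word, str_pn and str_gsm are variables spliced in by the code. The pn parameter is the index of an image in the Baidu Images results, so images can be crawled by requesting the page repeatedly with an increasing pn.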
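As a minimal sketch of how the request URL described above can be assembled, the snippet below builds it with the standard library so that a Chinese keyword is percent-encoded as UTF-8. The helper name build_flip_url and the constant BASE_URL are hypothetical and introduced only for illustration. The parameter values are copied from the URL above, and the default gsm=80 is simply the value used in this article (its purpose is unknown).

```python
from urllib.parse import urlencode

# Endpoint taken from the article; the helper below is a hypothetical illustration.
BASE_URL = 'http://image.baidu.com/search/flip'


def build_flip_url(word, pn, gsm=80):
    """Return the result-page URL for keyword `word`, starting at image index `pn`."""
    params = [
        ('tn', 'baiduimage'),
        ('ie', 'utf-8'),
        ('word', word),   # urlencode percent-encodes non-ASCII text as UTF-8
        ('pn', pn),       # index of the first image on the requested page
        ('gsm', gsm),     # value copied from the article; meaning unknown
        ('ct', ''),
        ('ic', 0),
        ('lm', -1),
        ('width', 0),
        ('height', 0),
    ]
    return BASE_URL + '?' + urlencode(params)


if __name__ == '__main__':
    # Print the URL for the first page of results for a Chinese keyword.
    print(build_flip_url('猫', 0))
```

The complete script below assembles the same URL by string concatenation. Both approaches should request the same page as long as the keyword is encoded consistently.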

# -*- coding: utf-8 -*-
# Written for Python 3 (the original listing used Python 2 syntax).
import os
import re
import requests


def downloadPic(html, keyword, i):
    """Download every image referenced in one result page and return the updated index."""
    # Each result entry carries the original image address in its "objURL" field
    pic_url = re.findall(r'"objURL":"(.*?)",', html, re.S)
    print('Found images for keyword: ' + keyword + ', starting download...')
    for each in pic_url:
        print('Downloading image ' + str(i + 1) + ', URL: ' + str(each))
        try:
            pic = requests.get(each, timeout=50)
        except requests.RequestException:
            print('[Error] This image could not be downloaded')
            continue
        # Python 3 accepts Unicode file names, so no manual cp936 encoding is needed
        path = os.path.join('pictures', keyword + '_' + str(i) + '.jpg')
        with open(path, 'wb') as fp:
            fp.write(pic.content)
        i += 1
    return i


if __name__ == '__main__':
    word = input('Input keywords:')
    # url = 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word='+word+'&ct=201326592&v=flip'
    pnMax = int(input('Input max pn:'))
    pncount = 0
    gsm = 80  # the purpose of this value is unknown
    str_gsm = str(gsm)
    if not os.path.exists('pictures'):
        os.mkdir('pictures')
    while pncount < pnMax:
        str_pn = str(pncount)
        url = 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word='+word+'&pn='+str_pn+'&gsm='+str_gsm+'&ct=&ic=0&lm=-1&width=0&height=0'
        result = requests.get(url)
        newcount = downloadPic(result.text, word, pncount)
        if newcount == pncount:
            # Nothing was saved from this page; stop instead of looping forever
            break
        pncount = newcount
    print('Download finished')
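To illustrate how the regular expression in the download function picks out image addresses, here is a small self-contained sketch. The sample string is hypothetical: it only imitates the "objURL":"..." fields the script searches for and is not real Baidu output. The names extract_image_urls and OBJ_URL_PATTERN are likewise introduced only for this example.

```python
import re

# Same pattern as in the script: lazily capture everything between
# '"objURL":"' and the next '",'.
OBJ_URL_PATTERN = re.compile(r'"objURL":"(.*?)",', re.S)


def extract_image_urls(html):
    """Return all objURL values found in the page text, in order of appearance."""
    return OBJ_URL_PATTERN.findall(html)


# Hypothetical page fragment used only to exercise the pattern.
sample = (
    '{"thumbURL":"http://example.com/t1.jpg",'
    '"objURL":"http://example.com/a.jpg","fromURL":"x"},'
    '{"objURL":"http://example.com/b.png","fromURL":"y"}'
)

if __name__ == '__main__':
    print(extract_image_urls(sample))
    # prints ['http://example.com/a.jpg', 'http://example.com/b.png']
```

Because the capture group is lazy, each match stops at the first '",' after the address, so neighbouring fields such as thumbURL or fromURL are not swallowed. If the page format changes and no objURL field is present, the function returns an empty list.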

3. Running the Code and Results

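For deep learning or other classification tasks, a crawler like this can be used to collect more samples when the existing data set is too small.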
