This approach suits beginners and is written in a function-oriented style. Since the site hosts a huge number of images, only the 校花 (xiaohua, "campus belle") category is crawled here. Enough talk; let's get started!
First, import the modules and define the main entry point:
import os                 # used to create the save directory
import urllib2
from lxml import etree
if __name__ == "__main__":
    urls = []    # empty list to hold the listing-page URLs of the xiaohua category
    url = "http://www.mm131.com/xiaohua/"
    urls.append(url)
    pn = 2
    while pn < 7:    # the xiaohua category has six pages in total
        page = 'list_2_' + str(pn) + '.html'
        fullurl = url + page    # assemble the full URL
        urls.append(fullurl)
        pn += 1
    for link in urls:
        loadPage(link)
Looking closely at the address shown in the address bar as you page through, a pattern emerges.
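That pattern can be checked in a few lines on its own (a Python 3 sketch; the post itself targets Python 2):

```python
# Build the six listing-page URLs for the xiaohua category.
# Page 1 is the bare category URL; pages 2-6 follow the list_2_N.html pattern.
base = "http://www.mm131.com/xiaohua/"
urls = [base] + [base + "list_2_%d.html" % pn for pn in range(2, 7)]
print(urls)
```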
Next, call the request function on each of these six page URLs:
def loadPage(link):
    # Request the listing page and extract each album's link
    headers = {
        # Request headers that mimic a browser, to avoid getting the IP banned
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.170 Safari/537.36",
    }
    request = urllib2.Request(link, headers=headers)
    html = urllib2.urlopen(request).read()    # send the request, read the response
    content = etree.HTML(html)                # parse the returned HTML page
    # Collect the links to all albums on this page
    link_list = content.xpath('//dl[@class="list-left public-box"]//dd//a[@target="_blank"]/@href')
    for link in link_list:
        morePage(link)
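For readers without lxml installed, the extraction step can be tried against a small inline snippet using the standard library's ElementTree. Its XPath support is limited, so attributes are read with `.get()` rather than the `/@href` form; the markup below is a made-up stand-in for the real listing page:

```python
import xml.etree.ElementTree as ET

# A made-up fragment mimicking the listing structure the XPath above targets.
snippet = """
<dl class="list-left public-box">
  <dd><a target="_blank" href="http://www.mm131.com/xiaohua/2001.html">album 1</a></dd>
  <dd><a target="_blank" href="http://www.mm131.com/xiaohua/2002.html">album 2</a></dd>
</dl>
"""

root = ET.fromstring(snippet)
# ElementTree's mini-XPath: every <a target="_blank"> under a <dd>, then read its href.
links = [a.get("href") for a in root.findall('.//dd/a[@target="_blank"]')]
print(links)
```

lxml accepts the same path plus the richer `/@href` shorthand, which is why the post uses it on the real (messier) HTML.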
Each album contains more than one picture, so next we fetch the rest of them:
def morePage(link):
    # Fetch an album and collect the links to its remaining pages
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.170 Safari/537.36",
    }
    request = urllib2.Request(link, headers=headers)
    html = urllib2.urlopen(request).read()
    u = "http://www.mm131.com/xiaohua/"
    content = etree.HTML(html)    # parse the HTML page
    # Collect the (relative) links to the album's other pages
    link_list = content.xpath('//div[@class="content-page"]//a/@href')
The URLs extracted by XPath here are relative, not complete.
Next, assemble the complete URLs:
    for link in link_list:
        fullurl = u + link
        # print(fullurl)
        loadImg(fullurl)
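Plain concatenation works here because every href is relative to the category root; a more defensive option is urljoin, which also leaves absolute hrefs untouched (Python 3 spelling shown; under Python 2 it lives in the urlparse module). The page names below are illustrative:

```python
from urllib.parse import urljoin

base = "http://www.mm131.com/xiaohua/"
# A relative href is resolved against the base...
print(urljoin(base, "2001_2.html"))
# ...while an already-absolute href passes through instead of being mangled.
print(urljoin(base, "http://www.mm131.com/xiaohua/2001_3.html"))
```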
and call the function that extracts each page's image link:
def loadImg(fullurl):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.170 Safari/537.36",
        "Referer": "http://www.mm131.com/xiaohua/2001.html",    # a Referer is added here to avoid being redirected
    }
    request = urllib2.Request(fullurl, headers=headers)
    html = urllib2.urlopen(request).read()
    content = etree.HTML(html)
    link_list = content.xpath('//div[@class="content-pic"]//a//img/@src')    # the image's URL
    filenames = content.xpath('//div[@class="content"]/h5/text()')           # the image's title
    for link in link_list:    # iterate over the image links
        writeImg(link, filenames)
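All the code in this post is Python 2 (urllib2); under Python 3 the same headers-plus-Request construction moves to urllib.request. A minimal sketch, building the request but never sending it so nothing touches the network:

```python
import urllib.request

url = "http://www.mm131.com/xiaohua/2001.html"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.170 Safari/537.36",
    "Referer": "http://www.mm131.com/xiaohua/2001.html",  # same anti-redirect trick as above
}
request = urllib.request.Request(url, headers=headers)
# urllib.request.urlopen(request).read() would fetch the page, just like urllib2.urlopen.
print(request.get_full_url())
print(request.headers["Referer"])
```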
Finally, save the images:
def writeImg(link, filenames):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36",
        "Referer": "http://www.mm131.com/xiaohua/2001.html",
    }
    request = urllib2.Request(link, headers=headers)
    image = urllib2.urlopen(request).read()
    post = link[-4:]    # the file extension, e.g. ".jpg"
    for filename in filenames:
        name = filename + post
        file_path = "./Girls/"    # the save directory
        if not os.path.exists(file_path):
            os.mkdir(file_path)    # create it on first use
        with open(file_path + name, "wb") as f:
            f.write(image)
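Note that `link[-4:]` assumes every extension is exactly three characters (".jpg", ".png"); `os.path.splitext` handles extensions of any length and is a safer sketch. The image URLs below are made up for illustration:

```python
import os

# splitext splits on the last dot, so it works for .jpg and .jpeg alike.
for link in ["http://example.com/pic/2001/1.jpg",
             "http://example.com/pic/2001/cover.jpeg"]:
    ext = os.path.splitext(link)[1]
    print(ext)
```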
Then run the .py file and you will get all the images, a thousand or so in total. Enjoy.