Today we'll scrape the free resume templates from the ChinaZ material site (站长素材).

First, a quick outline of the flow: fetch the response data (I won't dwell on that part), then parse it; next, parse each detail page a second time to extract the download address; finally, save the data. Talking about it only goes so far, so let's get into the code and walk through it hands-on.
```python
import requests
import os
from lxml import etree

if __name__ == '__main__':
    if not os.path.exists('./jianlisucai'):
        os.mkdir('./jianlisucai')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36',
    }
    url = 'http://sc.chinaz.com/jianli/free_%d.html'
    # Pagination: page 1 has its own URL, later pages follow the template above
    for pageNum in range(1, 5):
        if pageNum == 1:
            new_url = 'http://sc.chinaz.com/jianli/free.html'
        else:
            new_url = url % pageNum
        # Fetch the list page
        response = requests.get(url=new_url, headers=headers)
        response.encoding = 'utf-8'
        page_text = response.text
        # Build the etree object for XPath parsing
        tree = etree.HTML(page_text)
        # Locate the div holding each resume entry (//* would match globally)
        div_list = tree.xpath('//div[@id="container"]/div')
        for div in div_list:
            # Detail-page URL for this resume
            detail_url = div.xpath('./a/@href')[0]
            # File name for the saved archive
            page_name = div.xpath('./a/img/@alt')[0] + '.rar'
            # Fetch and parse the detail page
            detail_response = requests.get(url=detail_url, headers=headers)
            detail_response.encoding = 'utf-8'
            detail_tree = etree.HTML(detail_response.text)
            # Pick download mirror 1, i.e. li[1]
            download_url = detail_tree.xpath('//div[@id="down"]/div[2]/ul/li[1]/a/@href')[0]
            # The archive is binary, so use .content rather than .text
            download_data = requests.get(url=download_url, headers=headers).content
            filepath = './jianlisucai/' + page_name
            with open(filepath, 'wb') as fp:
                fp.write(download_data)
            print(page_name, 'downloaded successfully!')
```
You can verify the output yourself. Below are a few errors I ran into while writing this.
1. Wrong request header

headers = { 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36', }

The header name must be written exactly as User-Agent. If you replace the hyphen with a space, the server rejects the request with: Error 400. The request has an invalid header name.
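A minimal sketch of the difference (the `headers_bad` dict is purely illustrative, never send it):

```python
# Correct: the header name keeps its hyphen.
headers_ok = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36',
}

# Incorrect: a space inside the header name makes it invalid,
# and the server answers with the 400 error quoted above.
headers_bad = {
    'User Agent': 'Mozilla/5.0 ...',
}
```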
2. Pagination

When paginating, you must handle the first page's URL separately from the later pages.
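The branching above can be pulled into a small helper (a sketch; `page_url` is a name I made up for illustration):

```python
def page_url(page_num):
    # Page 1 lives at free.html; pages 2+ follow the free_%d.html template.
    if page_num == 1:
        return 'http://sc.chinaz.com/jianli/free.html'
    return 'http://sc.chinaz.com/jianli/free_%d.html' % page_num

for n in range(1, 5):
    print(page_url(n))
```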
3. Decoding issues

Two ways to fix garbled Chinese text:

- img_name.encode('iso-8859-1').decode('gbk')
- response.encoding = 'utf-8'

The two approaches are interchangeable; I usually use the second.
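The first method works because the bytes were decoded with the wrong codec; re-encoding with that same codec recovers the raw bytes, which can then be decoded correctly. A small demo with made-up data (I use utf-8 as the target codec here; swap in gbk if that's what the page actually uses):

```python
# Bytes as the server actually sent them (utf-8 in this demo).
raw = '简历模板'.encode('utf-8')

# requests guessed iso-8859-1, producing mojibake.
wrong = raw.decode('iso-8859-1')

# Round-trip: re-encode with the wrong codec, decode with the right one.
fixed = wrong.encode('iso-8859-1').decode('utf-8')
print(fixed)  # 简历模板
```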