1. Scraping high-resolution images from 58pic (千图网)
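```python
import re
import urllib.error
import urllib.request

for i in range(1, 10):
    # Listing page i of the 58pic category
    pageurl = 'https://www.58pic.com/piccate/3-156-909-se1-p' + str(i) + '.html'
    data = urllib.request.urlopen(pageurl).read().decode("utf-8", "ignore")
    # Extract the detail-page links with a regular expression (dots escaped so they match literally)
    pat = r'(//www\.58pic\.com/newpic/.*?\.html)'
    imglist = re.compile(pat).findall(data)
    print(imglist)
```
Output for one page: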
['//www.58pic.com/newpic/34666756.html', '//www.58pic.com/newpic/34664475.html', '//www.58pic.com/newpic/34664471.html', '//www.58pic.com/newpic/34664397.html', '//www.58pic.com/newpic/34664383.html', '//www.58pic.com/newpic/34663375.html', '//www.58pic.com/newpic/34663183.html', '//www.58pic.com/newpic/34662278.html', '//www.58pic.com/newpic/34480033.html', '//www.58pic.com/newpic/34479938.html', '//www.58pic.com/newpic/34479937.html', '//www.58pic.com/newpic/34479855.html', '//www.58pic.com/newpic/34479854.html', '//www.58pic.com/newpic/34479549.html', '//www.58pic.com/newpic/34479548.html', '//www.58pic.com/newpic/34479381.html', '//www.58pic.com/newpic/34479010.html', '//www.58pic.com/newpic/34478964.html', '//www.58pic.com/newpic/34478963.html', '//www.58pic.com/newpic/34432574.html', '//www.58pic.com/newpic/34432554.html', '//www.58pic.com/newpic/34432517.html', '//www.58pic.com/newpic/34426270.html', '//www.58pic.com/newpic/34426034.html', '//www.58pic.com/newpic/34425959.html', '//www.58pic.com/newpic/34425710.html', '//www.58pic.com/newpic/34425658.html', '//www.58pic.com/newpic/34425570.html', '//www.58pic.com/newpic/34425469.html', '//www.58pic.com/newpic/34425122.html', '//www.58pic.com/newpic/34424954.html', '//www.58pic.com/newpic/34424934.html', '//www.58pic.com/newpic/34424029.html', '//www.58pic.com/newpic/34424028.html', '//www.58pic.com/newpic/34423912.html']
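
The links in this output are protocol-relative (they start with `//`), and `urllib.request` raises an error for a URL that has no scheme. A minimal sketch of turning them into absolute URLs with `urllib.parse.urljoin` before requesting the detail pages; the helper name `to_absolute` and the `BASE_URL` constant are assumptions for illustration, not part of the original script.

```python
from urllib.parse import urljoin

# Base URL used to resolve the links; the https scheme is taken from it.
BASE_URL = "https://www.58pic.com/"

def to_absolute(links, base=BASE_URL):
    """Resolve protocol-relative (or relative) links against the base URL."""
    return [urljoin(base, link) for link in links]

# Two links copied from the output shown above.
sample = ['//www.58pic.com/newpic/34666756.html',
          '//www.58pic.com/newpic/34664475.html']
absolute = to_absolute(sample)
print(absolute)
```
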
```python
# Download step for each page, kept disabled inside a triple-quoted string.
# When enabled, it runs inside the `for i` loop, after imglist has been built.
'''
    for j in range(0, len(imglist)):
        try:
            thisimg = imglist[j] + "/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a0"
            # The small part that the site forcibly crops off:
            # thisimg = imglist[j] + "/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a1024"
            file = "F:/jupyterpycodes/python_pachongfenxi/result/" + str(i) + str(j) + ".jpg"
            urllib.request.urlretrieve(thisimg, filename=file)
            print("Page " + str(i) + ", image " + str(j) + " downloaded successfully")
        except urllib.error.URLError as e:
            if hasattr(e, "code"):
                print(e.code)
            if hasattr(e, "reason"):
                print(e.reason)
        except Exception as e:
            print(e)
'''
```
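2. Packet capture analysis
Packet capture means intercepting the data packets that are sent and received over the network. When writing a crawler, the data you want is not necessarily in the HTML source; it may well be served from other URLs. To scrape such data, capture the traffic, work out which URL actually carries the data, then identify the pattern in those URLs and crawl them.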
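
As a rough illustration of that last step, suppose packet capture shows that a page loads its data from a JSON endpoint whose URL differs only in a page number. The sketch below builds the URL for each page and pulls fields out of a response body. The endpoint `https://example.com/api/list`, the `page` parameter, and the `items` and `title` field names are all hypothetical; substitute whatever the capture actually reveals.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint discovered through packet capture (an assumption).
API_URL = "https://example.com/api/list"

def page_url(page):
    """Build the request URL for one page by varying the query parameter."""
    return API_URL + "?" + urlencode({"page": page})

def extract_titles(body):
    """Parse a JSON response body and return the title of each item."""
    payload = json.loads(body)
    return [item["title"] for item in payload.get("items", [])]

# A made-up response body standing in for what such an endpoint might return.
sample_body = '{"items": [{"title": "a"}, {"title": "b"}]}'
print(page_url(2))
print(extract_titles(sample_body))
```

In a real crawler the body would come from `urllib.request.urlopen(page_url(n)).read().decode("utf-8", "ignore")`, the same call pattern used in section 1.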
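3. Packet capture analysis with Fiddler
(For scraping data that is not in the page source.) By default, Fiddler can capture only HTTP traffic, not HTTPS traffic. To capture HTTPS traffic, Fiddler has to be configured accordingly.
Reference: https://ask.hellobi.com/blog/weiwei/5159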
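
To see the crawler's own requests in Fiddler, one option is to route `urllib` through Fiddler's proxy. A minimal sketch, assuming Fiddler is listening on its default address `127.0.0.1:8888`; the function name `build_fiddler_opener` is illustrative. For HTTPS requests, Python would also need to trust Fiddler's root certificate once HTTPS capture is configured, which this sketch does not cover.

```python
import urllib.request

# Assumption: Fiddler runs locally on its default port, 8888.
FIDDLER_PROXY = "127.0.0.1:8888"

def build_fiddler_opener(proxy=FIDDLER_PROXY):
    """Return an opener that sends HTTP and HTTPS requests through the proxy."""
    handler = urllib.request.ProxyHandler({
        "http": "http://" + proxy,
        "https": "http://" + proxy,
    })
    return urllib.request.build_opener(handler)

opener = build_fiddler_opener()
# With Fiddler running, a request made through the opener should appear
# in its session list, for example:
# data = opener.open("http://www.58pic.com/").read()
```
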