Copyright notice: this is an original article by monkey. Please credit the source when reposting! https://blog.csdn.net/weixin_44143222/article/details/86614965
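The approach is as follows:
1. The user enters the URL of the page to crawl for images, using input().
2. Import the re module and use a regular expression to check that the URL is well formed. If it is not, ask for it again.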
ret = re.match(r"(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?", website)
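Steps 1 and 2 can be sketched as a small validation helper. The names `URL_PATTERN`, `is_valid_url` and `prompt_for_url` are made up for this illustration, and the pattern is written as a raw string so that Python passes the backslashes to the regex engine unchanged.

```python
import re

# The URL pattern from step 2, compiled once as a raw string.
URL_PATTERN = re.compile(
    r"(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+"
    r"([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?"
)


def is_valid_url(text):
    """Return True if text starts with something that looks like a URL."""
    return URL_PATTERN.match(text) is not None


def prompt_for_url():
    """Keep asking until the user enters a URL that passes the check."""
    while True:
        website = input("Enter the URL of the page to crawl: ")
        if is_valid_url(website):
            return website
        print("That does not look like a valid URL, please try again.")
```

Note that the pattern requires a scheme and at least one dot in the host name, so a bare `douyu.com` or `http://localhost` is rejected.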
3. Import the urllib.request module, send the request, and read the content of the page.
req = urllib.request.Request(url=website, headers=headers)
web_content = urllib.request.urlopen(req)
content = web_content.read()
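Here is a slightly more defensive sketch of step 3. The helper names `build_request` and `fetch_bytes` and the 10-second `timeout` are my own assumptions, not part of the original script. The User-Agent string is the one used in the full listing of this post.

```python
import urllib.request

# Browser-like User-Agent, copied from the full listing in this post.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36"
}


def build_request(url):
    """Wrap the URL in a Request that carries the custom headers."""
    return urllib.request.Request(url=url, headers=HEADERS)


def fetch_bytes(url, timeout=10):
    """Fetch the page and return its raw bytes (timeout is an assumed value)."""
    # The with-block closes the connection even if read() raises.
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```

Calling `fetch_bytes(website).decode("utf-8")` would then give the HTML text for the next step.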
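4. Use a regular expression to extract the image URLs from the HTML. No single pattern works for every site. The one here is written for Douyu, so change the pattern to suit the site you want to crawl. Demo page: https://www.douyu.com/g_yz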
a = re.findall(r'data-original="(.+?\.jpg)" src=', content.decode("utf-8"))
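The difference between a greedy `.+` and a lazy `.+?` matters when several image tags sit on the same line of HTML. The fragment below is made up for illustration, and the real Douyu markup may differ.

```python
import re

# Made-up HTML fragment with two image tags on a single line.
SAMPLE_HTML = (
    '<img data-original="https://example.com/a.jpg" src="placeholder.png">'
    '<img data-original="https://example.com/b.jpg" src="placeholder.png">'
)

# Greedy: .+ runs to the LAST '.jpg" src=' on the line, swallowing both tags.
greedy = re.findall(r'data-original="(.+\.jpg)" src=', SAMPLE_HTML)

# Lazy: .+? stops at the FIRST '.jpg" src=', giving one match per tag.
lazy = re.findall(r'data-original="(.+?\.jpg)" src=', SAMPLE_HTML)

print(len(greedy))  # 1
print(lazy)         # ['https://example.com/a.jpg', 'https://example.com/b.jpg']
```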
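5. Spawn one download task for each extracted image URL and collect the tasks in a list. I use gevent coroutines here to speed up the crawl.
mylist = list()
for x in a:
    print(x)
    mylist.append(gevent.spawn(downloader, "%s.jpg" % num, x))
    num += 1
gevent.joinall(mylist)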
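For comparison, the same fan-out pattern can be written with the standard library's thread pool instead of gevent. This is a swapped-in technique, not what this post uses. The names `download_all` and `max_workers` are invented here, and `downloader` can be any function that takes a file name and a URL.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, downloader, max_workers=8):
    """Call downloader(file_name, url) for every URL, several at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # enumerate() replaces the manual num counter: 0.jpg, 1.jpg, ...
        futures = [
            pool.submit(downloader, "%s.jpg" % num, url)
            for num, url in enumerate(urls)
        ]
        # result() waits for each task and re-raises any exception it hit.
        return [future.result() for future in futures]
```

With threads there is no monkey patching to worry about, while gevent can handle a very large number of concurrent downloads with less overhead per task.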
6. Finally, a while loop reads each image in 1024-byte chunks and saves it to your computer, and that's it.
def downloader(img_name, img_url):
    req = urllib.request.urlopen(img_url)
    with open(r"C:\Users\monkey\Desktop\%s" % img_name, "wb") as f:
        while True:
            img_content = req.read(1024)
            if img_content:
                f.write(img_content)
            else:
                break
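The chunked read loop can be tried without touching the network by running it on in-memory streams. The helper name `copy_in_chunks` is hypothetical. The standard library's `shutil.copyfileobj` does the same job if you would rather not write the loop yourself.

```python
import io


def copy_in_chunks(source, target, chunk_size=1024):
    """Copy source to target chunk by chunk and return the bytes written."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty bytes object means end of stream
            break
        target.write(chunk)
        total += len(chunk)
    return total


# Usage with in-memory streams standing in for the HTTP response and the file.
src = io.BytesIO(b"x" * 2500)
dst = io.BytesIO()
print(copy_in_chunks(src, dst))  # 2500
```

Reading in chunks keeps memory use flat no matter how large the image is.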
The complete code is as follows:
# gevent recommends patching as early as possible, before other imports,
# so that urllib's sockets cooperate with the coroutines.
from gevent import monkey
monkey.patch_all()

import re
import urllib.request

import gevent


def downloader(img_name, img_url):
    """Download one image and save it under img_name."""
    req = urllib.request.urlopen(img_url)
    # Change this directory to one that exists on your machine.
    with open(r"C:\Users\monkey\Desktop\%s" % img_name, "wb") as f:
        while True:
            img_content = req.read(1024)  # read 1024 bytes at a time
            if img_content:
                f.write(img_content)
            else:
                break


def main():
    num = 0
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/51.0.2704.63 Safari/537.36'}
    while True:
        website = input("Enter the URL of the site to crawl: ")
        ret = re.match(r"(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?", website)
        if ret:
            print("The URL is valid, crawling now, please wait: %s" % ret.group())
            req = urllib.request.Request(url=website, headers=headers)
            web_content = urllib.request.urlopen(req)
            content = web_content.read()
            # Douyu-specific pattern; .+? is lazy so each tag yields one URL.
            a = re.findall(r'data-original="(.+?\.jpg)" src=', content.decode("utf-8"))
            mylist = list()
            for x in a:
                print(x)
                # One greenlet per image, saved as 0.jpg, 1.jpg, ...
                mylist.append(gevent.spawn(downloader, "%s.jpg" % num, x))
                num += 1
            # Wait for all downloads to finish before prompting again.
            gevent.joinall(mylist)


if __name__ == '__main__':
    main()