Python: Getting Started with Web Crawlers (a Regular-Expression Implementation)

The target of this crawl is Baidu Images.

First, the URL:

https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=index&fr=&hs=0&xthttps=111111&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=%E7%A8%8B%E5%BA%8F%E5%91%98&oq=%E7%A8%8B%E5%BA%8F%E5%91%98&rsp=-1
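
The `word` and `oq` parameters in this URL are the search keyword 程序员 ("programmer"), percent-encoded as UTF-8. As a small sketch, the standard library's `urllib.parse.quote` reproduces that encoding; the helper name `encode_keyword` is made up here for illustration and is not part of the original script.

```python
from urllib.parse import quote, unquote

def encode_keyword(keyword):
    """Percent-encode a search keyword as UTF-8 (hypothetical helper)."""
    return quote(keyword, safe="")

# Encode the keyword used in the URL above, then decode it back
encoded = encode_keyword("程序员")
print(encoded)
print(unquote(encoded))
```

Swapping in a different keyword this way would presumably change the search results, assuming the remaining query parameters stay valid.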

Viewing the page source shows that the image URLs are stored inside JS.

BeautifulSoup parses HTML and cannot extract data directly from inside JS.

In this case we can fall back on a more traditional, if harder to understand, approach: regular expressions.

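A regular expression is a rule (a pattern) that text is matched against, and operations are then carried out on whatever matches. The most common operations are search and replace.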
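
To make these two operations concrete, here is a minimal sketch on a made-up string shaped like the JS data in the page. The sample URLs are placeholders, not real Baidu output. `re.findall` with the non-greedy group `(.*?)` captures everything up to the next double quote, and `re.sub` performs a replacement.

```python
import re

# Hypothetical sample mimicking the JS data embedded in the page
sample = '{"objURL":"http://example.com/a.jpg","x":1},{"objURL":"http://example.com/b.png"}'

# Search: capture the text between "objURL":" and the next double quote
urls = re.findall(r'"objURL":"(.*?)"', sample)
print(urls)

# Replace: rewrite the scheme at the start of every captured URL
secure = [re.sub(r'^http://', 'https://', u) for u in urls]
print(secure)
```

Without the `?`, the group `(.*)` would be greedy and run on to the last double quote in the string, swallowing everything in between.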

import re
import os
import time
import requests
from urllib.request import urlretrieve
# Spoof the request headers so the site believes the request comes from a browser
headers={
	'Host':'image.baidu.com',
	'Upgrade-Insecure-Requests':'1',
	'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
# Target URL to crawl
url = "https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=index&fr=&hs=0&xthttps=111111&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=%E7%A8%8B%E5%BA%8F%E5%91%98&oq=%E7%A8%8B%E5%BA%8F%E5%91%98&rsp=-1"
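# Request the URL
response = requests.get(url, headers=headers, timeout=10)
# response.content is bytes, so decode it to str first
html = response.content.decode('utf-8')
# Then match with the regular expression
# (named img_urls rather than list, to avoid shadowing the built-in)
img_urls = re.findall(r'"objURL":"(.*?)"', html)
# Prints 30: only thirty images are loaded initially
print(len(img_urls))
print(img_urls)
# Download the images
# urlretrieve does not create directories, so make sure the folder exists
os.makedirs('images', exist_ok=True)
# The index doubles as the image's file name
for index, img_url in enumerate(img_urls):
	# Catch exceptions: if an image cannot be downloaded, skip it and move on to the next one
	try:
		# Build the download path
		path = os.path.join('images', str(index) + ".jpg")
		# Download
		urlretrieve(img_url, filename=path)
		time.sleep(2)
	except Exception as e:
		# Report the failure instead of silently ignoring it
		print("skipped", img_url, e)
		continue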
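
The download loop saves every file with a `.jpg` extension regardless of its real type. One possible refinement, sketched here under the assumption that the image URL's path ends in a meaningful extension, is to derive the suffix from the URL itself. `guess_extension` is a hypothetical helper, not part of the original script, and the example URLs are placeholders.

```python
import os
from urllib.parse import urlparse

# Extensions we are willing to trust; anything else falls back to the default
KNOWN_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}

def guess_extension(img_url, default=".jpg"):
    """Return the extension found in the URL path, or a default (hypothetical helper)."""
    # urlparse strips the query string, so "?size=big" does not confuse splitext
    ext = os.path.splitext(urlparse(img_url).path)[1].lower()
    return ext if ext in KNOWN_EXTENSIONS else default

print(guess_extension("http://example.com/pics/cat.PNG?size=big"))
print(guess_extension("http://example.com/download?id=42"))
```

In the loop, the path could then be built as `os.path.join('images', str(index) + guess_extension(img_url))`.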
