Even with time.sleep() pauses between requests, many sites will still block a crawler, so proxies become necessary. A frequently recommended source is Xici (西刺) proxy. This post writes a simple crawler that scrapes IP:port pairs from Xici's domestic high-anonymity list; once the addresses are collected, they can be used to build a proxy pool in your crawler, rotating through different proxies on each request to avoid being blocked. The code:
from bs4 import BeautifulSoup            # parse the page
from fake_useragent import UserAgent     # random User-Agent strings
import urllib.request, csv

# Fetch the Xici domestic high-anonymity list and build the proxy pool.
def get_proxy():
    # Open a URL with a randomised User-Agent header.
    def header(website):
        ua = UserAgent()
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-Agent', ua.random)]
        return opener.open(website).read()

    # Download and parse the listing page.
    proxy_api = 'http://www.xicidaili.com/nn'
    data = header(proxy_api).decode('utf-8')
    data_soup = BeautifulSoup(data, 'lxml')

    # Data rows alternate between class "odd" and no class, so take every
    # row of the ip_list table and skip the header row.
    rows = data_soup.select('#ip_list tr')[1:]

    # Extract ip and port (~100 entries per page).
    ip, port = [], []
    for row in rows:
        cells = [c for c in row.get_text().strip().split('\n') if c]
        ip.append(cells[0])
        port.append(cells[1])

    if len(ip) != len(port):
        print('ip and port lists differ in length!')
        return []
    return [':'.join((ip[i], port[i])) for i in range(len(ip))]

proxy = get_proxy()
with open('proxy.csv', 'w', encoding='utf-8-sig', newline='') as f:
    w = csv.writer(f)
    w.writerow(['proxy address + port'])
    for i, p in enumerate(proxy):
        try:
            w.writerow([p])
        except Exception as e:
            print('row', i, p, 'error:', e)
print('written to file successfully!')
The results are written to a CSV file. Written as plain utf-8 the file shows mojibake when opened on Windows, so utf-8-sig is used instead and displays correctly. The output looks like:
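The only difference between the two encodings is a three-byte byte-order mark (BOM) at the start of the file, which Excel on Windows uses to detect UTF-8. A quick check (the file names here are throwaway examples):

```python
import csv, os, tempfile

# Write one CSV row with the given encoding and return the file's first 3 bytes.
def first_bytes(encoding):
    path = os.path.join(tempfile.gettempdir(), 'bom_demo_%s.csv' % encoding)
    with open(path, 'w', encoding=encoding, newline='') as f:
        csv.writer(f).writerow(['proxy address + port'])
    with open(path, 'rb') as f:
        return f.read(3)

print(first_bytes('utf-8'))      # b'pro' -- no BOM, Excel may guess the wrong codec
print(first_bytes('utf-8-sig'))  # b'\xef\xbb\xbf' -- the BOM Excel recognises
```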
proxy address + port
110.73.42.240:8123
118.190.95.35:9001
118.190.95.26:9001
118.190.95.43:9001
122.114.31.177:808
118.114.77.47:8080
221.10.159.234:1337
112.87.103.61:8118
183.159.95.200:41373
110.73.6.50:8123
101.236.60.52:8866
114.231.65.99:18118
114.231.69.209:18118
180.118.243.59:61234
101.236.23.202:8866
114.246.244.131:8118
111.155.116.217:8123
61.135.155.82:443
101.236.60.48:8866
111.155.116.249:8123
117.86.16.50:18118
101.236.21.22:8866
122.246.49.130:8010
111.155.116.234:8123
117.86.9.188:18118
182.114.129.61:37152
106.56.102.20:8070
221.227.251.117:18118
125.120.201.22:6666
121.31.101.113:8123
180.212.26.202:8118
180.118.242.119:808
111.155.116.211:8123
59.62.165.95:53128
115.204.30.147:6666
180.125.137.37:8000
...
Of course, the addresses above are not guaranteed to work, so here telnet is used to check each proxy and only the usable ones are stored:

import csv, telnetlib

print("{} proxy addresses in total".format(len(proxy)))
# Overwrite proxy.csv, keeping only the addresses that pass the check.
fp = open('proxy.csv', 'w+', newline='', encoding='utf-8-sig')
writer = csv.writer(fp)
writer.writerow(['proxy address'])   # must be a list; a bare string is written character by character
for p in proxy:
    ip, port = p.split(':')
    try:
        telnetlib.Telnet(ip, port=port, timeout=20)
    except Exception:
        print('{}:{} check failed'.format(ip, port))
    else:
        print('{}:{} check passed'.format(ip, port))
        writer.writerow([p])
fp.close()
print("check finished")

# Reload the usable addresses.
csv_file = csv.reader(open('proxy.csv', 'r', encoding='utf-8-sig'))
print("usable proxy IPs:")
for row in csv_file:
    print(row)
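telnetlib is only being used here to open a TCP connection, and it was removed from the standard library in Python 3.13; a plain socket works everywhere and makes the check easy to parallelise, which matters with a long per-address timeout. A sketch using only the standard library (check_proxy and filter_proxies are illustrative names, not from the code above):

```python
import socket
from concurrent.futures import ThreadPoolExecutor

# Return the address if a TCP connection succeeds within the timeout, else None.
def check_proxy(addr, timeout=5):
    ip, port = addr.split(':')
    try:
        with socket.create_connection((ip, int(port)), timeout=timeout):
            return addr
    except OSError:
        return None

# Check many addresses concurrently and keep only the reachable ones, in order.
def filter_proxies(candidates, workers=20):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [addr for addr in pool.map(check_proxy, candidates) if addr]
```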
Once a pool of usable proxy addresses is available, requests can be issued through a proxy. Example:

# url: the page to crawl; proxy_addr: an address drawn at random from the pool.
def use_proxy(url, proxy_addr):
    req = urllib.request.Request(url)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0")
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(req).read().decode('utf-8', 'ignore')
    return data
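One caveat about the function above: install_opener changes the process-wide default, so every later urllib.request.urlopen call goes through the last-installed proxy. If that side effect is unwanted, a per-call opener avoids it; a sketch (build_proxy_opener and use_proxy_local are my own names for the variant):

```python
import urllib.request

# Build an opener routed through one proxy, without touching the global default.
def build_proxy_opener(proxy_addr):
    handler = urllib.request.ProxyHandler({'http': proxy_addr})
    return urllib.request.build_opener(handler)

def use_proxy_local(url, proxy_addr):
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    opener = build_proxy_opener(proxy_addr)
    return opener.open(req).read().decode('utf-8', 'ignore')
```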
Then call use_proxy with a randomly chosen address for each URL; random.choice avoids the index-out-of-range risk of a hard-coded randint range:

import random
for i in range(len(url)):
    data = use_proxy(url[i], random.choice(proxy))
    ...
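A proxy that passed the check can still die before the request, so in practice it helps to retry with a different address on failure. A sketch, with the fetch function passed in so the rotation logic stays independent of use_proxy (fetch_with_rotation and fake_fetch are hypothetical names):

```python
import random

# Try up to max_tries distinct proxies; re-raise the last error if all fail.
def fetch_with_rotation(url, pool, fetch, max_tries=3):
    last_err = None
    for addr in random.sample(pool, min(max_tries, len(pool))):
        try:
            return fetch(url, addr)
        except Exception as err:
            last_err = err
    raise last_err

# Stand-in fetch for demonstration: only 'c:3' can serve the request.
def fake_fetch(url, addr):
    if addr == 'c:3':
        return 'ok via ' + addr
    raise ConnectionError(addr)
```

random.sample guarantees the retried addresses are distinct, which repeated randint draws do not.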