Crawler Learning, Part 13: Using Proxies

    Even with time.sleep() pauses between requests, many sites will still block a crawler, so proxies are needed. Xici Proxy (西刺代理, xicidaili) is the free source most often recommended online. This post writes a simple crawler to scrape the IP and port of Xicidaili's domestic high-anonymity proxies; with these addresses you can build a proxy pool in your crawler and keep switching proxies between requests to avoid being blocked. The code is as follows:

from bs4 import BeautifulSoup          # parse the HTML
from fake_useragent import UserAgent   # generate a random User-Agent
import urllib.request, csv

# Fetch proxies from Xicidaili: parse its listing page to build a proxy pool
def get_proxy():
    # Open a page with a random User-Agent header
    def header(website):
        ua = UserAgent()
        headers = ("User-Agent", ua.random)
        opener = urllib.request.build_opener()
        opener.addheaders = [headers]
        req = opener.open(website).read()
        return req
    # Fetch the listing page (domestic high-anonymity proxies)
    proxy_api = 'http://www.xicidaili.com/nn'
    data = header(proxy_api).decode('utf-8')
    data_soup = BeautifulSoup(data, 'lxml')
    data_odd = data_soup.select('.odd')         # rows with class "odd"
    data_ = data_soup.select('tr[class=""]')    # the alternating rows have an empty class attribute
    # Parse out the IP pool (about 100 addresses)
    ip, port = [], []
    for row in data_odd + data_:
        data_temp = row.get_text().strip().split('\n')
        while '' in data_temp:
            data_temp.remove('')
        ip.append(data_temp[0])
        port.append(data_temp[1])
    if len(ip) == len(port):
        proxy = [':'.join((ip[i], port[i])) for i in range(len(ip))]
        #print('Proxy IPs and ports fetched successfully!')
        return proxy
    else:
        print('Number of IPs does not match number of ports!')

proxy = get_proxy()

with open('proxy.csv', 'w', encoding='utf-8-sig', newline='') as f:
    w = csv.writer(f)
    w.writerow(['proxy address:port'])

with open('proxy.csv', 'a+', encoding='utf-8-sig') as f:
    w = csv.writer(f, lineterminator='\n')
    for i in range(0, len(proxy)):
        try:
            w.writerow([proxy[i]])
        except Exception as e:
            print('Row', i, proxy[i], 'error:', e, '\n')
print("Proxies written to file!")

The results are written to a CSV file. With plain utf-8 the file shows garbled characters when opened on Windows, so utf-8-sig is used instead, which displays correctly. The results are as follows:

proxy address:port
110.73.42.240:8123
118.190.95.35:9001
118.190.95.26:9001
118.190.95.43:9001
122.114.31.177:808
118.114.77.47:8080
221.10.159.234:1337
112.87.103.61:8118
183.159.95.200:41373
110.73.6.50:8123
101.236.60.52:8866
114.231.65.99:18118
114.231.69.209:18118
180.118.243.59:61234
101.236.23.202:8866
114.246.244.131:8118
111.155.116.217:8123
61.135.155.82:443
101.236.60.48:8866
111.155.116.249:8123
117.86.16.50:18118
101.236.21.22:8866
122.246.49.130:8010
111.155.116.234:8123
117.86.9.188:18118
182.114.129.61:37152
106.56.102.20:8070
221.227.251.117:18118
125.120.201.22:6666
121.31.101.113:8123
180.212.26.202:8118
180.118.242.119:808
111.155.116.211:8123
59.62.165.95:53128
115.204.30.147:6666
180.125.137.37:8000
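
As a side note on the encoding fix above: 'utf-8-sig' differs from plain 'utf-8' only in the three-byte BOM it prepends, which is what lets Excel and Notepad on Windows detect the encoding and display the header correctly. A minimal sketch (the file name bom_demo.csv is just an illustration, not part of the original code):

# Hedged sketch: inspect the BOM that encoding='utf-8-sig' writes
with open('bom_demo.csv', 'w', encoding='utf-8-sig') as f:   # hypothetical file name
    f.write('proxy address:port\n')
with open('bom_demo.csv', 'rb') as f:
    print(f.read(3))   # b'\xef\xbb\xbf' -- the UTF-8 BOM; plain utf-8 writes no such prefix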
#### Check whether each proxy address is actually usable
import telnetlib

# a browser User-Agent (not needed for the raw telnet probe below)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063'
}
print("{} proxy addresses in total".format(len(proxy)))
# overwrite proxy.csv so that only the reachable addresses are kept
fp = open('proxy.csv', 'w+', newline='', encoding='utf-8-sig')
writer = csv.writer(fp)
writer.writerow(['proxy address'])
test_results = []
for i in range(len(proxy)):
    ip, port = proxy[i].split(':')
    try:
        telnetlib.Telnet(ip, port=int(port), timeout=20)
    except Exception:
        print('{}:{} check failed'.format(ip, port))
    else:
        print('{}:{} check passed'.format(ip, port))
        writer.writerow([proxy[i]])
fp.close()
print("Check finished")

## Load the usable addresses
with open('proxy.csv', 'r', encoding='utf-8-sig') as f:
    csv_file = csv.reader(f)
    print("The usable proxy IPs are:")
    for ip in csv_file:
        print(ip)

.......
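
If the surviving addresses are meant to feed the request code further below, it is probably more useful to load them back into a list (skipping the header row) than to just print them; a minimal sketch, reusing the proxy name for the refreshed pool:

# Hedged sketch: rebuild the pool from the checked addresses
with open('proxy.csv', 'r', encoding='utf-8-sig') as f:
    reader = csv.reader(f)
    next(reader)                           # skip the 'proxy address' header row
    proxy = [row[0] for row in reader]     # only the addresses that passed the telnet check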

Of course, the scraped proxy addresses are not all guaranteed to work, which is why the check above uses telnet (via telnetlib) to probe each address and stores only the reachable ones back in proxy.csv.
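
A successful Telnet connection only shows that the port accepts connections; it does not prove the proxy will actually relay HTTP traffic. As an alternative check (my own addition, not from the original post), each address can be tested by routing a real request through it with the requests library; the test URL http://httpbin.org/ip and the 10-second timeout are arbitrary choices:

import requests

# Hedged sketch: HTTP-level proxy check (assumes the requests package is installed)
def http_check(proxy_addr, test_url='http://httpbin.org/ip', timeout=10):
    proxies = {'http': 'http://' + proxy_addr, 'https': 'http://' + proxy_addr}
    try:
        r = requests.get(test_url, proxies=proxies, timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

# keep only the proxies that pass the HTTP-level check:
# good_proxies = [p for p in proxy if http_check(p)]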


Once a pool of usable proxy addresses is available, a Request can be sent through a proxy. Example code:

# url is the URL of the page to crawl; proxy_addr is an address picked at random from the pool.
def use_proxy(url, proxy_addr):
    req = urllib.request.Request(url)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0")
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(req).read().decode('utf-8', 'ignore')
    return data

Then simply call use_proxy with a randomly chosen proxy from the pool:

for i in range(len(url)):
    for data in use_proxy(url[i],proxy[random.randint(3,50)]):.......
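
A slightly more defensive rotation loop, sketched on top of the use_proxy function above (random.choice over the whole pool, the retry count, and the url_list name are my own assumptions, not from the original post):

import random

# Hedged sketch: try several random proxies for one URL before giving up
def fetch_with_rotation(url, proxy_pool, max_retries=3):
    for _ in range(max_retries):
        proxy_addr = random.choice(proxy_pool)
        try:
            return use_proxy(url, proxy_addr)
        except OSError as e:               # URLError/HTTPError are OSError subclasses
            print('proxy {} failed: {}'.format(proxy_addr, e))
    return None

# for u in url_list:                       # url_list: a hypothetical list of target URLs
#     page = fetch_with_rotation(u, proxy)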




Reposted from blog.csdn.net/cskywit/article/details/81018528