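If I remember correctly, back in post 30 I said I would cover deduplicating the scraped data, so that is today's topic. So far I have only worked out one method, which I share here for reference.
Only a small change to the earlier code is needed, as shown below: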
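# This fragment runs inside the crawl loop from the earlier post, which also
# supplies next_href, urllib.request and etree (presumably lxml's).
want = input('是否深入爬取?')  # prompt: "Crawl deeper?"
if want == '是':  # "yes"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 Edg/84.0.522.40'
    }
    request = urllib.request.Request(next_href, headers=headers)
    response = urllib.request.urlopen(request)
    # Decode as gb2312 and skip any bytes that cannot be decoded
    html = response.read().decode('gb2312', 'ignore')
    xpath_road2 = '//a/text()'  # direct text nodes of every <a> element
    select = etree.HTML(html)
    next_page_text = select.xpath(xpath_road2)
    # Deduplicate: a set keeps one copy of each string (original order is lost)
    next_text = list(set(next_page_text))
    # Open the file once in append mode, then write one entry per line
    with open(r'4399深入爬取试验1.txt', 'a', encoding='utf-8') as f:
        for t in next_text:
            f.write(t + '\n')
elif want == '否':  # "no"
    break
else:
    print('输入错误!!')  # "Invalid input!!"
    break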
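Result of the run:
Opening the file gives the output shown in the screenshot below. There is a lot more after this, so I won't show all of it; thanks for your understanding.
Finally, thank you for reading my article. There are bound to be mistakes in it, so please point them out and bear with me.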
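One thing to keep in mind with `list(set(next_page_text))` is that a set does not keep the order in which the link texts appeared on the page, and it treats `'abc'` and `'abc '` as different entries. If the order matters, the following is a minimal sketch of an alternative, not the method used above. The helper name `dedupe_keep_order` and the sample list are my own assumptions for illustration; in the real script the input would be the list returned by `select.xpath(xpath_road2)`.

```python
def dedupe_keep_order(items):
    """Return the unique, non-blank strings from items in first-seen order."""
    seen = set()    # strings already emitted, for O(1) membership checks
    unique = []     # result list, keeps the order of first appearance
    for item in items:
        text = item.strip()          # ignore surrounding whitespace
        if not text or text in seen:
            continue                 # skip blank entries and repeats
        seen.add(text)
        unique.append(text)
    return unique


# Hypothetical sample standing in for the scraped link texts
sample = ['Action', 'Puzzle', 'Action', '  ', 'Puzzle ', 'Racing']
print(dedupe_keep_order(sample))  # ['Action', 'Puzzle', 'Racing']
```

If stripping whitespace is not wanted, `list(dict.fromkeys(items))` also removes repeats while keeping first-seen order on Python 3.7 and later.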