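Parsing part of a page with XPath is one of the most common ways to extract data from it.
XPath is widely used for analyzing crawled data, so it is well worth learning.
The core strength of XPath is how freely its expressions can be customized: different XPath expressions can be declared and used together in the same place.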
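As a minimal sketch of what "using different expressions in the same place" looks like, the snippet below joins two expressions with the XPath union operator `|` and evaluates them in a single call. The HTML fragment, the class names `hot` and `all`, and the variable names are invented for illustration and are not taken from the real page. Note that the union removes duplicate nodes, not duplicate text, so a city that is linked twice still appears twice.

```python
from lxml import etree

# Hypothetical page fragment: a "hot" list followed by an "all" list.
SAMPLE_HTML = """
<html><body>
<div class="hot"><ul><li><a href="#">Beijing</a></li><li><a href="#">Shanghai</a></li></ul></div>
<div class="all"><ul><li><a href="#">Anshan</a></li><li><a href="#">Beijing</a></li></ul></div>
</body></html>
"""

sample_tree = etree.HTML(SAMPLE_HTML)

HOT_XPATH = '//div[@class="hot"]/ul/li/a'
ALL_XPATH = '//div[@class="all"]/ul/li/a'

# "|" is the XPath union operator: one call evaluates both expressions.
links = sample_tree.xpath(f'{HOT_XPATH} | {ALL_XPATH}')

# Each result is an <a> element; .text holds its link text.
names = [a.text for a in links]
print(names)
```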
The two approaches are compared below.
#!/usr/bin/env python
# encoding: utf-8
"""
@file: 全国城市.py
@time: 2020/2/29 13:06
"""
import requests
from lxml import etree
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/80.0.3987.116 Safari/537.36'
}
url = 'https://www.aqistudy.cn/historydata/'
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
def city_1():
    # Approach 1: run each XPath expression separately and loop twice.
    hot_a_list = tree.xpath('//div[@class="bottom"]/ul/li/a')
    all_city_names = []
    # Names of the hot cities
    for a in hot_a_list:
        hot_city_name = a.xpath('./text()')[0]
        all_city_names.append(hot_city_name)
    # Names of all cities
    city_a_list = tree.xpath('//div[@class="bottom"]/ul/div[2]/li/a')
    for a in city_a_list:
        city_name = a.xpath('./text()')[0]
        all_city_names.append(city_name)
    print(all_city_names, '\n', len(all_city_names))
def city_2():
    # Approach 2: match the hot cities and all cities in a single call.
    # //div[@class="bottom"]/ul/li/a         XPath expression for the hot cities
    # //div[@class="bottom"]/ul/div[2]/li/a  XPath expression for all cities
    # The "|" operator returns the union of both expressions.
    a_list = tree.xpath('//div[@class="bottom"]/ul/li/a | //div[@class="bottom"]/ul/div[2]/li/a')
    all_city_names = []
    for a in a_list:
        city_name = a.xpath('./text()')[0]
        all_city_names.append(city_name)
    print(all_city_names, '\n', len(all_city_names))
if __name__ == '__main__':
    city_1()
    city_2()
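The comparison makes it easy to see that the second approach needs considerably less code. Fewer lines do not automatically make a Python program faster, but here a single XPath call and a single loop replace two of each, so there is less work to do and less code to read and maintain.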
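The same idea can be taken one step further. As a sketch only, the helper below appends `/text()` to each expression before joining them with `|`, so the Python loop and the `[0]` index (which raises `IndexError` when a link has no text) both disappear. The function name `extract_link_texts`, its parameters, and the demo HTML are hypothetical and not part of the original script.

```python
from lxml import etree


def extract_link_texts(page_source, *link_xpaths):
    """Hypothetical helper: union several link expressions and return their texts."""
    if not link_xpaths:
        return []
    tree = etree.HTML(page_source)
    # Select the text nodes directly, then join the expressions with "|".
    union_xpath = ' | '.join(f'{xp}/text()' for xp in link_xpaths)
    # Drop whitespace-only text nodes and trim the rest.
    return [text.strip() for text in tree.xpath(union_xpath) if text.strip()]


# Made-up fragment; the last link is empty on purpose.
demo_html = (
    '<div class="hot"><ul><li><a>Beijing</a></li></ul></div>'
    '<div class="all"><ul><li><a> Anshan </a></li><li><a></a></li></ul></div>'
)
city_names = extract_link_texts(
    demo_html,
    '//div[@class="hot"]/ul/li/a',
    '//div[@class="all"]/ul/li/a',
)
print(city_names)
```

With the real page, the two expressions from `city_2` could be passed in the same way, assuming the page layout still matches them.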