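How to Strip Page Noise and Extract Data (01): Qunar.com
1. Requirements
- Today's goal is to scrape data from Qunar.com (去哪儿网). This data is very valuable, so it is tightly protected: the raw data is hard to obtain, and the rendered data is heavily obfuscated as well.
- The difficulty is high, but for a spider spirit with a thousand years of training, there is no data that cannot be crawled.
- Watch how I spin my web and catch my prey... er, no, scrape the data...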
2. Environment
- Python 3.6.1
- OS: Windows 7
- IDE: PyCharm
- Chrome browser installed
- chromedriver configured (added to the PATH environment variable)
- selenium 3.7.0
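3. Site Analysis
3.1. Analyzing the Page Requests
- Inspecting the requests shows that the page itself contains very little code; almost all of the data comes from ajax requests.
- Now look at the JSON returned by those ajax requests: the price cannot be found in the JSON of any of them, which means the information is buried very deep...
- But don't lose heart and don't cry. We still have one last trump card, a peerless sword: selenium. Once it is drawn, it lays waste to heaven and earth.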
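The check described above (looking for the displayed price in every JSON response) can be done mechanically rather than by eye. The following is a minimal sketch under stated assumptions: `find_value_paths` is a hypothetical helper name, and `sample_response` is an invented payload for illustration, not a real Qunar response.

```python
import json


def find_value_paths(node, target, path="$"):
    """Return the path of every leaf whose string form equals str(target)."""
    hits = []
    if isinstance(node, dict):
        for key, child in node.items():
            hits.extend(find_value_paths(child, target, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            hits.extend(find_value_paths(child, target, f"{path}[{index}]"))
    elif str(node) == str(target):
        hits.append(path)
    return hits


# Hypothetical payload, invented for illustration only.
sample_response = json.loads(
    '{"flights": [{"code": "XX1234", "stops": 0, "tag": "a1b2"}]}'
)

print(find_value_paths(sample_response, 3028))      # [] : the price is not in the payload
print(find_value_paths(sample_response, "XX1234"))  # ['$.flights[0].code']
```

If every captured response yields an empty list for the price shown on the page, the price is probably not delivered in plain form, which is the situation that leads to selenium here.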
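3.2. Analyzing the Price Data Itself
- Since we have decided to use selenium to scrape the data, we need to analyze how the data is presented after rendering.
- First, copying: the price cannot be copied correctly. The page shows 3028, but the copied text is 232803, so the data has been obfuscated.
- Second, inspecting the element to see how the obfuscation works: as shown in the figure, the strategy is to place 4 digits at fixed positions and then replace the digits at two of those positions with other digits by covering them (the new digits are stacked on top, so the ones underneath are invisible). Copying therefore picks up every digit, 232803, while the user only sees 3028.
- The obfuscation process is as follows:
- From the analysis above: for a four-digit ticket price, the page first renders four i tags, then uses two b tags, absolutely positioned with an offset, to cover the i tags that deliberately show wrong digits, so that the correct price appears visually... Since we know how the clothes were put on, taking this coat off is naturally easy.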
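The covering logic can be sketched without a browser. This is only an illustration under stated assumptions: `visible_price` is a hypothetical helper, and every digit is assumed to occupy a slot of equal width. The 18-pixel slot width and the sample digits are taken from the first and sixth records of the output in section 5.

```python
def visible_price(base_digits, overlays, slot_width=18):
    """Rebuild the price a user sees.

    base_digits: digits of the underlying i tags, left to right
    overlays:    (x_offset_in_pixels, digit) for each covering b tag
    """
    slots = list(base_digits)
    for x_offset, digit in overlays:
        # An overlay hides whatever digit sits in the slot it lands on.
        slots[x_offset // slot_width] = digit
    return int("".join(slots))


# First record in section 5: base digits 0 6 0 4, four overlays.
print(visible_price(["0", "6", "0", "4"],
                    [(18, "0"), (54, "8"), (0, "3"), (36, "2")]))  # 3028

# Sixth record: a three-digit price with a single overlay.
print(visible_price(["9", "7", "7"], [(36, "0")]))  # 970
```

The real script below works on pixel columns instead of slot indices, so it does not need to know the slot width in advance.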
4. Implementation
from selenium import webdriver
import time


def handlePrice(priceSelector):
    """Recover the visible price from the obfuscated price element."""
    # Left edge and width (in pixels) of the whole price element.
    xStart = priceSelector.location.get("x")
    finalWidth = int(priceSelector.size.get("width"))
    # One entry per pixel column; -1 means "no digit starts at this column".
    originalArray = [-1 for i in range(0, finalWidth + 1)]
    print(f"len = {len(originalArray)}")
    firstLevelLst = priceSelector.find_elements_by_xpath(".//b")
    for firstLevel in firstLevelLst:
        secondLevelLst = firstLevel.find_elements_by_xpath("./i")
        if secondLevelLst:
            # A b tag that contains i tags holds the underlying (partly wrong) digits.
            for secondLevel in secondLevelLst:
                xStartSecond = int(secondLevel.location.get("x") - xStart)
                xWidthSecond = int(secondLevel.size.get("width"))
                originalArray[xStartSecond] = secondLevel.text
                # Clear the remaining columns of this digit's slot.
                for i in range(xStartSecond + 1, xStartSecond + xWidthSecond):
                    originalArray[i] = -1
                print(f"secondX = {xStartSecond}, secondLevel = {secondLevel.text}, xWidthSecond = {xWidthSecond}")
        else:
            # A b tag without i children is an overlay: it replaces the digit
            # stored at the column where it is positioned.
            xStartFirst = int(firstLevel.location.get("x") - xStart)
            xWidthFirst = int(firstLevel.size.get("width"))
            originalArray[xStartFirst] = firstLevel.text
            for i in range(xStartFirst + 1, xStartFirst + xWidthFirst):
                originalArray[i] = -1
            print(f"firstX = {xStartFirst}, firstLevel = {firstLevel.text}, xWidthFirst = {xWidthFirst}")
    # Read the surviving digits from left to right.
    finalPrice = ""
    for elem in originalArray:
        if elem != -1:
            print(f"elem = {elem}", end=', ')
            finalPrice += elem.strip()
    return int(finalPrice) if finalPrice != "" else 0


if __name__ == "__main__":
    chrome_options = webdriver.ChromeOptions()
    # Load the XPath Helper extension (a path on the author's machine).
    extension_path = 'D:/extension/XPath-Helper_v2.0.2.crx'
    chrome_options.add_extension(extension_path)
    # `chrome_options=` is the keyword argument used by selenium 3.x.
    browser = webdriver.Chrome(chrome_options=chrome_options)
    browser.maximize_window()
    qunaerUrl = "https://flight.qunar.com/site/oneway_list.htm?searchDepartureAirport=%E6%8B%89%E8%90%A8&searchArrivalAirport=%E6%B7%B1%E5%9C%B3&searchDepartureTime=2018-05-13&searchArrivalTime=2018-05-18&nextNDays=0&startSearch=true&fromCode=LXA&toCode=SZX&from=near_flight&lowestPrice=null"
    browser.get(qunaerUrl)
    # Give the ajax requests time to finish and the page time to render.
    time.sleep(10)
    allAirLst = browser.find_elements_by_xpath("//div[@class='b-airfly']")
    for airInfo in allAirLst:
        name = airInfo.find_element_by_xpath(".//div[@class='air']/span").text
        startTime = airInfo.find_element_by_xpath(".//div[@class='sep-lf']/h2").text
        endTime = airInfo.find_element_by_xpath(".//div[@class='sep-rt']/h2").text
        priceSelector = airInfo.find_element_by_xpath(".//span[@class='prc_wp']")
        finalPrice = handlePrice(priceSelector)
        print(f"\n####{name} {startTime} {endTime} price:{finalPrice}")
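The search URL in the script is hard-coded, with the city names already percent-encoded (拉萨 and 深圳). As a small sketch, the same URL can be assembled from its parts with the standard library. `build_search_url` is a hypothetical helper; the parameter names are copied from the URL above, and whether the site accepts other values for them is an assumption that has not been verified here.

```python
from urllib.parse import urlencode

BASE_URL = "https://flight.qunar.com/site/oneway_list.htm"


def build_search_url(from_city, to_city, from_code, to_code, depart_date, arrive_date):
    """Assemble the one-way search URL; urlencode percent-encodes the city names as UTF-8."""
    params = [
        ("searchDepartureAirport", from_city),
        ("searchArrivalAirport", to_city),
        ("searchDepartureTime", depart_date),
        ("searchArrivalTime", arrive_date),
        ("nextNDays", 0),
        ("startSearch", "true"),
        ("fromCode", from_code),
        ("toCode", to_code),
        ("from", "near_flight"),
        ("lowestPrice", "null"),
    ]
    return BASE_URL + "?" + urlencode(params)


url = build_search_url("拉萨", "深圳", "LXA", "SZX", "2018-05-13", "2018-05-18")
print(url)
```

With these arguments the result is character-for-character the same as the `qunaerUrl` string in the script, so it can be passed to `browser.get` unchanged.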
5. Output
E:\Miniconda\python.exe E:/PyCharmCode/myDocument/qunaerwang.py
len = 73
secondX = 0, secondLevel = 0, xWidthSecond = 18
secondX = 18, secondLevel = 6, xWidthSecond = 18
secondX = 36, secondLevel = 0, xWidthSecond = 18
secondX = 54, secondLevel = 4, xWidthSecond = 18
firstX = 18, firstLevel = 0, xWidthFirst = 18
firstX = 54, firstLevel = 8, xWidthFirst = 18
firstX = 0, firstLevel = 3, xWidthFirst = 18
firstX = 36, firstLevel = 2, xWidthFirst = 18
elem = 3, elem = 0, elem = 2, elem = 8,
len = 73
secondX = 0, secondLevel = 5, xWidthSecond = 18
secondX = 18, secondLevel = 1, xWidthSecond = 18
secondX = 36, secondLevel = 0, xWidthSecond = 18
secondX = 54, secondLevel = 6, xWidthSecond = 18
firstX = 0, firstLevel = 3, xWidthFirst = 18
firstX = 54, firstLevel = 0, xWidthFirst = 18
firstX = 18, firstLevel = 1, xWidthFirst = 18
elem = 3, elem = 1, elem = 0, elem = 0,
len = 73
secondX = 0, secondLevel = 6, xWidthSecond = 18
secondX = 18, secondLevel = 9, xWidthSecond = 18
secondX = 36, secondLevel = 6, xWidthSecond = 18
secondX = 54, secondLevel = 0, xWidthSecond = 18
firstX = 36, firstLevel = 0, xWidthFirst = 18
firstX = 18, firstLevel = 1, xWidthFirst = 18
firstX = 0, firstLevel = 3, xWidthFirst = 18
elem = 3, elem = 1, elem = 0, elem = 0,
len = 73
secondX = 0, secondLevel = 4, xWidthSecond = 18
secondX = 18, secondLevel = 1, xWidthSecond = 18
secondX = 36, secondLevel = 0, xWidthSecond = 18
secondX = 54, secondLevel = 4, xWidthSecond = 18
firstX = 54, firstLevel = 0, xWidthFirst = 18
firstX = 0, firstLevel = 3, xWidthFirst = 18
elem = 3, elem = 1, elem = 0, elem = 0,
len = 73
secondX = 0, secondLevel = 1, xWidthSecond = 18
secondX = 18, secondLevel = 9, xWidthSecond = 18
secondX = 36, secondLevel = 0, xWidthSecond = 18
secondX = 54, secondLevel = 1, xWidthSecond = 18
firstX = 54, firstLevel = 9, xWidthFirst = 18
firstX = 18, firstLevel = 4, xWidthFirst = 18
firstX = 36, firstLevel = 0, xWidthFirst = 18
elem = 1, elem = 4, elem = 0, elem = 9,
len = 55
secondX = 0, secondLevel = 9, xWidthSecond = 18
secondX = 18, secondLevel = 7, xWidthSecond = 18
secondX = 36, secondLevel = 7, xWidthSecond = 18
firstX = 36, firstLevel = 0, xWidthFirst = 18
elem = 9, elem = 7, elem = 0,
len = 55
secondX = 0, secondLevel = 1, xWidthSecond = 18
secondX = 18, secondLevel = 7, xWidthSecond = 18
secondX = 36, secondLevel = 5, xWidthSecond = 18
firstX = 36, firstLevel = 0, xWidthFirst = 18
firstX = 0, firstLevel = 9, xWidthFirst = 18
firstX = 18, firstLevel = 2, xWidthFirst = 18
elem = 9, elem = 2, elem = 0,
len = 73
secondX = 0, secondLevel = 1, xWidthSecond = 18
secondX = 18, secondLevel = 9, xWidthSecond = 18
secondX = 36, secondLevel = 5, xWidthSecond = 18
secondX = 54, secondLevel = 5, xWidthSecond = 18
firstX = 18, firstLevel = 1, xWidthFirst = 18
firstX = 36, firstLevel = 3, xWidthFirst = 18
firstX = 54, firstLevel = 2, xWidthFirst = 18
elem = 1, elem = 1, elem = 3, elem = 2,
len = 73
secondX = 0, secondLevel = 7, xWidthSecond = 18
secondX = 18, secondLevel = 0, xWidthSecond = 18
secondX = 36, secondLevel = 7, xWidthSecond = 18
secondX = 54, secondLevel = 5, xWidthSecond = 18
firstX = 0, firstLevel = 1, xWidthFirst = 18
firstX = 36, firstLevel = 3, xWidthFirst = 18
firstX = 54, firstLevel = 7, xWidthFirst = 18
elem = 1, elem = 0, elem = 3, elem = 7,