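I have been learning Python lately, and since it ties in with my previous work, I used some free time to try scraping tracking records from 17track.
The approach: parse a static HTML page (here a page saved locally as 18.html), which takes some front-end knowledge to pick the right selectors.
Third-party package: pyquery
Enough talk, here is the code: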
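#!/usr/bin/env python3
# coding=utf-8
from pyquery import PyQuery as pq


def get_time(d1):
    """Return the date part (YYYY-MM-DD) of every <time> element in d1."""
    l = []
    for data in d1('time'):
        msg = d1(data).text()
        # print(msg[0:11], len(msg))
        l.append(msg[0:10])  # keep only the first 10 characters, the date
    return l


def get_message(d1):
    """Return the text of every <p> element in d1."""
    s = []
    for data in d1('p'):
        msg1 = d1(data).text()
        s.append(msg1)
    return s


def main():
    d = pq(filename="18.html")
    d1 = d(".ori-block")  # find the HTML block whose class is ori-block
    d2 = d('.text-uppercase').text()  # text of the element whose class is text-uppercase (the tracking number)
    print(type(d2))  # check the returned data type; it is str
    times = get_time(d1)  # parse once instead of on every loop iteration
    messages = get_message(d1)
    i = 0
    while i < len(times):
        print(d2 + "/" + times[i] + "/" + messages[i])
        i += 1


if __name__ == "__main__":
    main()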
The scraped output looks like this:
1Z3Y18900337899118/2018-07-05/LAS VEGAS, NV, US, DELIVERED
1Z3Y18900337899118/2018-07-05/Las Vegas, NV, United States, Destination Scan
1Z3Y18900337899118/2018-07-04/Las Vegas, NV, United States, Arrival Scan
1Z3Y18900337899118/2018-07-04/Departure Scan
1Z3Y18900337899118/2018-07-04/Arrival Scan
1Z3Y18900337899118/2018-07-04/Ontario, CA, United States, Departure Scan
1Z3Y18900337899118/2018-07-04/Origin Scan
1Z3Y18900337899118/2018-06-30/United States, Order Processed: Ready for UPS
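The script builds each output line by indexing the date list and the message list at the same position, which raises an IndexError if the page ever yields fewer messages than dates. As a minimal sketch of a more forgiving pairing, the hypothetical helper format_rows below (not part of the original script) uses zip, which stops at the shorter list. The sample values are copied from the first two output lines above.

```python
def format_rows(tracking_number, dates, messages):
    """Join each date with its message, one line per tracking event.

    format_rows is a hypothetical helper for illustration only.
    """
    # zip stops at the shorter list, so a missing date or message
    # cannot raise IndexError the way positional indexing can.
    return [
        f"{tracking_number}/{date}/{message}"
        for date, message in zip(dates, messages)
    ]


# Sample values copied from the first two lines of the output above.
rows = format_rows(
    "1Z3Y18900337899118",
    ["2018-07-05", "2018-07-05"],
    [
        "LAS VEGAS, NV, US, DELIVERED",
        "Las Vegas, NV, United States, Destination Scan",
    ],
)
for row in rows:
    print(row)
```

In the script, the two lists would be the return values of get_time(d1) and get_message(d1). Note that zip silently drops unmatched trailing items, so comparing the two lengths first is worthwhile if a mismatch should be reported rather than ignored.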
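Planned follow-up updates:
Scraping dynamically by URL
Scraping 40 packages
Scraping more than 40 packages
Scraping through a Python API, and so on.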