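I previously wrote an XML parsing script with Python's minidom. It worked well at first, because the XML files were small. Once a file grew past 70 MB, though, minidom was not only slow but also used an enormous amount of memory, because it reads the entire XML file and builds the complete document tree in memory (the code logic is clear this way, but it really is slow and memory-hungry). A 70 MB XML file ate more than 4 GB of my 8 GB of RAM, which is terrifying. Since the XML files I have to read will probably keep growing, I quickly wrote a new reader script.
Before starting, I read two articles on the subject and decided to use the ET_iter approach they describe.
I then found another blogger's post and modeled my code on theirs: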

    # coding=utf-8
    __author__ = 'Arthur'
    import mysql.connector  # unused in this excerpt
    import sys  # unused in this excerpt
    # cElementTree was removed in Python 3.9; ElementTree uses the C accelerator on its own
    import xml.etree.ElementTree as ET

    if __name__ == "__main__":
        for event, elem in ET.iterparse("test2.xml", events=('start', 'end')):
            if event == 'start':
                if elem.tag == 'product' or elem.tag == 'property' or elem.tag == 'evaluation':
                    print(elem.attrib)
                elif elem.tag == 'result':
                    a_result = elem.attrib
                    a_result['value'] = elem.text
                    if elem.text is None:
                        print("result none")
                    else:
                        print(a_result)
            elif event == 'end':
                if elem.tag == 'products':
                    print("deal with products over")
                elif elem.tag == 'propertys':
                    print("deal with propertys over")
                elif elem.tag == 'evaluations':
                    print("deal with evaluations over")
                elif elem.tag == 'results':
                    print("deal with results over")
            # Runs on every event, 'start' included
            elem.clear()

Running it against an XML file I had constructed myself showed no problems:

    <?xml version='1.0' encoding='utf-8'?>
    <testresults source="ICRT EvalDB" type="data" user="unknown">
      <project id_project="697" icrt_code="IC16539" name="Combined Wearables" comment="">
        <snapshots>
          <snapshot id_snapshot="4" name="Combined snapshot" timestamp_created="1471515160" timestamp_lastchange="1482147798" time_lastchange="2016-12-19 (11:43)">
            <manufacturers>
              <manufacturer id_manufacturer="1" name="Apple" comment="" timestamp_created="1465471929" timestamp_lastchange="0" />
              <manufacturer id_manufacturer="2" name="Fitbit" comment="" timestamp_created="1465471929" timestamp_lastchange="0" />
            </manufacturers>
            <productgroups>
              <productgroup id_productgroup="1" name="SMARTWATCH" comment="" timestamp_created="1465471929" timestamp_lastchange="0" />
              <productgroup id_productgroup="2" name="FITNESS TRACKER" comment="" timestamp_created="1465471929" timestamp_lastchange="0" />
            </productgroups>
            <products>
              <product id_product="10" icrt_code="IC16539-0036-00" modelname="Gear S2" completename="Samsung Gear S2" shortname="" systemmodelid="" releasedate="" labreportdate="2016-05-27T00:00:00.000" labarrivaldate="2016-05-06T00:00:00.000" boughtbyorganisation="WHICH" serialnumber="RFAH105HFQF" articlenumber="8.80608808859E+12" comment="" id_productgroup="1" id_manufacturer="9" sortorder="0" batch="1" labcode="" parentmodelcode="" similarmodelscodes="" testtype="" picture_lores="" picture_hires="" timestamp_created="1465471929" timestamp_lastchange="1466062628" />
              <product id_product="11" icrt_code="IC16539-0040-00" modelname="Vivofit 3" completename="Garmin Vivofit 3" shortname="" systemmodelid="" releasedate="" labreportdate="2016-06-15T00:00:00.000" labarrivaldate="2016-06-24T00:00:00.000" boughtbyorganisation="WHICH" serialnumber="4R0201708" articlenumber="53759 15457" comment="" id_productgroup="2" id_manufacturer="3" sortorder="0" batch="2" labcode="" parentmodelcode="" similarmodelscodes="" testtype="" picture_lores="" picture_hires="" timestamp_created="1469800248" timestamp_lastchange="1475593828" />
              <product id_product="12" icrt_code="IC16539-0047-00" modelname="Go" completename="Withings Go" shortname="" systemmodelid="" releasedate="" labreportdate="2016-06-15T00:00:00.000" labarrivaldate="2016-06-24T00:00:00.000" boughtbyorganisation="WHICH" serialnumber="00:24:E4:39:F0:0D" articlenumber="700546 701481" comment="" id_productgroup="2" id_manufacturer="10" sortorder="0" batch="2" labcode="" parentmodelcode="" similarmodelscodes="" testtype="" picture_lores="" picture_hires="" timestamp_created="1469800248" timestamp_lastchange="1475593828" />
            </products>
            <propertygroups>
              <propertygroup id_propertygroup="36" name="Features|inventory" comment="" timestamp_created="1465222484" timestamp_lastchange="0" />
              <propertygroup id_propertygroup="37" name="Features|Smart" comment="" timestamp_created="1465222484" timestamp_lastchange="0" />
            </propertygroups>
            <propertys>
              <property id_property="381" id_propertygroup="" binding="FIRMWARE" name="Firmware version on device" comment="" max="0" min="0" unit="" precision="0" type="String" use="1" testprogram="1.1.3" timestamp_created="1465222485" timestamp_lastchange="1465222485" />
              <property id_property="382" id_propertygroup="" binding="COMPATABILITY" name="What phones are compatible with device" comment="" max="0" min="0" unit="" precision="0" type="String" use="1" testprogram="1.1.7" timestamp_created="1465222485" timestamp_lastchange="1468831229" />
            </propertys>
            <calculationtypes>
              <calculationtype id_calculationtype="0" name="Arithmetic mean calculation" />
              <calculationtype id_calculationtype="5" name="Geometric mean calculation" />
              <calculationtype id_calculationtype="1" name="Versatility calculation" />
              <calculationtype id_calculationtype="2" name="Free formula calculation (complex)" />
              <calculationtype id_calculationtype="3" name="Minimum calculation" />
              <calculationtype id_calculationtype="4" name="Maximum calculation" />
            </calculationtypes>
            <evaluations>
              <evaluation id_evaluation="3165" id_childs="3185,3199,3176,3166,3180,3175,3195,3615" id_parent="0" id_calculationtype="0" name="total test result" binding="" use_inheritna="0" use_lookuptable="0" use_limiting="0" weighting_normalized="0" weighting_given="1" lookuptable="0.5,1.5,2.5,3.5,4.5,5.5" unit="" precision="3" timestamp_created="1465222499" timestamp_lastchange="1467972637" />
              <evaluation id_evaluation="3166" id_childs="3167" id_parent="3165" id_calculationtype="0" name="App" binding="" use_inheritna="0" use_lookuptable="0" use_limiting="0" weighting_normalized="0" weighting_given="0" lookuptable="0.5,1.5,2.5,3.5,4.5,5.5" unit="" precision="3" timestamp_created="1465222499" timestamp_lastchange="1467969418" />
            </evaluations>
            <results>
              <result id_product="1" id_evaluation="3165" is_downgrading="0" downgrading_value="">3.98268146</result>
              <result id_product="1" id_evaluation="100000635" is_downgrading="0" downgrading_value="">Provides reminders to stand every hour. You can set progress updates to be given every 4, 6 or 8 hours. Congratulates you when you complete a goal and provides individual feedback and history of activity data. Notifications to focus on specific goals _eg activity__, tells you what percentage of your goal is complete </result>
              <result id_product="1" id_evaluation="100000636" is_downgrading="0" downgrading_value="">1</result>
              <result id_product="1" id_evaluation="100000637" is_downgrading="0" downgrading_value="">Using the workout app gives you a breakdown of steps, total and active calories and distance covered for that session as well adding these values onto daily accumulated totals</result>
              <result id_product="1" id_evaluation="100000638" is_downgrading="0" downgrading_value="">1</result>
            </results>
          </snapshot>
        </snapshots>
      </project>
    </testresults>

In real use, however, elem.text sometimes came back wrong: the element clearly had a value, yet reading it returned None. I debugged for a long time without finding the cause (a hand-built XML file is never the real thing, so it cannot fully reproduce real behavior). After a lot of searching I finally found this note in the official documentation:
If you need a fully populated element, look for “end” events instead.
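So that was it: when a start event fires, only the attributes are guaranteed to exist; the text value and the child nodes are not. I figured that switching to handling the end event would fix it. But after I moved the reading logic to the end event, even the small XML file was read incorrectly... why? Fortunately this one was easy to debug, and the cause turned out to be simple: I was subscribed to both start and end, and on start I did nothing except call elem.clear(), so by the time the end event came in, all that was left was an empty node...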
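To convince myself of the documentation's point, here is a minimal sketch that feeds the parser in two pieces with `ET.XMLPullParser`, so the first piece stops right after the opening tag. The helper name `text_seen_at_events` and the two-chunk split are my own assumptions for illustration. My understanding is that `iterparse` reads the file in fixed-size blocks and only yields events after a whole block has been parsed, which would explain why a small hand-made file never showed the problem, but that is an implementation detail rather than a documented guarantee.

```python
import xml.etree.ElementTree as ET


def text_seen_at_events(chunks):
    """Feed XML piece by piece and record (event, tag, text) as events arrive."""
    parser = ET.XMLPullParser(events=("start", "end"))
    seen = []
    for chunk in chunks:
        parser.feed(chunk)
        # Only events produced by the data fed so far are available here.
        for event, elem in parser.read_events():
            seen.append((event, elem.tag, elem.text))
    parser.close()
    return seen


# The first chunk ends right after the opening <result ...> tag,
# imitating a read boundary that falls in the middle of an element.
chunks = ['<results><result id_product="1">', "3.98268146</result></results>"]

for row in text_seen_at_events(chunks):
    print(row)
```

In this sketch the `start` event for `result` arrives before its text has been fed, so `elem.text` is still `None` there; the same element carries its text by the time the `end` event is reported.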
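So!!! There is generally no need to subscribe to both start and end. The blogger I copied from used both, which is completely unnecessary; one is enough, unless you have some special need such as holding on to the root node. When reading values, make sure you read them on end, and that the current node has not already been cleared by then.
The final working code: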
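The clear-on-start mistake is easy to reproduce in isolation. The sketch below is only an illustration: the tiny `XML` sample and the function `attribs_at_end` are hypothetical names I made up, and it parses from an in-memory `io.BytesIO` instead of a file.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature of the real file, small enough to read at a glance.
XML = b'<results><result id_product="1">3.98</result></results>'


def attribs_at_end(clear_on_start):
    """Return (attributes, text) seen for each <result> at its 'end' event."""
    out = []
    for event, elem in ET.iterparse(io.BytesIO(XML), events=("start", "end")):
        if event == "start":
            if clear_on_start:
                elem.clear()  # the mistake: wipes the element before 'end' reads it
        elif elem.tag == "result":
            out.append((dict(elem.attrib), elem.text))
            elem.clear()  # safe here: the element is complete and already read
    return out


print(attribs_at_end(clear_on_start=False))
print(attribs_at_end(clear_on_start=True))
```

On this small input the second call finds nothing left to read at `end`. With a large file the text may survive, because it can be parsed after the `start` event has been handled, but the attributes are still lost.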

    # coding=utf-8
    __author__ = 'Arthur'
    import mysql.connector  # unused in this excerpt
    import sys  # unused in this excerpt
    # cElementTree was removed in Python 3.9; ElementTree uses the C accelerator on its own
    import xml.etree.ElementTree as ET

    if __name__ == "__main__":
        # Note: triggering on 'end' alone is enough
        for event, elem in ET.iterparse("test.xml", events=('end',)):
            if elem.tag in ('product', 'property', 'evaluation'):
                print(elem.attrib)
            elif elem.tag == 'result':
                # Copy the attributes so the later clear() cannot affect them
                a_result = dict(elem.attrib)
                a_result['value'] = elem.text
                if elem.text is None:
                    print("result none")
                else:
                    print(a_result)

            if elem.tag == 'products':
                print("deal with products over")
            elif elem.tag == 'propertys':
                print("deal with propertys over")
            elif elem.tag == 'evaluations':
                print("deal with evaluations over")
            elif elem.tag == 'results':
                print("deal with results over")

            # At 'end' the element is fully populated and already handled, so release it
            elem.clear()

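If the rows need to go somewhere other than `print` (a database insert, for example), the same end-only loop can be wrapped in a generator. This is just a sketch under my own assumptions: `iter_results`, its `tag` parameter, and the `SAMPLE` bytes are hypothetical and not part of the script above.

```python
import io
import xml.etree.ElementTree as ET


def iter_results(source, tag="result"):
    """Yield one dict per matching element: its attributes plus a 'value' key."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Copy the attributes so the row does not share state with
            # the element that is about to be cleared.
            row = dict(elem.attrib)
            row["value"] = elem.text
            yield row
        # Every element is complete at 'end', so it can be released here.
        elem.clear()


# Hypothetical sample; a real call would pass a file path instead.
SAMPLE = (
    b"<results>"
    b'<result id_product="1" id_evaluation="3165">3.98268146</result>'
    b'<result id_product="1" id_evaluation="100000636">1</result>'
    b"</results>"
)

for row in iter_results(io.BytesIO(SAMPLE)):
    print(row)
```

Because it is a generator, only one row is held at a time, which fits the original goal of keeping memory use low on large files.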
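Going from researching a new XML parsing approach to a refactored implementation took only 1 hour; debugging the bug I then wrote took another hour and a half. Painful.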