Parsing XML Documents in Python with ElementTree

19.7.1. Overview

This is a short overview of using xml.etree.ElementTree (ET for short). The goal is to demonstrate some of the building blocks and basic concepts of the module.

19.7.1.1. XML tree and elements

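XML is an inherently hierarchical data format, and the most natural way to represent it is with a tree. ET has two classes for this purpose: ElementTree represents the whole XML document as a tree, and Element represents a single node in that tree. Interactions with the whole document (reading from and writing to files) are usually done at the ElementTree level. Interactions with a single element and its sub-elements are done at the Element level.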

19.7.1.2. Parsing XML

We will use the following XML document as the sample data:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

There are several ways to import the data.

Reading from a file on disk:

import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()

Reading from a string:

root = ET.fromstring(country_data_as_string)

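fromstring() parses XML from a string directly into an Element, which is the root element of the parsed tree. Other parsing functions may create an ElementTree instead. As an Element, root has a tag and a set of attributes (stored in a dictionary):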

>>> root.tag
'data'
>>> root.attrib
{}
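The difference between the two entry points can be checked directly. The following is a minimal sketch, assuming a shortened stand-in for the sample document and using io.StringIO in place of a file on disk; the variable names are illustrative only.

```python
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for the sample document (assumption: same structure).
xml_text = '<data><country name="Liechtenstein"><rank>1</rank></country></data>'

# fromstring() returns the root Element directly.
root_from_string = ET.fromstring(xml_text)

# parse() accepts a filename or a file-like object and returns an ElementTree;
# io.StringIO stands in for a file on disk here.
tree = ET.parse(io.StringIO(xml_text))
root_from_tree = tree.getroot()

print(isinstance(root_from_string, ET.Element))  # True
print(isinstance(tree, ET.ElementTree))          # True
print(root_from_tree.tag)                        # data
```

An Element obtained from fromstring() can be wrapped with ET.ElementTree(element) when a whole-document object is needed, for example to call write().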

It also has child nodes that we can iterate over:

>>> for child in root:
...   print(child.tag, child.attrib)
...
country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}

Children are nested, and we can access specific child nodes by index:

>>> root[0][1].text
'2008'
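Index access works because an Element behaves like a sequence of its direct children. A small sketch, assuming a one-country excerpt of the sample document:

```python
import xml.etree.ElementTree as ET

# One-country excerpt of the sample document (assumption).
xml_text = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
    </country>
</data>"""
root = ET.fromstring(xml_text)

first_country = root[0]          # first child of <data>
child_count = len(first_country) # number of direct children
year = first_country[1]          # second child of that country
child_tags = [child.tag for child in first_country]

print(child_count)               # 3
print(year.tag, year.text)       # year 2008
print(child_tags)                # ['rank', 'year', 'gdppc']
```

Positional access depends on the order of the children in the document, so looking elements up by tag is usually more robust when the layout may change.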

19.7.1.3. Finding interesting elements

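Element has some useful methods that help iterate recursively over the whole sub-tree below it (its children, their children, and so on). For example, Element.iter():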

>>> for neighbor in root.iter('neighbor'):
...   print(neighbor.attrib)
...
{'name': 'Austria', 'direction': 'E'}
{'name': 'Switzerland', 'direction': 'W'}
{'name': 'Malaysia', 'direction': 'N'}
{'name': 'Costa Rica', 'direction': 'W'}
{'name': 'Colombia', 'direction': 'E'}

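Element.findall() finds only the elements with a given tag that are direct children of the current element. Element.find() finds the first child with a given tag, Element.text accesses the element's text content, and Element.get() accesses the element's attributes: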

>>> for country in root.findall('country'):
...   rank = country.find('rank').text
...   name = country.get('name')
...   print(name, rank)
...
Liechtenstein 1
Singapore 4
Panama 68
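The loop above assumes every country has a rank child. When that is not guaranteed, find() returns None and calling .text on the result fails. A hedged sketch of the defensive variants, using a made-up country record that lacks a rank:

```python
import xml.etree.ElementTree as ET

# Hypothetical record with no <rank> child and no 'population' attribute.
country = ET.fromstring('<country name="Atlantis"><year>2011</year></country>')

rank_el = country.find('rank')                        # None: no such child
rank_text = country.findtext('rank', default='n/a')   # text, or the default
year_text = country.findtext('year')                  # text of the first match
population = country.get('population', 'unknown')     # attribute, or the default

print(rank_el)      # None
print(rank_text)    # n/a
print(year_text)    # 2011
print(population)   # unknown
```

findtext() combines find() and .text in one call, which avoids an explicit None check for optional elements.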

More sophisticated specification of which elements to look for is possible by using XPath.
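ElementTree supports only a limited subset of XPath in find(), findall() and iterfind(). The sketch below shows a few path expressions against a condensed copy of the sample document; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Condensed copy of the sample document (rank and gdppc omitted).
xml_text = """
<data>
    <country name="Liechtenstein">
        <year>2008</year>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <year>2011</year>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <year>2011</year>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
"""
root = ET.fromstring(xml_text)

# Attribute predicate: the <year> of the country named Singapore.
singapore_year = root.find("./country[@name='Singapore']/year").text

# Descendant search: every <neighbor> lying to the west.
western = [n.get('name') for n in root.findall(".//neighbor[@direction='W']")]

# Parent step: countries that have at least one western neighbor.
with_west = [c.get('name')
             for c in root.findall(".//neighbor[@direction='W']/..")]

# Position predicate (1-based): the second <country> child.
second = root.find("./country[2]").get('name')

print(singapore_year)  # 2011
print(western)         # ['Switzerland', 'Costa Rica']
print(with_west)       # ['Liechtenstein', 'Panama']
print(second)          # Singapore
```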

19.7.1.4. Modifying an XML file

ElementTree provides a simple way to build XML documents and write them to files. The ElementTree.write() method serves this purpose.

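Once created, an Element object can be manipulated by changing its fields directly (such as Element.text), adding or modifying attributes (the Element.set() method), and adding new children (for example with Element.append()).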
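The three kinds of change can be seen on a single element. This is a minimal sketch; the region attribute is invented for illustration and is not part of the sample document.

```python
import xml.etree.ElementTree as ET

country = ET.fromstring('<country name="Singapore"><rank>4</rank></country>')

# 1. Change a field directly.
country.find('rank').text = '5'

# 2. Add or modify an attribute ('region' is a hypothetical attribute).
country.set('region', 'Asia')

# 3. Append a new child element.
neighbor = ET.Element('neighbor', {'name': 'Malaysia', 'direction': 'N'})
country.append(neighbor)

# Serialize to a str to inspect the result.
print(ET.tostring(country, encoding='unicode'))
```

All changes are made in memory; nothing is written to disk until the tree is serialized.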

Suppose we want to add one to each country's rank and add an updated attribute to the rank element:

>>> for rank in root.iter('rank'):
...   new_rank = int(rank.text) + 1
...   rank.text = str(new_rank)
...   rank.set('updated', 'yes')
...
>>> tree.write('output.xml')

Our XML now looks like this:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank updated="yes">2</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank updated="yes">5</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank updated="yes">69</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
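ElementTree.write() also accepts a file-like object and a few keyword arguments that control the output. The sketch below assumes an in-memory io.BytesIO buffer instead of output.xml, and asks for an explicit encoding and XML declaration.

```python
import io
import xml.etree.ElementTree as ET

root = ET.fromstring('<data><country name="Panama"><rank>68</rank></country></data>')
tree = ET.ElementTree(root)  # wrap the root so that write() is available

# Write to an in-memory buffer; a filename such as 'output.xml' works the same way.
buffer = io.BytesIO()
tree.write(buffer, encoding='utf-8', xml_declaration=True)

written = buffer.getvalue().decode('utf-8')
print(written)
```

Passing xml_declaration=True makes the declaration line explicit rather than relying on the default behaviour for the chosen encoding.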

We can remove elements using Element.remove(). Let's remove all countries with a rank higher than 50:

>>> for country in root.findall('country'):
...   rank = int(country.find('rank').text)
...   if rank > 50:
...     root.remove(country)
...
>>> tree.write('output.xml')

Our XML now looks like this:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank updated="yes">2</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank updated="yes">5</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>
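The removal loop iterates over the list returned by findall() rather than over root itself. Modifying an element while iterating over it directly can skip children, just as with a Python list. A small sketch with made-up country names illustrates the safe pattern:

```python
import xml.etree.ElementTree as ET

# Hypothetical data: two adjacent countries both exceed the threshold.
root = ET.fromstring(
    '<data>'
    '<country name="A"><rank>60</rank></country>'
    '<country name="B"><rank>70</rank></country>'
    '<country name="C"><rank>5</rank></country>'
    '</data>'
)

# findall() returns a separate list, so removing from root inside the loop
# does not disturb the iteration.
for country in root.findall('country'):
    if int(country.find('rank').text) > 50:
        root.remove(country)

remaining = [c.get('name') for c in root]
print(remaining)  # ['C']
```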

19.7.1.5. Building XML documents

The SubElement() function also provides a convenient way to create new sub-elements for a given element:

>>> a = ET.Element('a')
>>> b = ET.SubElement(a, 'b')
>>> c = ET.SubElement(a, 'c')
>>> d = ET.SubElement(c, 'd')
>>> ET.dump(a)
<a><b /><c><d /></c></a>
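The same functions can build a document shaped like the sample data, with attributes and text. This is a sketch under the assumption that only one country is needed; ET.tostring() with encoding='unicode' returns a str instead of bytes.

```python
import xml.etree.ElementTree as ET

# Build <data><country name="Liechtenstein">...</country></data> from scratch.
data = ET.Element('data')
country = ET.SubElement(data, 'country', name='Liechtenstein')

rank = ET.SubElement(country, 'rank')
rank.text = '1'  # text content is set through the .text field

# Attributes can be passed as keyword arguments to SubElement().
ET.SubElement(country, 'neighbor', name='Austria', direction='E')

xml_text = ET.tostring(data, encoding='unicode')
print(xml_text)

# Parse the result back to confirm that it is well-formed.
round_trip = ET.fromstring(xml_text)
print(round_trip.find('country/rank').text)  # 1
```

To save the result to a file, the root can be wrapped with ET.ElementTree(data) and written with write().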

Reposted from: https://www.cnblogs.com/CheeseZH/p/4026686.html