GitHub project: https://github.com/lei940324/toy/tree/master/笔记

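Basics

Introduction to XPath

XPath is a language for finding information in an XML document. It can be used to navigate through the elements and attributes of an XML document.

XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.

An understanding of XPath is therefore the foundation for many advanced XML applications.

General XPath usage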

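Expression	Description
nodename	Selects the child elements of the current node that are named nodename.
/	Selects from the root node.
//	Selects matching nodes among the descendants of the current node, regardless of depth (at the start of a path it searches the whole document).
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.
*	Wildcard; matches any element node.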
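The expressions in the table can be tried out directly with lxml. The following is a minimal sketch on a hypothetical one-book document (the `id` attribute and the variable names are made up for illustration); it parses with `etree.fromstring`, so the root element is `bookstore`.

```python
from lxml import etree

# Hypothetical sample document, parsed with the XML parser.
root = etree.fromstring(
    '<bookstore><book id="b1"><title lang="eng">Harry Potter</title>'
    '<price>29.99</price></book></bookstore>'
)

title = root.xpath('/bookstore/book/title')[0]  # "/" : absolute path from the root
current = title.xpath('.')[0]                   # "." : the context node itself
parent = title.xpath('..')[0]                   # "..": the parent of the context node
lang = title.xpath('@lang')                     # "@" : an attribute of the context node
children = root.xpath('book/*')                 # "*" : any element child of book
everywhere = root.xpath('//price')              # "//": price elements at any depth

print(current.tag, parent.tag, lang)
print([c.tag for c in children], len(everywhere))
```

Note that `book/*` is a relative path: it is evaluated from `root`, which is the `bookstore` element here.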

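Examples

The table below lists some path expressions and what they select:

Path expression	Result
bookstore	Selects the bookstore elements that are children of the current node.
/bookstore	Selects the root element bookstore. Note: a path that starts with a forward slash ( / ) always represents an absolute path to an element.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, wherever they are in the document.
bookstore//book	Selects all book elements that are descendants of the bookstore element, wherever they sit below bookstore.
//@lang	Selects all attributes named lang.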

Note: XPath positions start at 1, whereas Python indices start at 0.
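A small sketch of the indexing difference, using a hypothetical two-book document. The list returned by `xpath()` is an ordinary Python list, so it is indexed from 0, while a positional predicate inside the expression counts from 1.

```python
from lxml import etree

# Hypothetical sample document.
root = etree.fromstring(
    '<bookstore><book><title>A</title></book>'
    '<book><title>B</title></book></bookstore>'
)

# Inside the XPath expression, position 1 is the first book.
first_by_xpath = root.xpath('/bookstore/book[1]/title/text()')

# The result of xpath() is a Python list, so index 0 is the first book.
books = root.xpath('/bookstore/book')
first_by_python = books[0].xpath('title/text()')

# No node ever has position 0, so this matches nothing.
zero = root.xpath('/bookstore/book[0]')

print(first_by_xpath, first_by_python, zero)
```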

Selecting nodes

First, import the lxml library:

from lxml import etree

Take an arbitrary XML document as an example:

xml = '''
<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>

<book>
  <title lang="chinese">Tom</title>
  <price>10</price>
</book>
</bookstore>
'''

selector = etree.HTML(xml)

print('1st book element:')
print(selector.xpath('//book[1]//text()'))

print('\n2nd book element:')
print(selector.xpath('//book[2]//text()'))

print('\nLast book element:')
print(selector.xpath('//book[last()]//text()'))

print('\nSecond-to-last book element:')
print(selector.xpath('//book[last()-1]//text()'))

print('\nFirst 2 book elements:')
print(selector.xpath('//book[position() < 3]//text()'))

print('\nAll title elements that have a lang attribute:')
print(selector.xpath('//title[@lang]//text()'))

print('\nAll title elements whose lang attribute equals chinese:')
print(selector.xpath('//title[@lang="chinese"]//text()'))

print('\nAll title and price nodes:')
print(selector.xpath('//title//text() | //price//text()'))

1st book element:
['\n  ', 'Harry Potter', '\n  ', '29.99', '\n']

2nd book element:
['\n  ', 'Learning XML', '\n  ', '39.95', '\n']

Last book element:
['\n  ', 'Tom', '\n  ', '10', '\n']

Second-to-last book element:
['\n  ', 'Learning XML', '\n  ', '39.95', '\n']

First 2 book elements:
['\n  ', 'Harry Potter', '\n  ', '29.99', '\n', '\n  ', 'Learning XML', '\n  ', '39.95', '\n']

All title elements that have a lang attribute:
['Harry Potter', 'Learning XML', 'Tom']

All title elements whose lang attribute equals chinese:
['Tom']

All title and price nodes:
['Harry Potter', '29.99', 'Learning XML', '39.95', 'Tom', '10']
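One subtlety of positional predicates is worth a quick sketch: `//book[1]` means "every book that is the first book child of its parent", which only coincides with "the first book in the document" when all books share one parent, as they do in the example above. The two-shelf document below is hypothetical and only serves to show the difference; wrapping the path in parentheses applies the predicate to the whole result instead.

```python
from lxml import etree

# Hypothetical document in which books have different parents.
nested = etree.fromstring(
    '<store><shelf><book>A</book><book>B</book></shelf>'
    '<shelf><book>C</book></shelf></store>'
)

# First book child of each parent: one hit per shelf.
per_parent = nested.xpath('//book[1]/text()')

# First book in the whole document: the predicate applies to the full node set.
overall = nested.xpath('(//book)[1]/text()')

print(per_parent)
print(overall)
```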

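Location path expressions

A location path can be absolute or relative.

An absolute path starts with a forward slash ( / ); a relative path does not. In both cases the location path consists of one or more steps, separated by slashes:

Absolute location path: /step/step/...

Relative location path: step/step/...

XPath axes

An axis defines a node set relative to the current node.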
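The difference shows up as soon as `xpath()` is called on an element other than the root. The sketch below uses a hypothetical two-book document; a relative path starts at the element the method is called on, while a path starting with `/` or `//` starts at the document root regardless of that element.

```python
from lxml import etree

# Hypothetical sample document.
root = etree.fromstring(
    '<bookstore><book><title>Harry Potter</title></book>'
    '<book><title>Learning XML</title></book></bookstore>'
)
second_book = root.xpath('/bookstore/book[2]')[0]

# Relative path: evaluated from second_book.
relative = second_book.xpath('title/text()')

# Absolute path: evaluated from the document root, even though it is
# called on second_book.
absolute = second_book.xpath('/bookstore/book/title/text()')

# "//" also starts at the document root; ".//" limits the search
# to the descendants of second_book.
anywhere = second_book.xpath('//title/text()')
scoped = second_book.xpath('.//title/text()')

print(relative, absolute)
print(anywhere, scoped)
```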

selector.xpath('//title/ancestor::*')   # all ancestors (parent, grandparent, ...) of the title nodes

[<Element html at 0x2ca0097e988>,
 <Element body at 0x2ca00878148>,
 <Element bookstore at 0x2ca0097ee88>,
 <Element book at 0x2ca0097e608>,
 <Element book at 0x2ca0097e788>,
 <Element book at 0x2ca0097e488>]

selector.xpath('//title/ancestor-or-self::*')  # all ancestors (parent, grandparent, ...) of the title nodes, plus the title nodes themselves

[<Element html at 0x2ca0097e988>,
 <Element body at 0x2ca00878148>,
 <Element bookstore at 0x2ca0097ee88>,
 <Element book at 0x2ca0097e608>,
 <Element title at 0x2ca00987f48>,
 <Element book at 0x2ca0097e788>,
 <Element title at 0x2ca00987fc8>,
 <Element book at 0x2ca0097e488>,
 <Element title at 0x2ca00995048>]

selector.xpath('//title/attribute::*')         # all attributes of the title nodes

['eng', 'eng', 'chinese']

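Axis name	Result
ancestor	Selects all ancestors (parent, grandparent, etc.) of the current node.
ancestor-or-self	Selects all ancestors (parent, grandparent, etc.) of the current node and the current node itself.
attribute	Selects all attributes of the current node.
child	Selects all children of the current node.
descendant	Selects all descendants (children, grandchildren, etc.) of the current node.
descendant-or-self	Selects all descendants (children, grandchildren, etc.) of the current node and the current node itself.
following	Selects everything in the document after the closing tag of the current node.
following-sibling	Selects all siblings after the current node.
namespace	Selects all namespace nodes of the current node.
parent	Selects the parent of the current node.
preceding	Selects all nodes that come before the opening tag of the current node, excluding its ancestors.
preceding-sibling	Selects all siblings before the current node.
self	Selects the current node.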
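The sibling, parent, child and descendant axes from the table can be sketched the same way. The document below is a trimmed, hypothetical copy of the bookstore example (no attributes or whitespace), and `titles` is a small helper introduced only for this illustration.

```python
from lxml import etree

# Hypothetical, whitespace-free version of the bookstore document.
root = etree.fromstring(
    '<bookstore>'
    '<book><title>Harry Potter</title><price>29.99</price></book>'
    '<book><title>Learning XML</title><price>39.95</price></book>'
    '<book><title>Tom</title><price>10</price></book>'
    '</bookstore>'
)
middle = root.xpath('/bookstore/book[2]')[0]


def titles(nodes):
    """Return the title text of each book element (illustrative helper)."""
    return [node.findtext('title') for node in nodes]


before = titles(middle.xpath('preceding-sibling::book'))   # books before the middle one
after = titles(middle.xpath('following-sibling::book'))    # books after the middle one
parent_tag = middle.xpath('parent::*')[0].tag               # the enclosing element
child_tags = [n.tag for n in middle.xpath('child::*')]      # element children
prices = root.xpath('descendant::price/text()')             # prices at any depth

print(before, after, parent_tag)
print(child_tags, prices)
```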

Functions

The starts-with function

selector.xpath('//title[starts-with(@lang,"ch")]//text()')    # title nodes whose lang value starts with ch

['Tom']

The contains function

selector.xpath('//title[contains(@lang,"en")]//text()')       # title nodes whose lang value contains en

['Harry Potter', 'Learning XML']

Using and

# title nodes whose lang value contains both en and g
selector.xpath('//title[contains(@lang,"en") and contains(@lang,"g")]//text()')

['Harry Potter', 'Learning XML']

Partial match on text

selector.xpath('//title[contains(text(),"L")]//text()')       # title nodes whose text contains L

['Learning XML']
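A caveat about `contains(text(), ...)`: in XPath 1.0, when a node set is passed where a string is expected, only the first node is used. For an element with mixed content that means only the first text node is tested. A minimal sketch with a hypothetical paragraph; `contains(., ...)` tests the full string value of the element instead.

```python
from lxml import etree

# Hypothetical element with mixed content: text, a child tag, more text.
root = etree.fromstring('<div><p>Hello <b>big</b> world</p></div>')

# text() yields two text nodes ("Hello " and " world"); only the first is tested.
by_text = root.xpath('//p[contains(text(), "world")]')

# "." is converted to the concatenated text of the element and its descendants.
by_dot = root.xpath('//p[contains(., "world")]')

print(len(by_text), len(by_dot))
```

In the title example above both forms give the same result, because each title holds a single text node.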

The string function: get the text, returned as a single string

info = selector.xpath('//title/ancestor::*')
# info[3] is the first book element (after html, body and bookstore)
strings = info[3].xpath('string(.)')
print('string():', strings)

texts = info[3].xpath('.//text()')
print('text():', texts)

string(): 
  Harry Potter
  29.99

text(): ['\n  ', 'Harry Potter', '\n  ', '29.99', '\n']
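If the leading, trailing and repeated whitespace in the string result is unwanted, the standard XPath 1.0 function `normalize-space()` can tidy it inside the expression. This is a sketch on a hypothetical single-book snippet, not part of the original example.

```python
from lxml import etree

# Hypothetical book element with the same indentation as the example above.
book = etree.fromstring(
    '<book>\n  <title>Harry Potter</title>\n  <price>29.99</price>\n</book>'
)

raw = book.xpath('string(.)')             # keeps newlines and indentation
clean = book.xpath('normalize-space(.)')  # trims and collapses whitespace runs

print(repr(raw))
print(repr(clean))
```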

Common questions

How do you write "does not contain" in XPath?

Suppose we have the following HTML:

html = '''
<html>
    <head>
        <title>测试XPath移除功能</title>
    </head>
    <body>
        <div class="post">
            <div class="quote">无关紧要的引用内容</div>
                你好啊
                <strong>产品经理</strong>，
                <span>很高兴认识你</span>
                。
        </div>
    </body>
</html>
'''

I want to extract 你好啊产品经理，很高兴认识你 from it.

from lxml import etree
selector = etree.fromstring(html)
selector.xpath('//div[@class="post"]//*[not(@class="quote")]/text()')

['产品经理', '很高兴认识你']

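But 你好啊 is missing here (as is the punctuation), because that text sits directly inside the div rather than inside any child tag.

To also pick up the text directly under the div, join a second XPath with |: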

data = selector.xpath('//div[@class="post"]/text() | //div[@class="post"]//*[not(@class="quote")]/text()')
text = ''.join(map(lambda x: x.strip() , data))
text

'你好啊产品经理，很高兴认识你。'
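The union above only checks the class of each element itself, so text inside a tag nested within the quote would still be picked up. An alternative, sketched here under the assumption that the quote might contain nested markup (the `<em>` tag is added for illustration and is not in the original HTML), is to select text nodes and exclude those that have the quote div as an ancestor.

```python
from lxml import etree

# Variant of the example above; the <em> inside the quote is hypothetical.
html = '''
<div class="post">
    <div class="quote">无关紧要的<em>引用</em>内容</div>
    你好啊
    <strong>产品经理</strong>，
    <span>很高兴认识你</span>
    。
</div>
'''
root = etree.HTML(html)

# Union approach: the <em> does not carry class="quote", so its text leaks in.
union = root.xpath(
    '//div[@class="post"]/text() | '
    '//div[@class="post"]//*[not(@class="quote")]/text()'
)
leaky = ''.join(part.strip() for part in union)

# Ancestor check: drop every text node located anywhere inside the quote div.
parts = root.xpath(
    '//div[@class="post"]//text()[not(ancestor::div[@class="quote"])]'
)
text = ''.join(part.strip() for part in parts)

print(leaky)
print(text)
```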

Tags nested inside tags: how do you extract one complete sentence?

html = '''
<div id="class3">
    我左青龙,
    <span id='tiger'>
        右白虎,
        <ul>上朱雀,
            <li>下玄武.</li>
        </ul>
        老牛在当中,
    </span>
    龙头在胸口.
</div>
'''

selector = etree.HTML(html)
data = selector.xpath('//div[@id="class3"]')[0]

Method 1: use the string function

info = data.xpath('string(.)')   # concatenates all text under the div, effectively dropping the nested tags
print(info)

    我左青龙,
    
        右白虎,
        上朱雀,
            下玄武.
        
        老牛在当中,
    
    龙头在胸口.

content2 = info.replace('\n','').replace(' ','')   # remove newlines and spaces
print(content2)

我左青龙,右白虎,上朱雀,下玄武.老牛在当中,龙头在胸口.

Method 2: use text()

info = data.xpath('.//text()')
info

['\n    我左青龙,\n    ',
 '\n        右白虎,\n        ',
 '上朱雀,\n            ',
 '下玄武.',
 '\n        ',
 '\n        老牛在当中,\n    ',
 '\n    龙头在胸口.\n']

content2 = ''.join(map(lambda x: x.strip() , info))
print(content2)

我左青龙,右白虎,上朱雀,下玄武.老牛在当中,龙头在胸口.
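A third option, not covered above, is lxml's own `itertext()` method, which walks the text of an element and its descendants in document order without going through XPath for the text step. The snippet below is a hypothetical, smaller nested structure rather than the original example.

```python
from lxml import etree

# Hypothetical nested markup: text interleaved with child tags.
root = etree.HTML('<div id="d">a, <span>b, <b>c,</b> d,</span> e.</div>')
div = root.xpath('//div[@id="d"]')[0]

# itertext() yields each text fragment in document order.
fragments = list(div.itertext())

# Strip each fragment and join them into one sentence, as in method 2.
joined = ''.join(fragment.strip() for fragment in fragments)

print(fragments)
print(joined)
```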

Author: 热心市民小磊
