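When parsing web pages, PyQuery is more convenient and quicker to use than regular expressions.
Initialization
A PyQuery object is usually initialized in one of three ways: from a string, from a URL, or from a file.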
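from pyquery import PyQuery as pq
import requests
content = requests.get('https://book.douban.com/').text
doc = pq(content)  # initialize from a string
doc = pq(url='https://book.douban.com/')  # initialize from a URL (the url= keyword is required since pyquery 2.0)
As the first line shows, PyQuery is cumbersome to type, so it is usually imported under the alias pq.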
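CSS selectors
The doc in the code above is a PyQuery object. Calling it with a selector string selects elements, and since that string is a CSS selector, all the usual CSS selector rules apply:
Select all elements with a given tag: doc('tag_name')
Select all elements with a given class: doc('.class_name')
Select the element with a given id: doc('#id_name')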
…
html = '''
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
'''
items = doc('#container').items()  # returns a generator that yields every matched element
for item in items:
    print(item)
'''
print(doc('#container'))  # prints every matched element; an id normally matches just one
print()
print(doc('.item-0'))
print()
print(doc('a'))
output:
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
-------------------------------------------------------------------------------------------------------------------------------
<li class="item-0">first item</li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
--------------------------------------------------------------------------------------------------------------------------------
<a href="link2.html">second item</a>
<a href="link3.html"><span class="bold">third item</span></a>
<a href="link4.html">fourth item</a>
<a href="link5.html">fifth item</a>
Selecting nested content with a PyQuery object
html = '''(omitted)'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('#container .list .item-1'))  # levels of nesting are separated by spaces
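Finding elements
1. Child elements
doc is a PyQuery object, and doc.find('selector') returns a PyQuery object as well. Note that find() searches all descendants, not only direct children; children() restricts the search to direct children.
html = '''(omitted)'''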
from pyquery import PyQuery as pq
doc = pq(html)
print(type(doc))
items = doc('.item-0')
print(type(items))
print(items.find('a'))
output:
<class 'pyquery.pyquery.PyQuery'>
<class 'pyquery.pyquery.PyQuery'>
<a href="link3.html"><span class="bold">third item</span></a><a href="link5.html">fifth item</a>
2. Parent elements
parent() returns the direct parent of each matched element.
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('a').parent())
output:
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
Besides parent(), there is also parents(), which returns all ancestor elements (walking all the way up to the root).
3. Sibling elements
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-1.active')  # no space inside the second part: the element must have both item-1 and active
print(li.siblings())
4. Iteration
items() returns a generator that yields each matched element as a PyQuery object.
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-1')
gen = li.items()  # returns a generator
for x in gen:
    print(x)
output:
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
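5. Extracting information
Getting attributes:
Use attr('attribute_name') to read an attribute of a tag, where attribute_name is the attribute's name.
Getting text:
Text is the content enclosed by an HTML tag, i.e. <tag>text</tag>.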
html = '''
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
attribute = doc('.item-0.active a').attr('href')
print(attribute)
text = doc('.item-1.active a').text()
print(text)
output:
link3.html
fourth item
6. DOM manipulation
These methods modify the parsed HTML:
(1) addClass('class_name'), removeClass('class_name');
(2) attr(), css();
(3) remove()
html = '''
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
item_0 = doc('.item-0')
item_0.remove()
print(doc)