Parsing Web Pages with PyQuery

When parsing web pages, PyQuery is much more convenient and quicker to work with than regular expressions.

Initialization

There are generally three ways to initialize a PyQuery object: pass in a string, pass in a URL, or pass in a file.

from pyquery import PyQuery as pq
import requests	#needed for requests.get below

content = requests.get('https://book.douban.com/').text
doc = pq(content)	#pass in a string
doc = pq('https://book.douban.com/')	#pass in a URL

As the import on the first line shows, PyQuery is a bit cumbersome to type, so we usually give it an alias when importing it.


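CSS Selectors

The doc in the code above is a PyQuery object, and we can use it to select elements. It works as a CSS selector, so all the usual CSS selector rules apply.
Select all elements with a given tag: doc('tag_name')
Select all elements with a given class: doc('.class_name')
Select the element with a given id: doc('#id_name')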

html = '''
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
'''
items = doc('#container').items()	#returns a generator over all matched elements
for item in items:
	print(item)
'''
print(doc('#container'))	#prints every matched element (an id normally matches only one)
print()
print(doc('.item-0'))
print()
print(doc('a'))
output:
<div id="container">
<ul class="list">
     <li class="item-0">first item</li>
     <li class="item-1"><a href="link2.html">second item</a></li>
     <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
     <li class="item-1 active"><a href="link4.html">fourth item</a></li>
     <li class="item-0"><a href="link5.html">fifth item</a></li>
 </ul>
 </div>
 -------------------------------------------------------------------------------------------------------------------------------
<li class="item-0">first item</li>
     <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
     <li class="item-0"><a href="link5.html">fifth item</a></li>
--------------------------------------------------------------------------------------------------------------------------------
<a href="link2.html">second item</a>
<a href="link3.html"><span class="bold">third item</span></a>
<a href="link4.html">fourth item</a>
<a href="link5.html">fifth item</a>

Selecting content by nesting level with a PyQuery object

html = '''(omitted)'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('#container .list .item-1'))	#levels are separated by spaces

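Finding elements

1. Child elements
doc is a PyQuery object, and doc.find('selector') also returns a PyQuery object. Note that find() searches all descendants, not only direct children.
html = '''(omitted)'''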
from pyquery import PyQuery as pq
doc = pq(html)
print(type(doc))
items = doc('.item-0')
print(type(items))
print(items.find('a'))
output:
<class 'pyquery.pyquery.PyQuery'>
<class 'pyquery.pyquery.PyQuery'>
<a href="link3.html"><span class="bold">third item</span></a><a href="link5.html">fifth item</a>
2. Parent elements

parent() returns the direct parent of each matched element.

from pyquery import PyQuery as pq
doc = pq(html)
print(doc('a').parent())
output:
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

Besides parent(), there is also parents(), which returns all ancestor elements (tracing upward all the way to the root).

3. Sibling elements
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-1.active')    #no space in the second part: the element must have both item-1 and active
print(li.siblings())
4. Iteration
items() returns a generator that yields each matched element as a PyQuery object
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-1')
gen = li.items()  #returns a generator
for x in gen:
    print(x)
output:
<li class="item-1"><a href="link2.html">second item</a></li> 
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
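5. Getting information

Getting attributes:

Use attr('attribute_name') to get an attribute of a tag, where attribute_name is the name of the attribute.

Getting text:

Text means the textual content enclosed by HTML tags, i.e. <tag>text</tag>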
html = '''
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
attribute = doc('.item-0.active a').attr('href')
print(attribute)
text = doc('.item-1.active a').text()
print(text)
output:
link3.html
fourth item
6. DOM manipulation

Modifying the HTML:
(1) addClass('class_name'), removeClass('class_name');
(2) attr(), css();
(3) remove()

html = '''
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
item_0 = doc('.item-0')
item_0.remove()
print(doc)

Reprinted from blog.csdn.net/masami269981/article/details/89297366