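BeautifulSoup (bs4) does not rely on regular expressions; it extracts data based on the structure and attributes of the page.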
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
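soup=BeautifulSoup(html,'lxml')  # parse with lxml, which also repairs the markup (adds the missing </body> and </html>)
print(soup.prettify())  # pretty-print with automatic indentation
print(soup.title)  # access a node by tag name; only the FIRST matching node is returned
print(soup.title.string)  # text content of that node
Output: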
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title" name="dromouse">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
<title>The Dormouse's story</title>
The Dormouse's story
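To make the "first node only" behaviour concrete, here is a minimal sketch. The shortened markup and the names `snippet`, `demo_soup`, `first_link` and `all_links` are made up for illustration, and `html.parser` is swapped in for lxml only so the example has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup with two <a> nodes.
snippet = """
<html><body>
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
</body></html>
"""
# Assumption: the built-in 'html.parser' is used here instead of 'lxml'.
demo_soup = BeautifulSoup(snippet, 'html.parser')

first_link = demo_soup.a             # attribute access: the first <a> only
all_links = demo_soup.find_all('a')  # every <a>, in document order

print(first_link['id'])
print([a['id'] for a in all_links])
```

Attribute access such as `demo_soup.a` is convenient for a unique node like `title`; for repeated nodes, `find_all` is the safer choice.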
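Other nodes work the same way as soup.title.string. Each node exposes:
string  the text of the node
name  the tag name
attrs  the attributes, as a dict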
print(soup.p)
print(soup.p.string)
print(soup.p.attrs)
print(soup.p.attrs['name'])
print(soup.p['class'])
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
The Dormouse's story
{'name': 'dromouse', 'class': ['title']}
dromouse
['title']
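The three properties can be compared side by side with a small sketch. The markup below is a made-up variant of the example (the extra `big` class and the `id` are assumptions added for illustration), and `html.parser` stands in for lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: first <p> has one child, second <p> has mixed content.
doc = """<p class="title big" id="intro" name="dromouse"><b>The Dormouse's story</b></p><p class="story">one <a>two</a></p>"""
demo = BeautifulSoup(doc, 'html.parser')
first_p, second_p = demo.find_all('p')

print(first_p.name)         # tag name: p
print(first_p['class'])     # class is multi-valued, so a list: ['title', 'big']
print(first_p['id'])        # single-valued attribute, so a plain string: intro
print(first_p.get('href'))  # .get() returns None for a missing attribute
print(first_p.string)       # single text descendant: The Dormouse's story
print(second_p.string)      # None, because the node has more than one child
```

Note that `string` only yields text when the node has a single child chain ending in text; for mixed content it is `None`.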
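children  direct child nodes
descendants  all descendant nodes
parent  direct parent node
parents  all ancestor nodes
previous_sibling  the previous sibling node
next_sibling  the next sibling node
previous_siblings  all preceding sibling nodes
next_siblings  all following sibling nodes
parent, previous_sibling and next_sibling each return a single node (there is at most one); the others return iterators or generators, which must be looped over or converted to a list.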
print(list(enumerate(soup.p.parents)))
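Here the generator returned by parents is wrapped in enumerate and then converted to a list for printing.
Element 0 is the direct parent body, element 1 is body's parent html, and element 2 is the BeautifulSoup object for the whole document, which prints the same markup as html; that is why the html markup appears twice.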
[(0, <body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>), (1, <html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>), (2, <html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>)]
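The navigation attributes listed earlier can be tried out on a tiny document. This is only a sketch: the markup and the names `doc`, `demo` and `first` are invented for illustration, the tags are written without whitespace between them so that siblings are tags rather than text nodes, and `html.parser` replaces lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, deliberately without whitespace between tags.
doc = '<html><body><p id="first"><b>one</b><i>two</i></p><p id="second">three</p></body></html>'
demo = BeautifulSoup(doc, 'html.parser')
first = demo.p

print([child.name for child in first.children])  # ['b', 'i']
print(len(list(first.descendants)))              # 4: b, 'one', i, 'two'
print(first.parent.name)                         # body
print([node.name for node in first.parents])     # ['body', 'html', '[document]']
print(first.next_sibling['id'])                  # second
print(first.previous_sibling)                    # None: nothing precedes it
print([s['id'] for s in first.next_siblings])    # ['second']
```

In real pages the whitespace between tags is itself a text node, so `next_sibling` often returns a newline string rather than the next tag.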
find_all  returns all matching results
To match on the class attribute, add a trailing underscore (class_), because class is a Python keyword.
print(soup.find_all(class_='sister'))
Match by tag name:
print(soup.find_all(name='p'))
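An attribute that is itself called name cannot be passed as a keyword directly, because name= matches the tag name; pass it through attrs instead.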
print(soup.find_all(attrs={'name':'dromouse'}))
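The three filter styles above can be checked against each other in one short sketch. The trimmed markup and the variable names are assumptions for illustration, `html.parser` is used in place of lxml, and the `id=` keyword filter is added only to contrast with the `name` special case.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed version of the example markup.
doc = """
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
demo = BeautifulSoup(doc, 'html.parser')

by_class = demo.find_all(class_='sister')            # class needs the underscore
by_tag = demo.find_all(name='p')                     # name= filters on the tag name
by_attr = demo.find_all(attrs={'name': 'dromouse'})  # an attribute called "name"
by_id = demo.find_all(id='link2')                    # other attributes work as keywords

print(len(by_class), len(by_tag))
print(by_attr[0]['class'])
print(by_id[0].string)
```

`find_all` always returns a list-like result, even for a single match, so index into it (or loop over it) before reading attributes.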