Beautiful Soup's greatest strength is searching the parse tree, but it can also modify the tree.
1 Modifying a tag's name and attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(f"Before: {tag}")
# Rename the tag, then set two attributes
tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
print(f"After modification: {tag}")
# Delete the attributes again
del tag['class']
del tag['id']
print(f"After deletion: {tag}")
Output:
Before: <b class="boldest">Extremely bold</b>
After modification: <blockquote class="verybold" id="1">Extremely bold</blockquote>
After deletion: <blockquote>Extremely bold</blockquote>
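2 Modifying .string
Assigning to a tag's .string attribute replaces the tag's existing contents with the new string. Any child tags the tag contained are removed along with the old text.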
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a
tag.string = "New link text."
print(tag)
Output:
<a href="http://example.com/">New link text.</a>
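The effect on nested tags is easy to miss, so here is a small sketch that checks it explicitly. It reuses the markup from the example above; the only additions are the calls to find() and the look at .contents, which are there purely for inspection.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

# Before the assignment, <a> still has its nested <i> child.
had_child_before = tag.find("i") is not None

# Assigning to .string discards everything inside <a>, including <i>.
tag.string = "New link text."

print(had_child_before)  # True
print(tag.find("i"))     # None
print(tag.contents)      # ['New link text.']
```

If the nested tags need to survive, append() or insert() is the safer choice, since neither touches the existing children.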
3 append()
Tag.append() adds content to the end of a tag, much like calling append() on a Python list.
soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
soup.a.append("Bar")
print(soup)
print(soup.a.contents)
Output:
<a>FooBar</a>
['Foo', 'Bar']
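4 NavigableString() and .new_tag()
To add a piece of text to the document, use NavigableString().
To create a comment or an instance of any other NavigableString subclass, call that subclass's constructor (for example, Comment).
The best way to create a new tag is to call the factory method BeautifulSoup.new_tag().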
soup = BeautifulSoup("<b></b>", 'html.parser')
original_tag = soup.b
# Create an <a> tag with an href attribute and attach it to <b>
new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
print(original_tag)
new_tag.string = "Link text."
print(original_tag)
Output:
<b><a href="http://www.example.com"></a></b>
<b><a href="http://www.example.com">Link text.</a></b>
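The example above only covers new_tag(). As a complementary sketch, the following shows text and comment nodes being built with the NavigableString and Comment constructors and appended to a tag. The sample strings are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup("<b></b>", "html.parser")
tag = soup.b

# A plain Python string is converted to a NavigableString on append.
tag.append("Hello")

# An explicitly constructed NavigableString behaves the same way.
tag.append(NavigableString(" there"))

# Comment is a NavigableString subclass; it is rendered as <!--...-->.
tag.append(Comment("Nice to see you."))

print(tag)           # <b>Hello there<!--Nice to see you.--></b>
print(tag.contents)  # ['Hello', ' there', 'Nice to see you.']
```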
5 insert()
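Tag.insert() is similar to Tag.append(). The difference is that the new element is not added to the end of the parent's .contents; it is inserted at the position you specify, just as list.insert() does for a Python list.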
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a
# Position 1 is between the text node and the <i> tag
tag.insert(1, "but did not endorse ")
print(tag)
print(tag.contents)
Output:
<a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
['I linked to ', 'but did not endorse ', <i>example.com</i>]
6 insert_before() and insert_after()
insert_before() inserts content immediately before the current tag or text node.
insert_after() inserts content immediately after the current tag or text node.
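A minimal sketch of both methods, assuming a one-tag document made up for illustration. insert_before() is called on a text node and insert_after() on a tag, to show that both kinds of node support them.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>leave</b>", "html.parser")

# Build a new <i> tag and place it before the text node inside <b>.
new_tag = soup.new_tag("i")
new_tag.string = "Never"
soup.b.string.insert_before(new_tag)
print(soup.b)  # <b><i>Never</i>leave</b>

# Place a new text node right after the <i> tag.
soup.b.i.insert_after(" ever ")
print(soup.b)           # <b><i>Never</i> ever leave</b>
print(soup.b.contents)  # [<i>Never</i>, ' ever ', 'leave']
```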
7 clear()
Tag.clear() removes the contents of the current tag.
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a
tag.clear()
print(tag)
Output:
<a href="http://example.com/"></a>
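8 Other methods
Method                       Description
PageElement.extract()        Removes the current tag or string from the tree and returns it.
Tag.decompose()              Removes the current tag from the tree and destroys it together with its contents.
PageElement.replace_with()   Removes a piece of content from the tree and puts a new tag or text node in its place.
PageElement.wrap()           Wraps the element in the tag you specify and returns the new wrapper.
Tag.unwrap()                 Replaces the tag with its contents, that is, strips the tag itself but keeps what was inside it.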
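The methods in the table can be sketched as follows. The markup is the same link used in the earlier sections, plus a short paragraph for wrap(); the variable names are illustrative only, and the document is re-parsed before each step so that the steps do not affect one another.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

# extract(): detach <i> from the tree but keep it as a standalone element.
soup = BeautifulSoup(markup, "html.parser")
i_tag = soup.i.extract()
after_extract = str(soup.a)
print(after_extract)  # <a href="http://example.com/">I linked to </a>
print(i_tag)          # <i>example.com</i>

# decompose(): detach <i> and destroy it; nothing useful is returned.
soup = BeautifulSoup(markup, "html.parser")
soup.i.decompose()
after_decompose = str(soup.a)
print(after_decompose)  # <a href="http://example.com/">I linked to </a>

# replace_with(): swap <i> for a newly created <b> tag.
soup = BeautifulSoup(markup, "html.parser")
new_tag = soup.new_tag("b")
new_tag.string = "example.net"
soup.i.replace_with(new_tag)
after_replace = str(soup.a)
print(after_replace)  # <a href="http://example.com/">I linked to <b>example.net</b></a>

# wrap(): put a text node inside a new <b> tag; the wrapper is returned.
soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")
wrapper = soup.p.string.wrap(soup.new_tag("b"))
after_wrap = str(soup)
print(after_wrap)  # <p><b>I wish I was bold.</b></p>

# unwrap(): remove the <i> tag itself but keep its text in place.
soup = BeautifulSoup(markup, "html.parser")
soup.i.unwrap()
after_unwrap = str(soup.a)
print(after_unwrap)  # <a href="http://example.com/">I linked to example.com</a>
```

The practical difference between extract() and decompose() is whether the removed element is needed afterwards: extract() hands it back for reuse, while decompose() discards it.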
9 Source code used in this article
from bs4 import BeautifulSoup

# 1 Modifying a tag's name and attributes
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(f"Before: {tag}")
tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
print(f"After modification: {tag}")
del tag['class']
del tag['id']
print(f"After deletion: {tag}")

# 2 Modifying .string
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a
tag.string = "New link text."
print(tag)

# 3 append()
soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
soup.a.append("Bar")
print(soup)
print(soup.a.contents)

# 4 new_tag()
soup = BeautifulSoup("<b></b>", 'html.parser')
original_tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
print(original_tag)
new_tag.string = "Link text."
print(original_tag)

# 5 insert()
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a
tag.insert(1, "but did not endorse ")
print(tag)
print(tag.contents)

# 7 clear()
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a
tag.clear()
print(tag)