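I. Introduction to BeautifulSoup
Beautiful Soup is a Python library for parsing HTML and XML that makes it easy to extract data from web pages. It automatically converts incoming documents to Unicode and outgoing documents to UTF-8.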
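The encoding behaviour can be illustrated with a minimal sketch. It assumes the built-in html.parser backend and a made-up Latin-1 byte string. The from_encoding argument is passed only to make the input encoding explicit.

```python
from bs4 import BeautifulSoup

# Hypothetical input: the bytes of "<p>café</p>" encoded as Latin-1.
raw = "<p>café</p>".encode("latin-1")

# from_encoding tells Beautiful Soup how to decode the bytes;
# without it the library tries to detect the encoding itself.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string   # a Unicode str, regardless of the input encoding
out = soup.p.encode()  # bytes; encode() defaults to UTF-8

print(repr(text))
print(out)
```

The parsed text is an ordinary Python string, and the bytes produced by encode() are UTF-8 even though the input was Latin-1.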
II. A simple BeautifulSoup example
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())           # the whole document, indented
print(soup.title)                # the <title> tag
print(soup.title.name)           # the tag's name
print(soup.title.string)         # the text inside <title>
print(soup.title.parent.name)    # the name of the enclosing tag
print(soup.p)                    # the first <p> tag
print(soup.p["class"])           # the class attribute of the first <p>
print(soup.a)                    # the first <a> tag
print(soup.find_all('a'))        # every <a> tag
print(soup.find(id='link3'))     # the tag whose id is link3
III. Parsers supported by Beautiful Soup
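| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 and Python 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires an external C library |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires an external C library |
| html5lib | BeautifulSoup(markup, "html5lib") | Most tolerant; parses documents the way a browser does; produces valid HTML5 | Slow; requires an external Python package |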
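To see how the parsers differ on malformed input, here is a small sketch. The helper parse_with and the test markup are illustrative assumptions, not part of Beautiful Soup. Parsers that are not installed are skipped rather than treated as errors.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with(parser, markup="<a></p>"):
    """Return the parsed document as a string, or None if the parser is missing."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        return None


# Parse the same broken fragment with each HTML parser from the table.
results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}
for name, result in results.items():
    print(name, "->", result if result is not None else "not installed")
```

Each parser repairs the stray closing tag in its own way, so it is worth naming the parser explicitly if the same output is needed on every machine.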
IV. Basic BeautifulSoup usage
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
1. Tag selectors
Accessing soup.<tag name> returns that tag together with its content.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(soup.head)
print(soup.p)  # if there are several p tags, only the first one is returned
2. Tag selectors: getting the name
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)
3. Tag selectors: getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
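A few details of attribute access are easy to miss, so here is a short sketch. It assumes the built-in html.parser backend and a made-up one-line snippet.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with a multi-valued class attribute.
snippet = '<p class="title story" name="dromouse"><b>text</b></p>'
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

name_attr = p["name"]   # single-valued attribute: a plain string
classes = p["class"]    # class can hold several values, so it is a list
missing = p.get("id")   # get() returns None instead of raising KeyError
all_attrs = p.attrs     # every attribute as a dict

print(name_attr, classes, missing, all_attrs)
```

Using get() is the safer choice when an attribute may be absent, since p["id"] would raise KeyError here.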
4. Child and descendant nodes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # direct children, as a list
print(soup.p.children)  # direct children, as an iterator
for i, child in enumerate(soup.p.children):
    print(i, child)
print(soup.p.descendants)  # all descendants, as a generator
for i, child in enumerate(soup.p.descendants):
    print(i, child)
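The difference between children and descendants is easier to see on a tiny document. The following sketch assumes the html.parser backend and a made-up fragment.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b>two</p>", "html.parser")
p = soup.p

direct = p.contents                # list of direct children: the <b> tag and the text "two"
children = list(p.children)        # the same nodes, produced lazily by an iterator
descendants = list(p.descendants)  # depth-first walk: the <b> tag, its text, then "two"

print(len(direct), len(children), len(descendants))
print([str(node) for node in descendants])
```

The text inside the b tag counts as a descendant of the paragraph but not as one of its direct children.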
5. Parent, ancestor, and sibling nodes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)  # the direct parent node
print(list(enumerate(soup.a.parents)))  # all ancestor nodes
print(list(enumerate(soup.a.next_siblings)))  # all following sibling nodes
print(list(enumerate(soup.a.previous_siblings)))  # all preceding sibling nodes
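Sibling traversal includes text nodes, which often surprises newcomers. The sketch below assumes the html.parser backend and a made-up fragment with two links separated by a comma.

```python
from bs4 import BeautifulSoup

fragment = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(fragment, "html.parser")
first = soup.a

siblings = list(first.next_siblings)          # the text ", " and then the second <a>
next_link = first.find_next_sibling("a")      # skips text nodes, returns the next <a>
ancestors = [tag.name for tag in first.parents]  # names of all enclosing nodes

print(siblings)
print(next_link)
print(ancestors)
```

When only tags are wanted, find_next_sibling() with a tag name is usually more convenient than filtering next_siblings by hand.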
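V. Method selectors
find_all() searches the document by tag name, attributes, or text content.
find_all(name, attrs, recursive, string, limit, **kwargs)
(Older versions of Beautiful Soup call the string argument text.)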
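import re

# Search by tag name
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))
# Search by attributes
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))
# Search by text (this argument is named text in older versions of Beautiful Soup)
print(soup.find_all(string=re.compile('link')))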
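For a version of these queries that can be run on its own, here is a hedged sketch. The document doc is made up for illustration, html.parser is assumed as the backend, and the string argument assumes a reasonably recent Beautiful Soup.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document with two lists.
doc = """
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Foo link</li></ul>
"""
soup = BeautifulSoup(doc, "html.parser")

uls = soup.find_all(name="ul")                             # by tag name
by_id = soup.find_all(attrs={"id": "list-1"})              # by attribute dict
by_class = soup.find_all("li", class_="element", limit=2)  # class_ keyword, at most 2 results
texts = soup.find_all(string=re.compile("link"))           # text nodes matching a regex

print(len(uls), len(by_id), len(by_class), [str(t) for t in texts])
```

Note that a string query returns the matching text nodes themselves, not the tags that contain them.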
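find_all()  # returns all matching elements
find()  # returns the first matching element
find_parents()  # returns all ancestor nodes
find_parent()  # returns the direct parent node
find_next_siblings()  # returns all following sibling nodes
find_next_sibling()  # returns the first following sibling node
find_previous_siblings()  # returns all preceding sibling nodes
find_previous_sibling()  # returns the first preceding sibling node
find_all_next()  # returns all matching nodes after this node
find_next()  # returns the first matching node after this node
find_all_previous()  # returns all matching nodes before this node
find_previous()  # returns the first matching node before this node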
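The relative-search methods can be tried on a small list. This is a sketch under stated assumptions: the fragment is made up and html.parser is used as the backend.

```python
from bs4 import BeautifulSoup

fragment = '<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
soup = BeautifulSoup(fragment, "html.parser")

# Start from the middle item and look around it.
bar = soup.find("li", string="Bar")

parent_id = bar.find_parent("ul")["id"]                 # nearest enclosing <ul>
next_text = bar.find_next_sibling("li").get_text()      # first <li> after it
prev_text = bar.find_previous_sibling("li").get_text()  # first <li> before it
after = [li.get_text() for li in bar.find_all_next("li")]       # every later <li>
before = [li.get_text() for li in bar.find_all_previous("li")]  # every earlier <li>

print(parent_id, prev_text, next_text, before, after)
```

All of these methods accept the same filters as find_all(), so a tag name, attributes, or a string can be used to narrow the search.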
VI. CSS selectors
Pass a CSS selector directly to select() to make the selection.
html = '''
<div class='panel'>
    <div class='panel-heading'>
        <h4>Hello</h4>
    </div>
    <div class='panel-body'>
        <ul class='list' id='list-1'>
            <li class='element'>Foo</li>
            <li class='element'>Bar</li>
            <li class='element'>Jay</li>
        </ul>
        <ul class='list list-small' id='list-2'>
            <li class='element'>Foo</li>
            <li class='element'>Bar</li>
        </ul>
    </div>
</div>
'''
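1. Selecting tags
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # .panel-heading nodes inside .panel
print(soup.select('ul li'))  # every li inside a ul
print(soup.select('#list-2 .element'))  # .element nodes inside the node with id list-2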
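The space in a selector matters: without it the selector asks for a single element that satisfies both parts. The sketch below shows the difference on a made-up document, assuming the html.parser backend and a Beautiful Soup version that ships CSS selector support.

```python
from bs4 import BeautifulSoup

# Hypothetical document, a shortened form of the panel markup.
html_doc = """
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <div class="panel-body">
    <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
    <ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>
  </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

compound = soup.select(".panel.panel-heading")     # one element carrying both classes
descendant = soup.select(".panel .panel-heading")  # .panel-heading nested inside .panel
in_list2 = soup.select("#list-2 .element")         # .element nested inside #list-2
first_li = soup.select_one("ul li")                # only the first match

print(len(compound), len(descendant))
print([li.get_text() for li in in_list2], first_li.get_text())
```

select() always returns a list, while select_one() returns a single tag or None.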
2. Selecting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
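Attributes can also be used inside the selector itself. The following is a minimal sketch, assuming the html.parser backend and a made-up fragment in which one list has no id.

```python
from bs4 import BeautifulSoup

fragment = (
    '<ul class="list" id="list-1"></ul>'
    '<ul class="list list-small" id="list-2"></ul>'
    '<ul></ul>'
)
soup = BeautifulSoup(fragment, "html.parser")

ids = [ul.get("id") for ul in soup.select("ul")]  # get() gives None where id is missing
with_id = soup.select("ul[id]")                   # only lists that have an id attribute
small = soup.select('ul[id="list-2"]')            # exact attribute match

print(ids, len(with_id), small[0]["class"])
```

Filtering in the selector avoids a KeyError from ul['id'] on elements that lack the attribute.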
3. Selecting text
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
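get_text() and the string attribute behave differently when a tag has mixed content. Here is a short sketch, assuming the html.parser backend and a made-up list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Foo</li><li> Bar <b>Baz</b> </li></ul>", "html.parser")
items = soup.select("li")

simple = items[0].string                     # single text child: the text itself
mixed = items[1].string                      # several children: None
raw = items[1].get_text()                    # all nested text, whitespace kept
joined = items[1].get_text(" ", strip=True)  # pieces stripped, joined with a space

print(repr(simple), repr(mixed), repr(raw), repr(joined))
```

For scraped pages, get_text() with strip=True is usually the more robust choice because it copes with nested tags and stray whitespace.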