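1. Installation
pip install beautifulsoup4 (pip install bs4 also works; that package simply installs beautifulsoup4)
The examples below use the lxml parser, which is installed separately: pip install lxml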
2. Usage
- Creating a Beautiful Soup object
from bs4 import BeautifulSoup
soup = BeautifulSoup(str, 'lxml')  # str is the HTML string defined in the test code below
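- The four kinds of objects
Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds:
- Tag
- NavigableString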
- BeautifulSoup
- Comment
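To make the four kinds concrete, here is a minimal sketch that parses a tiny document and records the type of each kind of node. The markup and variable names are made up for illustration, and it assumes Python's built-in html.parser instead of lxml so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample markup, not taken from a real page.
html_doc = '<p class="intro">hello<!--a note--></p>'

# Assumption: html.parser (standard library) is used here instead of lxml.
soup = BeautifulSoup(html_doc, 'html.parser')

p = soup.p                 # a Tag
text, note = p.contents    # a NavigableString followed by a Comment

kinds = {
    'soup': type(soup).__name__,
    'p': type(p).__name__,
    'text': type(text).__name__,
    'note': type(note).__name__,
}
print(kinds)
```

Comment is a subclass of NavigableString, and BeautifulSoup is a subclass of Tag, which is why the two pairs behave so similarly.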
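2.1 Tag
A Tag corresponds to a tag in the HTML, for example <title> or <div>. Note: accessing a tag by name as an attribute (soup.title, soup.div) returns only the first matching tag in the document.
- Getting a tag
soup = BeautifulSoup(str, 'lxml')
print(soup.title)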
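- Getting tag attributes
print(soup.div.attrs)         # dict of all attributes of the first <div>
print(soup.div.get('class'))  # returns None if the attribute is missing
print(soup.div['class'])      # raises KeyError if the attribute is missing
print(soup.a['href'])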
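The two points above (only the first match is returned, and the different ways to read an attribute) can be sketched as follows. The markup is a made-up sample, and html.parser is assumed in place of lxml so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with two <div> tags.
html_doc = '''
<div class="info box" id="first">one</div>
<div class="info">two</div>
'''

# Assumption: html.parser is used so the sketch needs no lxml install.
soup = BeautifulSoup(html_doc, 'html.parser')

first = soup.div                    # attribute access: only the first <div>
all_divs = soup.find_all('div')     # find_all: every <div>

classes = first['class']            # class is multi-valued, so this is a list
missing = first.get('href')         # missing attribute: None, no exception
fallback = first.get('href', '#')   # missing attribute with a default value

try:
    first['href']                   # subscript access on a missing attribute
    raised = False
except KeyError:
    raised = True

print(first['id'], len(all_divs), classes, missing, fallback, raised)
```

Using get() is usually the safer choice on scraped pages, where an attribute may or may not be present.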
2.2 NavigableString: getting the content
print(soup.strong.string)  # the single child string of the tag
print(soup.strong.text)    # the text inside the tag, joined into one str
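The difference between .string and .text only shows up when a tag has more than one child. A small sketch, using made-up markup and assuming html.parser rather than lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: <p> has three children (<b>, a string, <i>).
html_doc = '<p><b>bold</b> and <i>italic</i></p>'

# Assumption: html.parser is used so the sketch needs no lxml install.
soup = BeautifulSoup(html_doc, 'html.parser')

single = soup.b.string       # <b> has exactly one child string
multi = soup.p.string        # <p> has several children, so .string is None
joined = soup.p.text         # .text concatenates every string under <p>
same = soup.p.get_text()     # get_text() is the method form of .text

print(repr(single), repr(multi), repr(joined))
```

When the structure of the tag is not known in advance, .text (or get_text()) avoids having to handle the None case.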
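2.3 BeautifulSoup
The BeautifulSoup object represents the entire content of a document. Most of the time it can be treated like a Tag object: it supports navigating the tree, searching, and similar methods.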
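As a rough illustration of "can be treated like a Tag", the sketch below calls Tag-style members directly on the soup object. The markup is made up, and html.parser is assumed instead of lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup.
html_doc = '<ul><li>a</li><li>b</li></ul>'

# Assumption: html.parser is used so the sketch needs no lxml install.
soup = BeautifulSoup(html_doc, 'html.parser')

is_tag = isinstance(soup, Tag)                        # BeautifulSoup subclasses Tag
doc_name = soup.name                                  # special name of the document object
top_level = [child.name for child in soup.children]   # traversal works as on a Tag
items = [li.text for li in soup.find_all('li')]       # searching works as on a Tag

print(is_tag, doc_name, top_level, items)
```

The special name '[document]' is one way to tell the document object apart from an ordinary tag.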
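2.4 Comment
A Comment is a special kind of NavigableString that holds the content of an HTML comment. It is imported with: from bs4.element import Comment
if isinstance(soup.strong.string, Comment):
    # print(soup.strong.text)
    print(soup.strong.prettify())
else:
    print(type(soup.strong.string))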
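A common follow-up is to collect or strip every comment in a page. The sketch below is one possible way to do that, with made-up markup and html.parser assumed in place of lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample markup containing two comments.
html_doc = '<div><!--first-->text<p><!--second--></p></div>'

# Assumption: html.parser is used so the sketch needs no lxml install.
soup = BeautifulSoup(html_doc, 'html.parser')

# Find every string node that is a Comment.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in comments]

# Remove the comments from the tree, leaving the other content in place.
for c in comments:
    c.extract()
cleaned = str(soup)

print(comment_texts)
print(cleaned)
```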
3. Searching the document tree
3.1 The find_all() filter
- Code
from bs4 import BeautifulSoup
from bs4.element import Comment

# Sample HTML for illustration only. In practice Beautiful Soup is usually run on
# HTML fetched by a crawler, so there is no need to write it by hand.
str = '''
<title id="title">西邮</title>
<div class="info" float="left">welcome to xupt</div>
<div class="info" float="right">
<span>Good Study</span>
<a href="www.baidu.com"></a>
<strong><!--注释!--></strong>
</div>
'''
soup = BeautifulSoup(str, 'lxml')
print('--------------Test--------------')  # the trailing comments show the output
print(soup.title)                 # <title id="title">西邮</title>
print(soup.div)                   # <div class="info" float="left">welcome to xupt</div>
print(soup.div.attrs)             # {'class': ['info'], 'float': 'left'}
print(soup.div.get('class'))      # ['info']
print(soup.div['class'])          # ['info']
print(soup.div.text)              # welcome to xupt
print(soup.div.string)            # welcome to xupt
print(soup.a['href'])             # www.baidu.com
print(soup.strong.string)         # 注释!
print(type(soup.strong.string))   # <class 'bs4.element.Comment'>
if isinstance(soup.strong.string, Comment):
    # print(soup.strong.text)
    print(soup.strong.prettify())
else:
    print(type(soup.strong.string))
# Output:
# <strong>
#  <!--注释!-->
# </strong>
print('------------------find_all------------------')
print(soup.find_all('title'))       # [<title id="title">西邮</title>]
print(soup.find_all(id='title'))    # [<title id="title">西邮</title>]
print(soup.find_all(class_='info'))
# Output:
# [<div class="info" float="left">welcome to xupt</div>, <div class="info" float="right">
# <span>Good Study</span>
# <a href="www.baidu.com"></a>
# <strong><!--注释!--></strong>
# </div>]
print(soup.find_all('div', attrs={'float': 'left'}))  # [<div class="info" float="left">welcome to xupt</div>]
print('-------------------CSS selectors: select()--------------------')
print(soup.select('title'))     # [<title id="title">西邮</title>]
print(soup.select('#title'))    # [<title id="title">西邮</title>]
print(soup.select('.info'))
# Output:
# [<div class="info" float="left">welcome to xupt</div>, <div class="info" float="right">
# <span>Good Study</span>
# <a href="www.baidu.com"></a>
# <strong><!--注释!--></strong>
# </div>]
print(soup.select('div span'))                # [<span>Good Study</span>]
print(soup.select('div > span'))              # [<span>Good Study</span>]
print(soup.select('div')[1].select('a'))      # [<a href="www.baidu.com"></a>]
print(soup.select('title')[0].text)           # 西邮
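Besides a tag name or keyword arguments, find_all() also accepts a list, a compiled regular expression, or a function as its filter, and find() and select_one() return a single result instead of a list. The sketch below shows these variants on a shortened copy of the sample HTML above. It assumes html.parser instead of lxml, and the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the sample HTML used in these notes.
html_doc = '''
<div class="info" float="left">welcome to xupt</div>
<div class="info" float="right">
<span>Good Study</span>
<a href="www.baidu.com"></a>
</div>
'''

# Assumption: html.parser is used so the sketch needs no lxml install.
soup = BeautifulSoup(html_doc, 'html.parser')

# A list matches any of the given tag names.
by_list = [t.name for t in soup.find_all(['span', 'a'])]
# A compiled regex is matched against tag names.
by_regex = [t.name for t in soup.find_all(re.compile('^s'))]
# A function receives each Tag and returns True to keep it.
by_func = [t.name for t in soup.find_all(lambda t: t.has_attr('href'))]

limited = soup.find_all('div', limit=1)       # stop after the first match
first_div = soup.find('div')                  # a single Tag, or None if absent
first_span = soup.select_one('div > span')    # first CSS match, or None
right = soup.select('div[float="right"]')     # CSS attribute selector

print(by_list, by_regex, by_func)
print(first_div.text, first_span.text, len(right))
```

find() and select_one() return None when nothing matches, so the result should be checked before accessing .text or attributes on real pages.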