1. Introduction to bs4
Beautiful Soup is a library for extracting data from HTML or XML documents.
It needs to be installed first. It is best to run pip install lxml before pip install bs4, otherwise you may run into errors.
With bs4 there is no query syntax to memorize; you simply call its methods, which is what makes it more convenient than regular expressions or XPath.
2. Getting started with bs4
Let's use a short HTML document to demonstrate how to use bs4.
from bs4 import BeautifulSoup # import the BeautifulSoup class, the most commonly used class in bs4
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
To extract content from the document above with Beautiful Soup, we first have to parse it into a bs4 object.
from bs4 import BeautifulSoup # import the BeautifulSoup class, the most commonly used class in bs4
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,features='lxml') # we pass two arguments: the document itself, and features='lxml', which selects the parser
print(soup) # print it; we get a BeautifulSoup object
Result:
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
For a more clearly structured printout, we can do this:
print(soup.prettify())
This gives a cleaner structure tree, which makes it easy to see how the tags relate to each other.
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
Now, to print the title tag, we can do this:
print(soup.title)
Output:
<title>The Dormouse's story</title>
To get the tag's name and the string inside it:
print(soup.title.name)
print(soup.title.string)
Result:
title
The Dormouse's story
To get the p tag:
print(soup.p)
We find that only the first of the three p tags is returned:
<p class="title"><b>The Dormouse's story</b></p>
To find all of them, we can use the find_all method:
res = soup.find_all('p')
print(res,len(res))
The result is a list with every p tag as an element. Its length is 3, so all three p tags were found.
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>] 3
Notice that each a tag has an href attribute containing a URL. How do we get those? Like this:
links = soup.find_all('a')
for link in links:
    print(link.get('href'))
This yields:
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
Those are the basic bs4 operations; as you can see, they are simple and concise.
3. Types of bs4 objects
Tag : a tag
NavigableString : a navigable string
BeautifulSoup : the soup object itself
Comment : a comment
Let's get to know them through some code.
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
print(type(soup.title))
print(type(soup.p))
print(type(soup.a))
The result shows that all three are Tag objects.
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
With a Tag object we can do the following:
print(soup.p.name)
print(soup.p.attrs)
print(soup.p.string)
Result:
p
{'class': ['title']}
The Dormouse's story
And:
print(type(soup.p.string))
gives:
<class 'bs4.element.NavigableString'>
This is a NavigableString. It behaves like an ordinary string, so it supports concatenation and all the usual string operations.
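As a quick check, here is a minimal sketch showing that a NavigableString can be converted and concatenated like any str (the stdlib 'html.parser' is used here so the snippet runs without lxml installed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='title'><b>The Dormouse's story</b></p>", 'html.parser')
s = soup.p.string                # a NavigableString
text = str(s) + '!'              # convert and concatenate like an ordinary string
print(text)                      # The Dormouse's story!
```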
print(type(soup))
We see that this is the soup object itself:
<class 'bs4.BeautifulSoup'>
Now let's look at the comment type, which is not used very often. Let's write an arbitrary comment:
html = '<a><!--新年快乐!!--></a>'
soup = BeautifulSoup(html,'lxml')
print(soup.a.string)
First print it to see the effect:
新年快乐!!
The comment text was printed. Now check its type:
html = '<a><!--新年快乐!!--></a>'
soup = BeautifulSoup(html,'lxml')
print(type(soup.a.string))
We see it is a Comment object:
<class 'bs4.element.Comment'>
Good. With these operations we have now met all four object types.
4. Traversing the document tree
First, an overview of the commonly used parsers:
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; good tolerance for bad markup | Poor tolerance in Python versions before 2.7.3 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast; good tolerance for bad markup | Requires the lxml C library |
lxml XML parser | BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml") | Very fast; the only parser that supports XML | Requires the lxml C library |
html5lib | BeautifulSoup(markup, "html5lib") | Best error tolerance; pure Python with no external C extensions; parses pages the way a browser does and produces an HTML5 document | Slow |
The lxml parser is recommended for its efficiency, but you can of course switch parsers to suit your needs.
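To see the difference in error tolerance concretely, here is a small sketch on a deliberately broken fragment. Only the stdlib parser's output is shown; the other parsers' behavior is described in comments, since lxml and html5lib may not be installed:

```python
from bs4 import BeautifulSoup

broken = "<a></p>"   # a dangling </p> and an unclosed <a>

# The stdlib parser does minimal repair and keeps just the fragment:
print(BeautifulSoup(broken, "html.parser"))   # <a></a>

# "lxml" (if installed) would wrap the result in <html><body>...</body></html>,
# and "html5lib" would build a full HTML5 tree, browser-style.
```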
4.1 contents, children, descendants
contents returns a list of all child nodes
children returns an iterator over the child nodes
descendants returns a generator that walks all descendants, recursively
Code:
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
print(soup.contents,type(soup.contents))
print('-*'*60)
print(soup.children,type(soup.children))
print('-*'*60)
print(soup.descendants,type(soup.descendants))
print('-*'*60)
The first result is a list, the second an iterator, and the third a generator.
[<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>] <class 'list'>
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
<list_iterator object at 0x000001C65451CA60> <class 'list_iterator'>
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
<generator object Tag.descendants at 0x000001C653EB1580> <class 'generator'>
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
A concrete example makes this easier to see:
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
html_Tag = soup.html
print(html_Tag.contents)
The result is a list of all child nodes, including one of the newline characters. We can then use ordinary list operations to access its elements.
[<head><title>The Dormouse's story</title></head>, '\n', <body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>]
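Because contents includes those '\n' text nodes, it is common to filter them out when only the tags matter. A small sketch (using the stdlib 'html.parser' on a shortened document):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = "<html><head><title>t</title></head>\n<body><p>hi</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# keep only real Tag children, dropping the '\n' text nodes
tags = [child for child in soup.html.contents if isinstance(child, Tag)]
print([t.name for t in tags])   # ['head', 'body']
```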
Since the second one is an iterator, we can loop over it to get each element:
soup = BeautifulSoup(html_doc,'lxml')
html_Tag = soup.html
for i in html_Tag.children:
    print(i)
    print('*'*100)
We get two child nodes; since there is a newline between them, an empty line is printed in the middle.
<head><title>The Dormouse's story</title></head>
****************************************************************************************************
# (a newline character was printed here)
****************************************************************************************************
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
****************************************************************************************************
The third one is a generator, which we can also loop over:
soup = BeautifulSoup(html_doc,'lxml')
html_Tag = soup.html
for i in html_Tag.descendants:
    print(i)
    print('*'*100)
The result looks like this: every child node, and every child of every child, is visited in turn, like peeling an onion.
<head><title>The Dormouse's story</title></head>
****************************************************************************************************
<title>The Dormouse's story</title>
****************************************************************************************************
The Dormouse's story
****************************************************************************************************
****************************************************************************************************
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
****************************************************************************************************
****************************************************************************************************
<p class="title"><b>The Dormouse's story</b></p>
****************************************************************************************************
<b>The Dormouse's story</b>
****************************************************************************************************
The Dormouse's story
****************************************************************************************************
****************************************************************************************************
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
****************************************************************************************************
Once upon a time there were three little sisters; and their names were
****************************************************************************************************
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
****************************************************************************************************
Elsie
****************************************************************************************************
,
****************************************************************************************************
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
****************************************************************************************************
Lacie
****************************************************************************************************
and
****************************************************************************************************
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
****************************************************************************************************
Tillie
****************************************************************************************************
;
and they lived at the bottom of a well.
****************************************************************************************************
****************************************************************************************************
<p class="story">...</p>
****************************************************************************************************
...
****************************************************************************************************
****************************************************************************************************
The blank spots are all caused by newline characters.
4.2 string, strings, stripped_strings
string gets the content inside a tag
strings returns a generator, used to get the content of multiple tags
stripped_strings does the same as strings, but strips the extra whitespace from the content
For example, to get the content of the title tag:
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
title_Tag = soup.title
print(title_Tag.string)
The result is the content of the title tag:
The Dormouse's story
Next, let's look at this:
soup = BeautifulSoup(html_doc,'lxml')
html_Tag = soup.html
s = html_Tag.strings
for i in s: # since s is a generator, we loop over it to get its contents
    print(i)
Result:
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
All of the string content has been retrieved.
One more:
soup = BeautifulSoup(html_doc,'lxml')
# title_Tag = soup.title
# print(title_Tag.string)
html_Tag = soup.html
s = html_Tag.stripped_strings
for i in s:
    print(i)
This time the extra whitespace is gone:
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
This is a convenient way to keep the extracted content clean.
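Related to strings and stripped_strings, Tag objects also have a get_text() method that joins every string in a subtree into one plain str; a minimal sketch (again on a tiny fragment with the stdlib parser):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b>!</p>", "html.parser")

print(soup.p.get_text())                           # Hello world!
# strip=True trims each piece; separator joins the pieces back together
print(soup.p.get_text(separator=" ", strip=True))  # Hello world !
```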
4.3 parent, parents
parent gets the direct parent node
parents gets all ancestor nodes
Example:
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
title_Tag = soup.title
print(title_Tag.parent)
Result:
<head><title>The Dormouse's story</title></head>
Now parents. Since the result is a generator, we loop over it:
for i in title_Tag.parents:
    print(i)
All the ancestor nodes come out:
<head><title>The Dormouse's story</title></head>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
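One handy use of parents is building the tag-name path of a node up to the document root. Note that the BeautifulSoup object itself appears last, with the name '[document]'. A small sketch:

```python
from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# the tag name of every ancestor of <title>, innermost first
path = [parent.name for parent in soup.title.parents]
print(path)   # ['head', 'html', '[document]']
```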
4.4 Sibling nodes
next_sibling the next sibling node
previous_sibling the previous sibling node
next_siblings all following sibling nodes
previous_siblings all preceding sibling nodes
Example:
from bs4 import BeautifulSoup
html_tex = '<a><b>bbb</b><c>ccc</c><d>dddd</d></a>'
soup2 = BeautifulSoup(html_tex,'lxml')
b_Tag = soup2.b
d_Tag = soup2.d
print(b_Tag.next_sibling)
print(d_Tag.previous_sibling)
print(b_Tag.next_siblings)
print(d_Tag.previous_siblings)
Result (the last two are generators):
<c>ccc</c>
<c>ccc</c>
<generator object PageElement.next_siblings at 0x0000021679FA1510>
<generator object PageElement.previous_siblings at 0x0000021679FA1510>
Let's loop over the last two:
for i in b_Tag.next_siblings:
    print(i)
print('-*'*20)
for j in d_Tag.previous_siblings:
    print(j)
The two results are separated by the divider line:
<c>ccc</c>
<d>dddd</d>
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
<c>ccc</c>
<b>bbb</b>
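In real pages, sibling tags are usually separated by newline text nodes, so next_sibling often returns '\n' rather than the next tag. The find_next_sibling() and find_previous_sibling() methods skip straight to the next matching tag; a small sketch:

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li>\n<li>two</li></ul>"
soup = BeautifulSoup(html, "html.parser")
first = soup.li

print(repr(first.next_sibling))        # '\n' -- the text node in between
print(first.find_next_sibling("li"))   # <li>two</li>
```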
5. Key methods: find and find_all
We can compile a regular expression with re.compile and pass it to find or find_all as a filter. find returns only the first match, while find_all returns all matches as a list.
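A minimal sketch of that regex filter, run against a tiny fragment with the stdlib parser:

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><b>bold</b><i>italic</i></body></html>", "html.parser")

# find_all accepts a compiled regex: every tag whose name starts with "b"
names = [tag.name for tag in soup.find_all(re.compile("^b"))]
print(names)   # ['body', 'b']
```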
Sample HTML document:
html = """
<table class="tablelist" cellpadding="0" cellspacing="0">
<tbody>
<tr class="h">
<td class="l" width="374">职位名称</td>
<td>职位类别</td>
<td>人数</td>
<td>地点</td>
<td>发布时间</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-金融云区块链高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-金融云高级后台开发</a></td>
<td>技术类</td>
<td>2</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐运营开发工程师(深圳)</a></td>
<td>技术类</td>
<td>2</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=31235&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐业务运维工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218">TEG03-高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=34532&keywords=python&tid=87&lid=2218">TEG03-高级图像算法研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218">TEG11-高级AI开发工程师(深圳)</a></td>
<td>技术类</td>
<td>4</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=32218&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a id="test" class="test" target='_blank' href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218">SNG11-高级业务运维工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
</tbody>
</table>
"""
5.1 Get all the tr tags
from bs4 import BeautifulSoup
# html = the sample document above, omitted here
soup = BeautifulSoup(html,'lxml')
trs = soup.find_all('tr')
print(trs)
This prints all the tr tags, returned as a list:
[<tr class="h">
<td class="l" width="374">职位名称</td>
<td>职位类别</td>
<td>人数</td>
<td>地点</td>
<td>发布时间</td>
</tr>, <tr class="even">
<td class="l square"><a href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218" target="_blank">22989-金融云区块链高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>, <tr class="odd">
<td class="l square"><a href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218" target="_blank">22989-金融云高级后台开发</a></td>
<td>技术类</td>
<td>2</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>, <tr class="even">
<td class="l square"><a href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218" target="_blank">SNG16-腾讯音乐运营开发工程师(深圳)</a></td>
<td>技术类</td>
<td>2</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>, <tr class="odd">
<td class="l square"><a href="position_detail.php?id=31235&keywords=python&tid=87&lid=2218" target="_blank">SNG16-腾讯音乐业务运维工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>, <tr class="even">
<td class="l square"><a href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218" target="_blank">TEG03-高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>, <tr class="odd">
<td class="l square"><a href="position_detail.php?id=34532&keywords=python&tid=87&lid=2218" target="_blank">TEG03-高级图像算法研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>, <tr class="even">
<td class="l square"><a href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218" target="_blank">TEG11-高级AI开发工程师(深圳)</a></td>
<td>技术类</td>
<td>4</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>, <tr class="odd">
<td class="l square"><a href="position_detail.php?id=32218&keywords=python&tid=87&lid=2218" target="_blank">15851-后台开发工程师</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>, <tr class="even">
<td class="l square"><a href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218" target="_blank">15851-后台开发工程师</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>, <tr class="odd">
<td class="l square"><a class="test" href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218" id="test" target="_blank">SNG11-高级业务运维工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>]
5.2 Get the second tr tag
We can use list indexing to pick the target:
tr_2 = soup.find_all('tr')[1]
print(tr_2)
This prints the second tr tag:
<tr class="even">
<td class="l square"><a href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218" target="_blank">22989-金融云区块链高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
5.3 Get all tr tags with class="even"
The setup code is the same as before and is omitted here to save space; only the key lines are shown.
trs = soup.find_all('tr',class_="even")
print(trs)
This prints all the tr tags with class="even":
[<tr class="even">
<td class="l square"><a href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218" target="_blank">22989-金融云区块链高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>, <tr class="even">
<td class="l square"><a href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218" target="_blank">SNG16-腾讯音乐运营开发工程师(深圳)</a></td>
<td>技术类</td>
<td>2</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>, <tr class="even">
<td class="l square"><a href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218" target="_blank">TEG03-高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>, <tr class="even">
<td class="l square"><a href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218" target="_blank">TEG11-高级AI开发工程师(深圳)</a></td>
<td>技术类</td>
<td>4</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>, <tr class="even">
<td class="l square"><a href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218" target="_blank">15851-后台开发工程师</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>]
Note: because class is a reserved keyword in Python, bs4 uses "class_" in place of "class".
5.4 Extract all a tags with id="test" and class="test"
lst = soup.find_all('a',class_="test",id="test")
print(lst)
This extracts exactly the target we wanted:
[<a class="test" href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218" id="test" target="_blank">SNG11-高级业务运维工程师(深圳)</a>]
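The same query can also be written with the attrs parameter, which takes a plain dict and avoids the class_ spelling. A small sketch on a tiny stand-in fragment:

```python
from bs4 import BeautifulSoup

html = '<a id="test" class="test" href="#">x</a><a id="other" href="#">y</a>'
soup = BeautifulSoup(html, "html.parser")

# equivalent to soup.find_all('a', class_="test", id="test")
lst = soup.find_all("a", attrs={"id": "test", "class": "test"})
print(len(lst), lst[0]["id"])   # 1 test
```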
5.5 Get the href attribute of every a tag
a_lst = soup.find_all('a')
for a in a_lst:
    href = a.get('href')
    print(href)
This extracts all the href values:
position_detail.php?id=33824&keywords=python&tid=87&lid=2218
position_detail.php?id=29938&keywords=python&tid=87&lid=2218
position_detail.php?id=31236&keywords=python&tid=87&lid=2218
position_detail.php?id=31235&keywords=python&tid=87&lid=2218
position_detail.php?id=34531&keywords=python&tid=87&lid=2218
position_detail.php?id=34532&keywords=python&tid=87&lid=2218
position_detail.php?id=31648&keywords=python&tid=87&lid=2218
position_detail.php?id=32218&keywords=python&tid=87&lid=2218
position_detail.php?id=32217&keywords=python&tid=87&lid=2218
position_detail.php?id=34511&keywords=python&tid=87&lid=2218
Later on, this technique can be used to collect the URLs on a page.
Besides the approach above, it can also be written like this:
a_lst = soup.find_all('a')
for a in a_lst:
    href = a['href']
    print(href)
The result is identical, so it is not repeated here.
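The difference between the two spellings only shows up when an attribute is missing: a.get('x') returns None, while a['x'] raises KeyError. A small sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/x">link</a>', "html.parser")
a = soup.a

print(a["href"])       # /x
print(a.get("title"))  # None -- missing attribute, no error
try:
    a["title"]         # the subscript form raises instead
except KeyError:
    print("KeyError")
```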
5.6 Get the text of all the job listings
Looking at the document, we see that every tr tag except the first contains job information, so we can write:
trs = soup.find_all('tr')[1:]
for tr in trs:
    position = tr.find_all('td')[0].string
    print(position)
This extracts all the job titles:
22989-金融云区块链高级研发工程师(深圳)
22989-金融云高级后台开发
SNG16-腾讯音乐运营开发工程师(深圳)
SNG16-腾讯音乐业务运维工程师(深圳)
TEG03-高级研发工程师(深圳)
TEG03-高级图像算法研发工程师(深圳)
TEG11-高级AI开发工程师(深圳)
15851-后台开发工程师
15851-后台开发工程师
SNG11-高级业务运维工程师(深圳)
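Going one step further, each data row can be turned into a dict. The sketch below uses a tiny stand-in table with the same shape as the sample document (the column names and row values here are made up for illustration):

```python
from bs4 import BeautifulSoup

# a tiny stand-in table: one header row, then data rows
html = """
<table>
<tr><td>Title</td><td>Count</td></tr>
<tr><td>Engineer A</td><td>1</td></tr>
<tr><td>Engineer B</td><td>2</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

jobs = []
for tr in soup.find_all("tr")[1:]:   # skip the header row, as in 5.6
    tds = tr.find_all("td")
    jobs.append({"title": str(tds[0].string), "count": str(tds[1].string)})
print(jobs)   # [{'title': 'Engineer A', 'count': '1'}, {'title': 'Engineer B', 'count': '2'}]
```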
Almost all of the examples from here on will use the find() and find_all() methods.
Later we will cover the select() method, which requires some knowledge of CSS selectors. Here is a reference link for reviewing them:
https://www.w3school.com.cn/cssref/css_selectors.asp
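As a quick preview, select() takes a CSS selector string and returns a list of matching tags (in bs4 4.7+ this is powered by the soupsieve package, which is installed with bs4):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="story"><a id="link1">Elsie</a></p>', "html.parser")

# tag, .class, and #id syntax, just like in CSS
print(soup.select("p.story a#link1"))   # [<a id="link1">Elsie</a>]
```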
That covers everything for this section.