Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/Yk_0311/article/details/82351259
Beautiful Soup is a Python library for parsing HTML and XML; it can be used to extract data from web pages.
Import
from bs4 import BeautifulSoup
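Parsers
Beautiful Soup relies on an underlying parser to do the actual parsing.
Beautiful Soup supports the following parsers: Python's built-in html.parser, lxml (for HTML and XML), and html5lib.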
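As a minimal sketch of how a parser is selected: the second argument to BeautifulSoup names the parser. The markup string below is made up for illustration. 'html.parser' ships with Python, while lxml and html5lib would have to be installed separately.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with unclosed tags, used only for illustration.
markup = "<p>Hello <b>world"

# 'html.parser' is Python's built-in parser; no extra install is needed.
soup = BeautifulSoup(markup, "html.parser")

print(soup.b.string)  # text inside the <b> tag
print(soup.p.b.name)  # the <b> tag ends up nested inside <p>
```

Different parsers may repair broken markup differently, so it is worth naming the parser explicitly rather than relying on whichever one happens to be installed.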
Basic elements of the BeautifulSoup class
Extracting information
1. Tag
import requests
from bs4 import BeautifulSoup
r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # 'html.parser' is the parser
# print(soup.prettify())  # pretty-print the parsed document
print(soup.title)
tag = soup.a
print(tag)
# Output
<title>This is a python demo page</title>
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Any tag that exists in the HTML can be accessed with soup.<tag>.
If the same tag occurs more than once, soup.<tag> returns only the first one.
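The "first match only" behaviour can be checked with a short sketch. It assumes a trimmed, hard-coded copy of the demo page's links so that it runs without network access; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed, hard-coded copy of the demo page's links (an assumption for offline use).
html = (
    '<p class="course">'
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.a                  # attribute access returns the first <a> only
same = soup.find("a")           # find() also returns the first match
all_links = soup.find_all("a")  # find_all() returns every match

print(first["id"])
print(first is same)
print(len(all_links))
```

To reach the second and later tags, use find_all() and index or iterate over the result.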
2. Tag attrs (attributes)
import requests
from bs4 import BeautifulSoup
r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # 'html.parser' is the parser
# print(soup.prettify())  # pretty-print the parsed document
tag = soup.a
print(tag)
print(tag.attrs)  # all attributes of this tag, as a dict
print(tag.attrs['href'])  # the href attribute
# Output
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
http://www.icourse163.org/course/BIT-268001
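A few related ways of reading attributes, sketched on a hard-coded copy of the first link (an assumption so the example runs offline). Note that class comes back as a list because it is a multi-valued attribute, as the dict printed above also shows.

```python
from bs4 import BeautifulSoup

# Hard-coded copy of the first demo link (an assumption for offline use).
html = '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

print(tag["href"])       # shorthand for tag.attrs["href"]
print(tag["class"])      # multi-valued attribute, returned as a list
print(tag.get("title"))  # a missing attribute gives None instead of raising KeyError
print(tag.name)          # the tag's name
```

Using tag.get() is the safer choice when an attribute may be absent on some tags.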
3. Tag NavigableString (the non-attribute string inside a tag)
import requests
from bs4 import BeautifulSoup
r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # 'html.parser' is the parser
# print(soup.prettify())  # pretty-print the parsed document
print(soup.a)
print(soup.a.string)
print(soup.p)
print(soup.p.string)
print(type(soup.a.string))
# Output
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
<p class="title"><b>The demo python introduces several python courses.</b></p>
The demo python introduces several python courses.
<class 'bs4.element.NavigableString'>
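One caveat worth sketching: .string only works when a tag has a single child (or a single chain of nested children, as with the p and b above). For a tag with several children it returns None, and get_text() can be used instead. The shortened markup below is an assumption made for offline use.

```python
from bs4 import BeautifulSoup

# Shortened, hard-coded variant of the demo page body (an assumption for offline use).
html = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
title_p, course_p = soup.find_all("p")

print(title_p.string)       # single nested child: the inner string is returned
print(course_p.string)      # several children: None
print(course_p.get_text())  # concatenates all strings inside the tag
```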
4. Tag Comment (the comment part of a string inside a tag)
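The source gives no example for Comment, so here is a minimal sketch on made-up markup. Comment is a subclass of NavigableString, and printing it shows the text without the <!-- --> markers, so a type check is needed to tell a comment from ordinary text.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: one tag holding a comment, one holding ordinary text.
html = "<b><!--This is a comment--></b><p>This is not a comment</p>"
soup = BeautifulSoup(html, "html.parser")

b_string = soup.b.string
p_string = soup.p.string

print(b_string)                       # comment text, without the <!-- --> markers
print(type(b_string))                 # bs4.element.Comment
print(isinstance(b_string, Comment))  # True for the comment
print(isinstance(p_string, Comment))  # False for ordinary text
```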
HTML traversal methods in the bs4 library
1. Downward traversal
######contents
import requests
from bs4 import BeautifulSoup
r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # 'html.parser' is the parser
# print(soup.prettify())  # pretty-print the parsed document
print(soup.body)
print(soup.body.contents)
# Output
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body>
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
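As you can see, the result is a list; the body node contains both text and tag nodes.
Note that every element of the list is a direct child of body. The b node inside the first p node, for example, is a descendant rather than a direct child, so it does not appear as a list element of its own.
######children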
import requests
from bs4 import BeautifulSoup
r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # 'html.parser' is the parser
# print(soup.prettify())  # pretty-print the parsed document
print(soup.body.children)
for i, item in enumerate(soup.body.children):  # children is an iterator
    print(i, item)
# enumerate() wraps an iterable (such as a list, tuple, or string) and yields each item together with its index
# Output
<list_iterator object at 0x0000011B94A0A0B8>
0
1 <p class="title"><b>The demo python introduces several python courses.</b></p>
2
3 <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
4
######descendants
Returns all descendant nodes.
import requests
from bs4 import BeautifulSoup
r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # 'html.parser' is the parser
# print(soup.prettify())  # pretty-print the parsed document
print(soup.body.descendants)
for i, item in enumerate(soup.body.descendants):  # descendants is a generator
    print(i, item)
# Output
<generator object descendants at 0x000001B909020C50>
0
1 <p class="title"><b>The demo python introduces several python courses.</b></p>
2 <b>The demo python introduces several python courses.</b>
3 The demo python introduces several python courses.
4
5 <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
6 Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
7 <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
8 Basic Python
9 and
10 <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
11 Advanced Python
12 .
13
All descendant nodes were found.
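To make the difference between the three downward attributes concrete, here is a small sketch that counts what each one returns. The markup is a made-up, shortened body, not the real demo page.

```python
from bs4 import BeautifulSoup

# Made-up, shortened body used only for counting nodes.
html = "<body>\n<p><b>bold</b></p>\n<p>plain <a>link</a></p>\n</body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

children = list(body.children)        # direct children only (same nodes as body.contents)
descendants = list(body.descendants)  # every nested tag and string, depth first

# Names of the tags among the descendants; strings have name None and are skipped.
tag_names = [node.name for node in descendants if node.name is not None]

print(len(body.contents), len(children), len(descendants))
print(tag_names)
```

The newline strings between tags count as children too, which is why the direct-child count is larger than the number of p tags.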
2. Upward traversal
######parent
import requests
from bs4 import BeautifulSoup
r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # 'html.parser' is the parser
print(soup.a.parent.prettify())
# Here we look up the parent of the first a node
# Output
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
Advanced Python
</a>
.
</p>
######parents
import requests
from bs4 import BeautifulSoup
r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # 'html.parser' is the parser
for i, item in enumerate(soup.a.parents):  # parents yields every ancestor, nearest first
    print(i, item)
# Output
0 <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
1 <body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body>
2 <html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
3 <html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
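Items 2 and 3 in the output above look identical because the last ancestor is the BeautifulSoup object itself, which prints the same as the html tag it wraps. Printing the names instead of the nodes makes this visible. The sketch below uses a minimal, made-up page.

```python
from bs4 import BeautifulSoup

# Minimal made-up page with the same nesting as the demo page.
html = ("<html><head><title>t</title></head>"
        "<body><p><a id='link1'>Basic Python</a></p></body></html>")
soup = BeautifulSoup(html, "html.parser")

# Collect only the names of the ancestors, nearest first.
names = [parent.name for parent in soup.a.parents]
print(names)

# The BeautifulSoup object is the top of the tree and has no parent.
print(soup.parent)
```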
3. Sibling traversal
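The source gives no example for this part, so the following is a sketch under stated assumptions: the markup is a shortened copy of the demo page's second paragraph, and the attributes shown (next_sibling, previous_sibling, next_siblings, find_next_sibling) are the standard bs4 names for moving between nodes that share a parent.

```python
from bs4 import BeautifulSoup

# Shortened copy of the demo page's second paragraph (an assumption for offline use).
html = ('<p class="course">Python courses: <a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html, "html.parser")
first = soup.a

print(repr(first.next_sibling))      # the string right after the first <a>
print(repr(first.previous_sibling))  # the string right before it

# next_siblings iterates over everything that follows under the same parent.
following = list(first.next_siblings)
for sibling in following:
    print(repr(sibling))

# find_next_sibling() skips strings and returns the next matching tag.
print(first.find_next_sibling("a")["id"])
```

Siblings are often plain strings rather than tags, so the next node after a tag is not necessarily the next tag.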
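Those are the HTML traversal methods in the bs4 library.
As the results show, some of the items returned are not tags (for example, the newline strings). How can we tell them apart?
import bs4
for tr in soup.body.children:
    if isinstance(tr, bs4.element.Tag):  # check whether the item is a tag
        print(tr.name)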
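A self-contained version of the same check, as a sketch: it keeps only the children that are tags and drops the whitespace strings. The markup and the name tags_only are illustrative.

```python
import bs4
from bs4 import BeautifulSoup

# Made-up body with newline strings between the tags.
html = "<body>\n<p class='title'>first</p>\n<p class='course'>second</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

# Keep only real tags; NavigableString items such as '\n' are filtered out.
tags_only = [child for child in soup.body.children
             if isinstance(child, bs4.element.Tag)]

print(len(list(soup.body.children)))        # tags plus newline strings
print([tag["class"] for tag in tags_only])  # only the two <p> tags remain
```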
HTML content search methods in the bs4 library
<>.find_all(name, attrs, recursive, string, **kwargs)
import requests
from bs4 import BeautifulSoup
import re
r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
# print(soup.prettify())
'''
<>.find_all(name, attrs, recursive, string, **kwargs)
name: filter on the tag name
attrs: filter on tag attribute values; specific attributes can be named
recursive: whether to search all descendants; defaults to True
string: filter on the string content between <>...</>
'''
# name
# Search for a tags
print(1, soup.find_all('a'))  # returns a list
# Search for a and b tags
print(2, soup.find_all(['a', 'b']))  # matches both tag names
# name=True
print(3, soup.find_all(True))  # name=True returns every tag
for tag in soup.find_all(True):
    print(tag.name)
# attrs
# 4.1, 4.2 and 4.3 are equivalent
print(4.1, soup.find_all('p', attrs={'class': 'course'}))  # p tags whose class is course
print(4.2, soup.find_all('p', class_='course'))  # class is a Python keyword, hence the trailing underscore
print(4.3, soup.find_all('p', 'course'))
# 5.1 and 5.2 use the same kind of filter, but only 5.1 matches:
# a string filter must equal the whole attribute value, and no tag has id='link'
print(5.1, soup.find_all(attrs={"id": 'link1'}))  # tags with id=link1
print(5.2, soup.find_all(id='link'))  # returns []
# message1 = soup.find_all("tr", attrs={"bgcolor": "#ffffff"})
# recursive
print(6, soup.find_all('a'))
print(7, soup.find_all('a', recursive=False))  # [] because no a tag is a direct child of soup
# string
print(8, soup.find_all(string='Basic Python'))
print(9, soup.find_all(string=re.compile("python")))  # regular expression; returns a list of the strings containing "python"
# Output
1 [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
2 [<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
3 [<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>, <head><title>This is a python demo page</title></head>, <title>This is a python demo page</title>, <body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body>, <p class="title"><b>The demo python introduces several python courses.</b></p>, <b>The demo python introduces several python courses.</b>, <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
html
head
title
body
p
b
p
a
a
4.1 [<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
4.2 [<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
4.3 [<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
5.1 [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
5.2 []
6 [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
7 []
8 ['Basic Python']
9 ['This is a python demo page', 'The demo python introduces several python courses.']
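Two of the results above are empty lists (5.2 and 7), which is easy to misread. The sketch below, on a shortened made-up copy of the page, shows one way to get matches in both cases: a regular expression for partial attribute matching, and calling find_all on the tag whose direct children are wanted.

```python
import re
from bs4 import BeautifulSoup

# Shortened, made-up copy of the demo page.
html = ("<html><body><p class='course'><a id='link1'>Basic Python</a> and "
        "<a id='link2'>Advanced Python</a>.</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(id="link")                # string filter: whole value must match
partial = soup.find_all(id=re.compile("link"))  # regex filter: matches link1 and link2

from_root = soup.find_all("a", recursive=False)  # <a> is not a direct child of the document
from_p = soup.p.find_all("a", recursive=False)   # but both <a> tags are direct children of <p>

print(len(exact), len(partial))
print(len(from_root), len(from_p))
```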
Other functions
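The source does not say which functions it has in mind here. As one possible illustration, these are a few relatives of find_all that exist in bs4, shown on a shortened made-up copy of the page.

```python
from bs4 import BeautifulSoup

# Shortened, made-up copy of the demo page.
html = ("<html><body><p class='course'><a id='link1'>Basic Python</a> and "
        "<a id='link2'>Advanced Python</a>.</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")            # first match only, or None if nothing matches
limited = soup.find_all("a", limit=1)  # find_all that stops after one result
shorthand = soup("a")                  # calling the object is shorthand for find_all
parent_p = first_link.find_parent("p") # nearest ancestor that matches
missing = soup.find("table")           # no such tag on this page

print(first_link["id"], len(limited), len(shorthand))
print(parent_p["class"], missing)
```

Because find() returns None when nothing matches, check the result before reading attributes from it.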