Web Scraping Notes (3): Using Beautiful Soup

1. Introduction to Beautiful Soup

2. Parsers

3. Installing Beautiful Soup

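1. Introduction to Beautiful Soup

When we extracted information from web pages with regular expressions earlier, a single mistake in the pattern was enough to make the extraction fail. Web pages, however, have a definite structure and hierarchy. Beautiful Soup is a powerful parsing tool that uses the structure and attributes of a page to parse it, so it can pull out content with much simpler statements than regular expressions.

Put simply, Beautiful Soup is a Python library for parsing HTML and XML that makes it easy to extract data from web pages.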

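2. Parsers

A comparison of the parsers that Beautiful Soup supports shows that lxml can parse both HTML and XML, is fast, and tolerates malformed markup well, so it is the recommended choice. To use it, pass 'lxml' as the second argument when creating the BeautifulSoup object.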

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>hello</p>','lxml')
print(soup.p.string)

Output:

hello
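
If lxml might not be installed on every machine, the parser can be chosen with a fallback. The following is a minimal sketch and not part of the original tutorial: the helper name make_soup is hypothetical, and it assumes that falling back to Python's built-in 'html.parser' is acceptable for the markup at hand.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml", fallback="html.parser"):
    """Parse markup with the preferred parser; use the fallback if it is missing.

    make_soup is a hypothetical helper for illustration, not a bs4 API.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(markup, fallback)


soup = make_soup("<p>hello</p>")
print(soup.p.string)
```

Different parsers can repair malformed markup in different ways, so the two parsers are only interchangeable for simple, well-formed input like this.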

3. Installing Beautiful Soup

Before using it, make sure both Beautiful Soup and lxml are installed. Both can be installed with pip from the command prompt (cmd):

pip install beautifulsoup4

pip install lxml
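
A quick way to confirm that the installation worked is to import both packages from Python. This is only a sketch; the variable name lxml_available is made up for the example.

```python
import bs4

# bs4 exposes its version string as bs4.__version__.
print("Beautiful Soup version:", bs4.__version__)

try:
    import lxml  # only needed when the 'lxml' parser is requested
    lxml_available = True
except ImportError:
    lxml_available = False

print("lxml available:", lxml_available)
```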

4. Basic usage

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())  # pretty-print the tree; the parser has already added the missing closing tags
print(soup.title.string)  # text content of the title node

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

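The code first declares the string variable html. Note that it is not a complete HTML document: the body and html tags are never closed. The string is passed as the first argument to BeautifulSoup, and the second argument names the parser ('lxml'). This initializes the BeautifulSoup object, which is assigned to the variable soup. From then on, the methods and attributes of soup can be used to parse the HTML.

① prettify() returns the parsed tree as a string with standard indentation. The missing closing tags in the output were added by the parser when the BeautifulSoup object was created, not by prettify() itself.

② soup.title.string returns the text content of the title node.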
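
To see that the repair happens at parse time rather than in prettify(), the tree can be serialized with str() as well. The sketch below uses a made-up fragment and the built-in 'html.parser' (an assumption, so that it runs without lxml; lxml would additionally wrap the fragment in html and body tags).

```python
from bs4 import BeautifulSoup

# The <b> and <p> tags are deliberately left unclosed.
broken = '<p class="title"><b>Unclosed'

# 'html.parser' is an assumption made so the sketch runs without lxml.
soup = BeautifulSoup(broken, "html.parser")

repaired = str(soup)      # serialized without any pretty-printing
print(repaired)
print(soup.prettify())    # same tree, one node per line
```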

5. Node selectors

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.title) # the title node
print(type(soup.title)) # type of the title node
print(soup.title.string) # text inside the title node
print(soup.head)  # the head node
print(soup.p) # the first p node only

Output:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

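Note: bs4.element.Tag is an important data structure in Beautiful Soup. Selecting a node by its name in this way returns a Tag object, and when several nodes share the name, only the first one is returned.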
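
Two consequences of selecting by name are worth checking in code: only the first matching node comes back, and a name that does not occur yields None rather than an error. A small sketch with a made-up fragment, parsed with 'html.parser' as an assumption:

```python
from bs4 import BeautifulSoup

html = "<div><p>first</p><p>second</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p        # only the first <p> is returned
missing = soup.table    # there is no <table>, so this is None

print(first_p.string)
print(missing)
```

Because a missing node is None, chaining such as soup.table.string raises AttributeError, so it is safer to test the result before going further.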

6. Extracting information

# The examples below all use this HTML text:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")

Getting the name

The name attribute gives the name of a node. Select the node, then read its name attribute:

print(soup.title.name)

Output:

title

Getting attributes

A node may have several attributes, such as class and id. After selecting a node, use attrs to get all of them:

print(soup.p.attrs)

Output:

{'class': ['title'], 'name': 'dromouse'}

attrs returns a dictionary that maps attribute names to values. To get a single value, index it by name:

print(soup.p.attrs['name'])

Output:

dromouse

There is also a shorter way to get an attribute value:

print(soup.p['class'])
print(soup.p['name'])

Output:

['title']
dromouse

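Note that class comes back as a list while name comes back as a string. An element can belong to several classes, so Beautiful Soup treats class as a multi-valued attribute and returns a list for it. The name attribute holds a single value, so the result is a single string. Keep this difference in mind when processing attributes.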
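
The list-versus-string difference can be handled as in the following sketch. The fragment and the variable names are made up for illustration, and 'html.parser' is assumed so that the example runs without lxml.

```python
from bs4 import BeautifulSoup

html = '<p class="story intro" id="p1">text</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

classes = p["class"]               # multi-valued attribute: a list
node_id = p["id"]                  # single-valued attribute: a string
missing = p.get("title")           # absent attribute: None instead of KeyError
class_string = " ".join(classes)   # back to the form used in the HTML source

print(classes, node_id, missing, class_string)
```

Indexing with p["title"] would raise KeyError here, which is why get() is the safer choice for optional attributes.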

Getting the content

As used earlier, the string attribute returns the text contained in a node:

print(soup.p.string)

Output:

The Dormouse's story

Nested selection

Each selection returns a bs4.element.Tag, and a Tag object can in turn be used to select the nodes inside it:

html = """<html><head><title>The Dormouse's story</title></head>"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

Output:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story

7. Selecting related nodes

Children and descendants

① The contents attribute

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)

Output:

['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']

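The result is a list. The p node contains both text and nodes, and all of them are returned together in the list.

Note that every element of the list is a direct child of the p node. The span node inside the first a node is a descendant rather than a child, so it is not listed separately. In other words, contents returns the list of direct children.

The children attribute gives the same nodes:

② The children attribute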

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(i,child)

Output:

<list_iterator object at 0x0000024B05CE8AC0>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4 
            and
            
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.
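
As the output shows, the direct children include whitespace-only text nodes. When only the element children are wanted, they can be filtered by type. This is a sketch on a shortened, made-up version of the fragment, parsed with 'html.parser' as an assumption:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<p class="story">
    Once upon a time
    <a id="link1"><span>Elsie</span></a>
    <a id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only Tag objects; text nodes (NavigableString) are skipped.
tag_children = [child for child in soup.p.children if isinstance(child, Tag)]
child_ids = [tag["id"] for tag in tag_children]
print(child_ids)
```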

③ The descendants attribute

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i,child)

Output:

<generator object Tag.descendants at 0x0000024B064982E0>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9 
            and
            
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
            and they lived at the bottom of a well.

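The result is a generator. Like the iterator returned by children, it has to be looped over to see the nodes. The for loop shows that the output now includes the span node, because descendants walks the tree recursively and yields every descendant, not just the direct children.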
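
The difference in depth between children and descendants is easy to check on a small tree. The following sketch uses a made-up fragment without whitespace and assumes 'html.parser':

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p><a id="link1"><span>Elsie</span></a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Tag names one level down versus all levels down, in document order.
child_names = [n.name for n in soup.p.children if isinstance(n, Tag)]
descendant_names = [n.name for n in soup.p.descendants if isinstance(n, Tag)]

print(child_names)
print(descendant_names)
```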

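Parents and ancestors

The parent attribute returns the direct parent of a node, and parents yields all of its ancestors. The example below uses parents: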

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.parents)))

Output:

[(0, <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>), (1, <body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body>), (2, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body></html>), (3, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body></html>)]
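
The last two entries in the output above look identical because the final ancestor is the BeautifulSoup object itself, which prints the same way as the html node. Printing names instead of whole nodes makes this visible. A sketch on a made-up fragment, assuming 'html.parser':

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

direct_parent = soup.a.parent.name                     # one level up
ancestor_names = [node.name for node in soup.a.parents]  # all the way to the root

print(direct_parent)
print(ancestor_names)
```

The name '[document]' is what the BeautifulSoup object reports for itself.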

Sibling nodes

The next_sibling and previous_sibling attributes

html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            Hello
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('next_sibling:',soup.a.next_sibling)
print('previous_sibling:',soup.a.previous_sibling)
print('next_siblings:',list(enumerate(soup.a.next_siblings)))
print('previous_siblings:',list(enumerate(soup.a.previous_siblings)))

Output:

next_sibling: 
            Hello
            
previous_sibling: 
            Once upon a time there were three little sisters; and their names were
            
next_siblings: [(0, '\n            Hello\n            '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, '\n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
previous_siblings: [(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]

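The code uses four attributes. next_sibling and previous_sibling return the next and the previous sibling of a node, while next_siblings and previous_siblings return all the siblings after it and all the siblings before it, respectively.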
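
Since the nearest sibling is often a text node, as in the output above, it can be convenient to jump straight to the next element sibling. A sketch using the find_next_sibling method on a made-up fragment, assuming 'html.parser':

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

next_node = first.next_sibling             # the text node " and "
next_tag = first.find_next_sibling("a")    # skips text, returns the next <a>
prev_node = first.previous_sibling         # nothing before the first <a>: None

print(repr(next_node), next_tag["id"], prev_node)
```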

Extracting information

The related nodes selected above can be used to extract the information we want:

html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Bob</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling:')
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print('-----------------------------')
print('parent:')
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])

Output:

Next Sibling:
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
-----------------------------
parent:
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">Bob</a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
['story']

8. Method selectors

find_all

As the name suggests, find_all returns every element that matches the given conditions.

find_all(name, attrs, recursive, text, **kwargs)

name

The name argument selects elements by node name:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))

Output:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

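The result is a list of length 2, and each element is a bs4.element.Tag. Because each element is a Tag, find_all can be called on it again. Next we loop over the li nodes of each ul and get their text: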

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar

attrs

Besides the node name, we can also query by attributes:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={"id":"list-1"}))
print(soup.find_all(attrs={"name":"elements"}))

Output:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

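The attrs argument is a dictionary.

Common attributes such as id and class can be passed directly as keyword arguments instead of going through attrs:

print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))   # class is a Python keyword, so the argument is spelled class_

Output: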

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
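
When a node has several classes, a class query matches if any one of them equals the value searched for. A small sketch on a shortened, made-up version of the fragment, assuming 'html.parser':

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# "list" is one of the classes of both <ul> nodes.
by_class = soup.find_all(class_="list")
# "list-small" appears only on the second <ul>.
by_small = soup.find_all("ul", attrs={"class": "list-small"})

ids_by_class = [ul["id"] for ul in by_class]
ids_by_small = [ul["id"] for ul in by_small]
print(ids_by_class, ids_by_small)
```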

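text

The text argument matches against the text of nodes. It can be either a string or a compiled regular expression object. (Since Beautiful Soup 4.4.0 this argument is named string; text is the older name.)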

import re
html='''
<div class="panel">
    <div class="panel-body">
        <a>Hello, this is a link</a>
        <a>Hello, this is a link, too</a>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text = re.compile('link')))
print(soup.find_all(text = re.compile('Hello')))

The result is a list of all the node texts that match the regular expression.
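
The same query can be written with the newer string keyword, which also shows the difference between a regular expression and a plain string: the pattern is searched for inside the text, while a plain string must equal the whole text. This sketch assumes Beautiful Soup 4.4.0 or later and 'html.parser':

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="panel-body">
    <a>Hello, this is a link</a>
    <a>Hello, this is a link, too</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

regex_matches = soup.find_all(string=re.compile("link"))          # pattern search
exact_matches = soup.find_all(string="Hello, this is a link")      # whole-text match

print(regex_matches)
print(exact_matches)
```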

find

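Besides find_all there is find, which also looks up elements that match the given conditions. The difference lies in what is returned: find returns a single element, the first match, whereas find_all returns a list of all matches. find takes the same arguments as find_all, so the examples are not repeated here.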
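
A short sketch of the difference, on a made-up fragment parsed with 'html.parser' (an assumption). It also shows that find returns None when nothing matches, so the result should be checked before use:

```python
from bs4 import BeautifulSoup

html = '<ul id="list-1"><li>Foo</li><li>Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")       # first match, a Tag
no_table = soup.find("table")    # no match: None
all_li = soup.find_all("li")     # every match, a list

# Guard against None before touching attributes of the result.
table_text = no_table.get_text() if no_table is not None else ""

print(first_li.string, no_table, len(all_li))
```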

9. CSS selectors

To use CSS selectors, call the select method and pass it a CSS selector string.

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
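print(soup.select('.panel .panel-heading'))  # nodes with class panel-heading inside a node with class panel
print(soup.select('ul li'))  # li nodes inside ul nodes
print(soup.select('#list-2 .element')) # nodes with class element inside the node whose id is list-2
print(type(soup.select("ul")[0]))  # type of the elements in the result list

Output: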

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
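
A few more selector forms, sketched on a shortened, made-up fragment and parsed with 'html.parser' (an assumption): select_one returns only the first match, chained class selectors require all the classes, and > restricts the match to direct children.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Baz</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_element = soup.select_one("#list-1 .element")   # first match, or None
small_lists = soup.select("ul.list.list-small")       # both classes required
direct_items = soup.select("#list-1 > li")            # direct children only

print(first_element.string, [ul["id"] for ul in small_lists], len(direct_items))
```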

Nested selection

select also supports nested selection, for example:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select("ul"):
    print(ul.select("li"))

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

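This prints, for each ul node, the list of all li nodes under it.

Getting attributes

Since the nodes are Tag objects, attributes can be read the same way as before. Here we get the id attribute of each ul node:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select("ul"):
    print(ul["id"])
    print(ul.attrs['id'])

Output: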

list-1
list-1
list-2
list-2

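Both indexing with the attribute name in square brackets and going through attrs return the attribute value.

Getting text

To get the text, we can use the string attribute from before or the get_text method. For nodes that contain only a single piece of text, like the li nodes here, the two give the same result:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for li in soup.select("li"):
    print("get_text:",li.get_text())
    print("string:",li.string)

Output: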

get_text: Foo
string: Foo
get_text: Bar
string: Bar
get_text: Jay
string: Jay
get_text: Foo
string: Foo
get_text: Bar
string: Bar
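
The two only agree for nodes like these, which hold a single piece of text. When a node has several children, string is None while get_text still joins all the text below the node. A sketch on a made-up fragment, assuming 'html.parser':

```python
from bs4 import BeautifulSoup

html = "<li>Foo<b>Bar</b></li>"
soup = BeautifulSoup(html, "html.parser")
li = soup.li

string_value = li.string                     # None: li has more than one child
text_value = li.get_text()                   # all descendant text joined
spaced_value = li.get_text(" ", strip=True)  # stripped pieces joined by a space

print(string_value, text_value, spaced_value)
```

For scraping real pages, get_text is therefore usually the safer choice when the inner structure of a node is not known in advance.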