Using Beautiful Soup

1. Introduction

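　　Simply put, Beautiful Soup is a Python library for parsing HTML and XML, and it makes extracting data from web pages convenient. It provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolbox that parses a document and hands the user the data they want to scrape.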
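
As a rough illustration of the three operations mentioned above (navigating, searching, and modifying the parse tree), here is a minimal sketch. The HTML fragment is a made-up example, and the built-in html.parser is used only so that the snippet has no dependency beyond Beautiful Soup itself.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only for illustration.
html = (
    '<p class="story">'
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Navigating: go from the first <a> tag up to its parent tag.
first_link = soup.a
parent_name = first_link.parent.name

# Searching: collect the href attribute of every <a> tag.
hrefs = [a["href"] for a in soup.find_all("a")]

# Modifying: replace the text inside the first <a> tag.
first_link.string = "Elsie (edited)"

print(parent_name)
print(hrefs)
print(soup.a.string)
```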

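　　Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only need to state the original encoding.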
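
A small sketch of what stating the original encoding can look like, assuming the input arrives as bytes without a declared encoding. The Latin-1 sample bytes are invented for this example; from_encoding is the constructor argument that names the source encoding, and the built-in html.parser is chosen here only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical input: Latin-1 bytes with no encoding declaration.
raw = "<p>café</p>".encode("latin-1")

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

# Inside the tree, text is held as Unicode strings.
text = soup.p.string

# Encoding the document for output yields UTF-8 bytes.
encoded = soup.encode("utf-8")

print(text)
print(encoded)
```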

2. Preparation

Installing Beautiful Soup

a. Related links

　　Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

　　Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh

　　PyPI: https://pypi.python.org/pypi/beautifulsoup4

b. Installing with pip3

　　pip3 install beautifulsoup4

c. Installing from a wheel

　　Download the wheel file from PyPI.

　　Then install the wheel file with pip.
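
Whichever installation route is used, one quick way to check that it worked is to import the package and print its version. This is only a sanity check, not part of the installation itself; note that the package is installed as beautifulsoup4 but imported as bs4.

```python
# Importing bs4 fails with ImportError if the installation did not succeed.
import bs4

# The installed version is exposed as a string.
print(bs4.__version__)
```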

3. Using Beautiful Soup

1. Basic usage

from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>The Beautiful Suop</title>
</head>
<body>
<p class="title" name="dromouse"><b>The story</b></p>
<p class="story" >once upon a time there were three title sisters;and their name were
<a href="http://example.com/elsie" class="sister" id="link1">Elise</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
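# Parse the HTML string with the lxml parser (the lxml package must be installed)
soup = BeautifulSoup(html, 'lxml')
# Print the parsed tree with standard indentation
print(soup.prettify())
# Print the text inside the title node
print(soup.title.string)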

The output is as follows:

<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   The Beautiful Suop
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The story
   </b>
  </p>
  <p class="story">
   once upon a time there were three title sisters;and their name were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elise
   </a>
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
    and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Beautiful Suop

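　　Here we first declare a variable html, which holds an HTML string. Note, however, that it is not a complete HTML string: the body and html nodes are not closed. We then pass it as the first argument to the BeautifulSoup constructor; the second argument is the parser type (lxml here). This completes the initialization of the BeautifulSoup object, which is assigned to the variable soup. From then on we can call the methods and attributes of soup to parse this HTML.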

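　　First, we call the prettify() method, which outputs the parsed string with standard indentation. Note that the output contains the closing body and html tags; in other words, Beautiful Soup automatically corrects non-standard HTML. This correction is not done by prettify(); it has already happened during initialization.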
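
To see that the repair happens at initialization and not inside prettify(), one can convert the freshly built object to a string without calling prettify() at all. The sketch below uses a made-up three-tag fragment and the built-in html.parser (an assumption chosen to avoid depending on lxml); the exact repairs can differ from one parser to another.

```python
from bs4 import BeautifulSoup

# Hypothetical broken input: none of the three tags is closed.
broken = "<html><body><p>Hello"

# The tree is built, and the open tags are closed, during construction.
soup = BeautifulSoup(broken, "html.parser")

# str() serializes the tree without any pretty-printing.
repaired = str(soup)
print(repaired)
```

The printed string already ends with the closing p, body, and html tags, even though prettify() was never called.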

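　　Next we call soup.title.string, which outputs the text content of the title node in the HTML. In other words, soup.title selects the title node from the HTML, and its string attribute returns the text inside it.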
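
The two steps can also be looked at separately. The sketch below uses a shortened, made-up version of the sample document and the built-in html.parser (an assumption, so that lxml is not required) to show that soup.title is a tag object, while its string attribute is the text inside that tag.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the sample document.
html = (
    "<html><head><title>The Beautiful Soup</title></head>"
    '<body><p class="title"><b>The story</b></p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Step 1: attribute access selects the first tag with that name.
title_tag = soup.title
tag_type = type(title_tag).__name__   # "Tag"
tag_name = title_tag.name             # "title"

# Step 2: the string attribute returns the text inside the tag.
title_text = title_tag.string

# If a tag's only child is another tag, string looks inside that child.
p_text = soup.p.string

print(tag_type, tag_name)
print(title_text)
print(p_text)
```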
