Python Web Scraping ---- Beautiful Soup (Part 1)

A First Look at Beautiful Soup

Suppose we have the following HTML document:

  html_doc = """
  <html><head><title>The Dormouse's story</title></head>
  <body>
  <p class="title"><b>The Dormouse's story</b></p>
  <b><!--Hey, buddy. Want to buy a used parser?--></b>
  <p class="story">Once upon a time there were three little sisters; and their names were
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
  and they lived at the bottom of a well.</p>

  <p class="story">...</p>
"""

To parse this document, first turn it into a BeautifulSoup object:

  from bs4 import BeautifulSoup
  soup = BeautifulSoup(html_doc,'lxml')

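1. The Kinds of Objects in Beautiful Soup

Beautiful Soup turns a document into a tree of Python objects, which fall into four kinds: Tag, NavigableString, BeautifulSoup and Comment. Each is described below.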

Tag

A Tag has two important properties: name and attributes.
name: read it with dot access (tag.name); it can be changed by assignment.

  soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
  tag = soup.b
  tag.name  # 'b'

  tag.name = "blockquote"  # renaming a tag is reflected in any HTML generated from this BeautifulSoup object
  tag  # <blockquote class="boldest">Extremely bold</blockquote>

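attributes: a Tag can have any number of attributes. The tag <b class="boldest"> has one attribute, "class", whose value is "boldest". A tag's attributes are read and modified the same way as a dictionary: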

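  tag['class']  # ['boldest']  (class is multi-valued, so the value is a list)
  tag.get('class')  # ['boldest']
  tag.attrs  # {'class': ['boldest']}  all attributes of the tag
  tag['class'] = 'verybold'  # assign
  del tag['class']  # delete
  # HTML has multi-valued attributes (a document parsed as XML does not)
  css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
  css_soup.p['class']  # ['body', 'strikeout']
  rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'lxml')  # illustrative markup
  rel_soup.a['rel'] = ['index', 'contents']  # assign a multi-valued attribute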
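
The dictionary-style access described above can be sketched as a small standalone script. The markup and variable names are made up for illustration, and html.parser is assumed here only so the sketch runs without lxml installed; the results should be the same with 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not taken from the article's html_doc.
soup = BeautifulSoup(
    '<a href="/home" class="nav main" id="top">Home</a>', "html.parser"
)
link = soup.a

href = link["href"]            # single-valued attribute: a plain string
classes = link["class"]        # multi-valued attribute: a list of strings
missing = link.get("title")    # .get() returns None for a missing attribute
has_id = link.has_attr("id")   # membership test without raising

try:
    link["title"]              # indexing a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True

print(href, classes, missing, has_id, raised)
```

When it is not known in advance whether an attribute is present, get() is usually the safer choice in scraping code, since indexing fails on the first tag that lacks the attribute.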

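NavigableString (a navigable string)

A NavigableString behaves like a Python Unicode string (str in Python 3) and also supports some of the features used for navigating and searching the tree.
Calling str() (unicode() in Python 2) converts a NavigableString directly into a plain string: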

  tag.string  # the string inside the tag
  type(tag.string)  # <class 'bs4.element.NavigableString'>
  unicode_string = str(tag.string)
  type(unicode_string)  # <class 'str'>  the type has changed

The string inside a Tag cannot be edited in place, but it can be replaced with another string using replace_with():

  tag.string.replace_with("No longer bold")
  tag  # <blockquote>No longer bold</blockquote>

BeautifulSoup

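A BeautifulSoup object represents the whole document. Most of the time you can treat it as a Tag: it supports most of the methods used for navigating and searching the tree.
Because a BeautifulSoup object does not correspond to a real HTML or XML tag, it has no real name or attributes. It is still sometimes convenient to look at its name, so it is given the special name "[document]":

  soup.name  # '[document]'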
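
As a minimal sketch of treating the document object like a Tag, the snippet below searches directly on soup and compares its name with that of a real tag. The two-paragraph markup is invented for the example, and html.parser is an assumption made so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document.
soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

doc_name = soup.name    # special name of the document object
tag_name = soup.p.name  # name of an ordinary tag

# Tag-style searching works on the document object itself.
paragraphs = [str(p.string) for p in soup.find_all("p")]

print(doc_name, tag_name, paragraphs)
```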

Comment (comments and other special strings)

A Comment object is a special kind of NavigableString:

  markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
  soup = BeautifulSoup(markup, 'lxml')
  comment = soup.b.string
  type(comment)  # <class 'bs4.element.Comment'>
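
Because a Comment is also a NavigableString, scraping code may want to tell the two apart before treating a string as visible text. The following is a sketch of one way to do that with isinstance; the markup is hypothetical, and html.parser is assumed so that it runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: one visible string and one comment inside <p>.
markup = "<p>Visible text<!-- hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

comments = []
texts = []
for node in soup.p.contents:       # direct children of <p>
    if isinstance(node, Comment):  # check Comment first: it subclasses NavigableString
        comments.append(str(node))
    else:
        texts.append(str(node))

print(comments, texts)
```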

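2. Working with the Parsed Document

Using the HTML given at the start as an example, the document can be queried as follows (the output is shown after '#'):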

  soup = BeautifulSoup(html_doc,'lxml')
  soup.title  # <title>The Dormouse's story</title>
  soup.title.name  # 'title'
  soup.title.string  # "The Dormouse's story"
  soup.title.parent.name  # 'head'
  soup.p  # <p class="title"><b>The Dormouse's story</b></p>
  soup.p['class']  # ['title']
  soup.a 
  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
  soup.find_all('a')  
  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
  soup.find(id="link3")
  # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

  # Find the links of all <a> tags in the document:
  for link in soup.find_all('a'):
      print(link.get('href'))
      # http://example.com/elsie
      # http://example.com/lacie
      # http://example.com/tillie

  # Get all the text in the document:
  print(soup.get_text())
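
The two loops above can be combined into a small sketch that maps each link's text to its URL and collects the text with whitespace cleaned up. The shortened markup and the names links and text are illustrative, and html.parser is assumed so that the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the article's example document.
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Map link text to its href for every <a> tag.
links = {a.get_text(): a.get("href") for a in soup.find_all("a")}

# Strip each piece of text, drop the empty ones, and join the rest with a space.
text = soup.get_text(" ", strip=True)

print(links)
print(text)
```

Note that the separator is inserted between every pair of strings, so the comma that follows the first link ends up with a space before it.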

More complex parsing requires navigating and searching the document tree, which will be covered later.