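Writing a web scraper always involves parsing and analyzing data. Python's BeautifulSoup module is an excellent HTML parser, and this post records the main functions of bs4.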

Install

install bs4

pip3 install beautifulsoup4

install lxml parser

pip3 install lxml

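Installing the lxml parser may fail with an xmlCheckVersion error. If that happens, download the lxml .whl file that matches your platform and Python version, then install it from the wheel.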

get html

First, fetch an HTML page with the requests library, or use a local static HTML page, and parse it with bs4.

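from bs4 import BeautifulSoup
import requests

# The parser name is a string, so it must be quoted.
soup = BeautifulSoup(html_doc, 'lxml')
# or
url = "http://www.xxx.com"  # requests needs the scheme (http:// or https://)
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')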
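For the local-file case, here is a minimal sketch showing that BeautifulSoup also accepts an open file handle. The file name page.html and its contents are made up for illustration, and Python's built-in html.parser is used instead of lxml so the snippet runs without the extra dependency.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# A tiny stand-in page; both the markup and the file name are hypothetical.
page = "<html><head><title>Local page</title></head><body><p>Hello</p></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "page.html"
    path.write_text(page, encoding="utf-8")
    # BeautifulSoup accepts an open file handle as well as a string.
    with path.open(encoding="utf-8") as fp:
        soup = BeautifulSoup(fp, "html.parser")

print(soup.title.string)  # Local page
print(soup.p.string)      # Hello
```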

parsing function

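To get started quickly, let's look at the parsing functions bs4 offers. The most commonly used ones are listed here, applied to the following HTML: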

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

Parse it with bs4:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
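Putting a few of these calls together, the sketch below collects the text and href of every link into a list of pairs. It reuses only the link paragraph of the sample document and assumes the built-in html.parser rather than lxml.

```python
from bs4 import BeautifulSoup

html_doc = """<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Collect (text, href) pairs; .get() returns None instead of raising
# a KeyError when a tag has no href attribute.
links = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]

for text, href in links:
    print(f"{text}: {href}")
```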

Four types of objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.

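Tag

Expression                 Result
tag                        <b class="boldest">Extremely bold</b>
type(tag)                  <class 'bs4.element.Tag'>
tag.name                   'b'
tag.name = "blockquote"    tag becomes <blockquote class="boldest">Extremely bold</blockquote>

Expression                 Result
tag['id']                  'boldest' (for a tag that has id="boldest")
tag['attribute'] = 1       sets an attribute on the tag
del tag['id']              removes the id attribute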
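As a runnable version of the operations above, here is a small sketch. The markup, the two class values and the data-level attribute are invented for the example, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b id="boldest" class="big bold">Extremely bold</b>', "html.parser")
tag = soup.b

print(tag.name)      # b
print(tag["id"])     # boldest
print(tag["class"])  # ['big', 'bold']  (class is multi-valued, so it is a list)

tag["data-level"] = "1"  # add or overwrite an attribute
del tag["id"]            # remove an attribute
print(tag.get("id"))     # None  (.get() does not raise for a missing attribute)

tag.name = "blockquote"  # renaming changes the generated markup
print(tag)
```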

NavigableString

A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

Sorry for the inconsistent formatting. Large code blocks are easier to follow, so they are used from here on.

tag 
# <blockquote>Extremely bold</blockquote>
tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>
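One detail worth knowing is that NavigableString is a subclass of Python's str, so it can be compared with ordinary strings and converted with str(). A minimal sketch, assuming html.parser:

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<blockquote>Extremely bold</blockquote>", "html.parser")
tag = soup.blockquote

text = tag.string
print(isinstance(text, NavigableString))  # True
print(isinstance(text, str))              # True

# str() gives a plain string that no longer holds a reference to the parse tree.
plain = str(text)
print(type(plain) is str)                 # True

# A string cannot be edited in place, but it can be swapped for another one.
text.replace_with("No longer bold")
print(tag)  # <blockquote>No longer bold</blockquote>
```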

NavigableString can be a nuisance: it lacks most of the attributes and methods that Tag has, and it sometimes turns up among the nodes you iterate over, seemingly at random (it is not random, of course; it is usually the whitespace between tags). A simple way to skip these strings is given later in this article.

BeautifulSoup

The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in the "Navigating the tree" and "Searching the tree" sections of the official documentation.

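Of course, since the BeautifulSoup object does not correspond to an actual tag, it has no real name or attributes; its .name is the special value '[document]'.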
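The sketch below looks at the BeautifulSoup object itself and at the fourth kind of object, Comment, which was named in the list above but not shown. The markup follows the comment example in the official documentation, and html.parser is assumed.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<b><!--Hey, buddy. Want to buy a used parser?--></b>", "html.parser")

# The document object has a placeholder name instead of a real tag name.
print(soup.name)  # [document]

# A comment is a special kind of NavigableString.
comment = soup.b.string
print(isinstance(comment, Comment))  # True
print(comment)  # Hey, buddy. Want to buy a used parser?
```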

navigate and iterate

using tag name

Accessing a tag name as an attribute returns only the first tag with that name; use find_all() to get all of them.

soup.head
# <head><title>The Dormouse's story</title></head>

soup.head.title  # the same tag that soup.title returns
# <title>The Dormouse's story</title>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Special attribute

A tag’s direct children are available in a list called .contents:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag.contents
# ["The Dormouse's story"]

.descendants goes further and iterates over all of a tag's descendants: its children, its grandchildren, and so on:

for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
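To make the difference concrete, this sketch compares .contents, .children and .descendants on the same head tag. It parses only the head of the sample document and assumes html.parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

direct = head_tag.contents               # list of direct children
children = list(head_tag.children)       # the same nodes, produced by an iterator
everything = list(head_tag.descendants)  # children, grandchildren, and so on

print(len(direct))      # 1  (only the <title> tag)
print(len(everything))  # 2  (the <title> tag and the string inside it)
```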

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

title_tag.string  # the same node as title_tag.contents[0]
# "The Dormouse's story"
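When a tag has more than one child, .string is None, which is a common source of confusion. The sketch below shows that case and two alternatives, .stripped_strings and get_text(). The p element is an invented example, and html.parser is assumed.

```python
from bs4 import BeautifulSoup

markup = "<head><title>The Dormouse's story</title></head><p>Hello <b>world</b></p>"
soup = BeautifulSoup(markup, "html.parser")

# <head> has a single child (<title>) which itself has a single string,
# so .string is passed up from the child.
print(soup.head.string)  # The Dormouse's story

# <p> has two children (a string and a <b> tag), so .string is None.
print(soup.p.string)     # None

# To collect all the text pieces, use .stripped_strings or get_text().
pieces = list(soup.p.stripped_strings)
print(pieces)             # ['Hello', 'world']
print(soup.p.get_text())  # Hello world
```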

going up

title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>
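Besides .parent, a tag also has .parents, which walks all the way up to the document object. A short sketch on a trimmed-down version of the sample document, assuming html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>", "html.parser"
)
title_tag = soup.title

# .parents yields every ancestor, ending with the BeautifulSoup object itself.
ancestors = [parent.name for parent in title_tag.parents]
print(ancestors)    # ['head', 'html', '[document]']

# The document object is the root, so it has no parent.
print(soup.parent)  # None
```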

going sideways
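This section is empty in the original notes. Assuming it refers to sibling navigation, .next_sibling and .previous_sibling move between nodes that share the same parent. The markup below follows the small example in the official documentation, with html.parser assumed.

```python
from bs4 import BeautifulSoup

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")

b_tag = sibling_soup.b
c_tag = sibling_soup.c

print(b_tag.next_sibling)      # <c>text2</c>
print(c_tag.previous_sibling)  # <b>text1</b>

# At either end of the sibling list the attribute is None.
print(b_tag.previous_sibling)  # None
print(c_tag.next_sibling)      # None
```

In real documents the sibling of a tag is often a whitespace string rather than the next tag, so check the type before using it.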

Searching the tree

Signature: find_all(name, attrs, recursive, string, limit, **kwargs)
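To show what each parameter in the signature does, here is one call per argument on a shortened copy of the sample document. This is only a sketch, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

by_name = soup.find_all("a")                              # name: match on tag name
by_attrs = soup.find_all("a", attrs={"class": "sister"})  # attrs: match on an attribute dict
by_kwarg = soup.find_all(id="link2")                      # **kwargs: attribute as keyword
by_string = soup.find_all(string="Elsie")                 # string: match text nodes
first_two = soup.find_all("a", limit=2)                   # limit: stop after N results
# recursive=False looks only at direct children; <title> sits inside <head>,
# so it is not a direct child of <html>.
shallow = soup.html.find_all("title", recursive=False)

print(len(by_name), len(by_attrs), len(first_two))  # 3 3 2
print(by_string)                                    # ['Elsie']
print(shallow)                                      # []
```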

filter

To check whether a tag has a given attribute, use has_attr():

if info.has_attr('property') and not info.has_attr('content'):
    print(info)  # info is whichever Tag is being inspected
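The same test can be passed to find_all() as a filter function, which is called once per tag and keeps the tags for which it returns True. The meta tags and the function name below are hypothetical, chosen only to exercise the property/content check, and html.parser is assumed.

```python
from bs4 import BeautifulSoup

html_doc = """<head>
<meta property="og:title" content="Example">
<meta property="og:image">
<meta name="viewport" content="width=device-width">
</head>"""
soup = BeautifulSoup(html_doc, "html.parser")


def property_without_content(tag):
    """Keep tags that have a 'property' attribute but no 'content' attribute."""
    return tag.has_attr("property") and not tag.has_attr("content")


matches = soup.find_all(property_without_content)
print([m["property"] for m in matches])  # ['og:image']
```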

some caveats

NavigableString nodes get in the way when iterating over a tag's children. We can ignore them like this:

from bs4 import BeautifulSoup, NavigableString, Tag
for _minisite in minisite_list:
    if isinstance(_minisite, NavigableString):
        continue
    if isinstance(_minisite, Tag):
        print(_minisite)
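To see where those stray strings come from, the sketch below parses a small list (invented for the example) in which the newlines between tags become NavigableString children. It then keeps only the Tag children in two equivalent ways, with html.parser assumed.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<ul>\n<li>one</li>\n<li>two</li>\n</ul>", "html.parser")
ul = soup.ul

# Three newline strings plus two <li> tags.
print(len(ul.contents))  # 5

# Option 1: filter by type while iterating.
tags_only = [child for child in ul.children if isinstance(child, Tag)]

# Option 2: let find_all do the filtering; True matches any tag, and
# recursive=False restricts the search to direct children.
same = ul.find_all(True, recursive=False)

print([t.get_text() for t in tags_only])  # ['one', 'two']
```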

A quick start to Python's bs4 module