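Learning web scraping inevitably involves parsing and analyzing data. Python's BeautifulSoup module is an excellent HTML parser, and this post records the main functions of bs4.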
Install
install bs4
pip3 install beautifulsoup4
install lxml parser
pip3 install lxml
Installing the lxml parser may fail with an xmlCheckVersion error. In that case, download the matching lxml .whl file and install from the wheel instead.
get html
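First get an HTML page with the requests library, or use a local static HTML page, and parse it with bs4:
from bs4 import BeautifulSoup
import requests

# html_doc is a string holding the HTML; the parser name must be passed as a string
soup = BeautifulSoup(html_doc, 'lxml')
# or
url = "http://www.xxx.com"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')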
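As a minimal sketch of the step above, the following parses an in-memory string instead of a downloaded page, so it runs without network access. The sample HTML is made up for illustration, and it uses Python's built-in "html.parser" rather than lxml so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# A small in-memory document stands in for a downloaded page (illustrative data).
html_doc = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# The parser is selected by name, as a string. "html.parser" ships with Python;
# "lxml" can be used the same way once the lxml package is installed.
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string      # text inside <title>
paragraph_text = soup.p.get_text()  # text inside the first <p>
print(title_text, paragraph_text)
```

For a real page, `r.text` from `requests.get(url)` would take the place of `html_doc`.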
parsing function
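To get started quickly, let's look at the parsing functions bs4 offers. The most commonly used ones are listed here, using the following HTML as an example: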
print(soup.prettify())
# <html>
# <head>
# <title>
# The Dormouse's story
# </title>
# </head>
# <body>
# <p class="title">
# <b>
# The Dormouse's story
# </b>
# </p>
# <p class="story">
# Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">
# Elsie
# </a>
# ,
# <a class="sister" href="http://example.com/lacie" id="link2">
# Lacie
# </a>
# and
# <a class="sister" href="http://example.com/tillie" id="link3">
# Tillie
# </a>
# ; and they lived at the bottom of a well.
# </p>
# <p class="story">
# ...
# </p>
# </body>
# </html>
Parse it with bs4:
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# 'title'
soup.title.string
# "The Dormouse's story"
soup.title.parent.name
# 'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# ['title']
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
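The calls above can be combined. As a small sketch, the snippet below rebuilds part of the sample document and collects every link's text and href into a dictionary. The variable names are my own, and "html.parser" is used only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Map each link's visible text to its href attribute.
links = {a.get_text(): a["href"] for a in soup.find_all("a")}

# "class" is a multi-valued attribute, so bs4 returns a list of class names.
classes = soup.p["class"]

for text, href in links.items():
    print(text, href)
```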
Four types of objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
tag
Expression                 Result
tag                        <b class="boldest">Extremely bold</b>
type(tag)                  <class 'bs4.element.Tag'>
tag.name                   'b'
tag.name = "blockquote"    <blockquote class="boldest">Extremely bold</blockquote>
tag                        <b id="boldest"></b>
tag['id']                  'boldest'
tag['attribute'] = 1       <b id="boldest" attribute="1"></b>
del tag['id']              <b attribute="1"></b>
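The operations in the table can be tried end to end. This is a minimal sketch: the attribute name `data-level` is made up for illustration, and "html.parser" is used so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b id="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

# Renaming a tag changes the markup that bs4 generates for it.
tag.name = "blockquote"

# Attributes behave like dictionary entries.
tag["data-level"] = "1"   # hypothetical attribute, for illustration only
del tag["id"]

has_id = tag.has_attr("id")   # False after the deletion
missing = tag.get("id")       # .get() returns None instead of raising KeyError
rendered = str(tag)
print(rendered)
```

Using `tag.get(...)` rather than `tag[...]` is a reasonable default when an attribute may be absent.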
NavigableString
A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:
Sorry for the inconsistent formatting; a large code block is more intuitive here:
tag
# <blockquote>Extremely bold</blockquote>
tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>
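Here is a runnable version of the same idea, as a sketch. It also shows converting a NavigableString to a plain Python string with `str()`, which is handy when the text should outlive the parse tree.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<blockquote>Extremely bold</blockquote>", "html.parser")
tag = soup.blockquote

# The text inside the tag is a NavigableString, a subclass of str.
is_navigable = isinstance(tag.string, NavigableString)

# str() gives a plain string that no longer references the tree.
plain = str(tag.string)

# Strings cannot be edited in place, but they can be swapped out.
tag.string.replace_with("No longer bold")
rendered = str(tag)
print(plain, "->", rendered)
```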
NavigableString is awkward to work with: it lacks most of the attributes and methods that Tag has. It also shows up in the search tree in places you may not expect (typically as the whitespace between tags), so I provide a simple way to skip these objects later in this article.
BeautifulSoup
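The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.
Of course, since the BeautifulSoup object does not correspond to an actual tag, it has no real name or attributes; its .name is the special value '[document]'.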
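A short sketch of this: the BeautifulSoup object can be searched like a Tag, while its `.name` is a placeholder. The sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello</p>", "html.parser")

# The document object is not a real tag, so it carries a placeholder name.
document_name = soup.name

# It still supports the same search methods as a Tag.
first_p = soup.find("p")
print(document_name, first_p)
```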
navigate and iterate
using tag name
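Accessing a tag by name as an attribute returns only the first tag with that name:
soup.head
# <head><title>The Dormouse's story</title></head>
soup.head.title  # the same tag as soup.title
# <title>The Dormouse's story</title>
# To get every tag with a given name, use find_all():
soup.find_all('a')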
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Special attribute
A tag’s direct children are available in a list called .contents:
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag.contents
# ["The Dormouse's story"]
.descendants iterates recursively over everything below a tag: its children, grandchildren, and so on:
for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
If a tag has only one child, and that child is a NavigableString, the child is made available as .string:
title_tag.string  # the same object as title_tag.contents[0]
# "The Dormouse's story"
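To make the difference between `.contents`, `.descendants`, and `.string` concrete, here is a small sketch on the head of the sample document. Variable names are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html_doc, "html.parser")
head_tag = soup.head
title_tag = head_tag.title

# .contents lists direct children only: here just the <title> tag.
direct_children = head_tag.contents

# .descendants walks the whole subtree: the <title> tag, then its text.
all_descendants = list(head_tag.descendants)

# .string works on <title> (a single string child) and also on <head>,
# because <head> has exactly one child tag which itself has a .string.
title_text = title_tag.string
head_text = head_tag.string
print(len(direct_children), len(all_descendants), title_text)
```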
going up
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>
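Besides `.parent`, bs4 also offers `.parents`, which walks all the way up to the document object. A minimal sketch, with made-up sample markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>T</title></head></html>", "html.parser")
title_tag = soup.title

# .parents yields each ancestor in turn, ending at the BeautifulSoup object.
parent_names = [parent.name for parent in title_tag.parents]

# The document object itself has no parent.
top_parent = soup.parent
print(parent_names, top_parent)
```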
going sideways
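This section was left empty in the original notes. As a hedged sketch of what sideways navigation looks like, the snippet below uses `.next_sibling`, `.previous_sibling`, and `find_next_sibling()`; the sample markup is made up. Note that the nearest sibling is often a whitespace string rather than a tag.

```python
from bs4 import BeautifulSoup

html_doc = '<p><a id="link1">Elsie</a>\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")
first_link = soup.a

# The immediate next sibling is the newline between the two <a> tags.
raw_next = first_link.next_sibling

# find_next_sibling() with a tag name skips over strings.
next_link = first_link.find_next_sibling("a")

# The first child of <p> has nothing before it.
before_first = first_link.previous_sibling
print(repr(raw_next), next_link["id"], before_first)
```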
Searching the tree
Signature: find_all(name, attrs, recursive, string, limit, **kwargs)
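The snippet below is a sketch of how the individual parameters in that signature behave, using a trimmed copy of the sample document. The variable names are my own.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="story">'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, '
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and '
    '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>'
    "</p>"
)
soup = BeautifulSoup(html_doc, "html.parser")

by_name = soup.find_all("a")                      # name: match on tag name
by_class = soup.find_all("a", class_="sister")    # **kwargs: class_ filters on CSS class
by_attrs = soup.find_all(attrs={"id": "link2"})   # attrs: dictionary of attribute filters
by_string = soup.find_all(string="Elsie")         # string: match text nodes, not tags
limited = soup.find_all("a", limit=2)             # limit: stop after two matches
shallow = soup.find_all("a", recursive=False)     # recursive=False: direct children only

print(len(by_name), len(limited), len(shallow))
```

`shallow` is empty here because the links are children of `<p>`, not direct children of the document.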
filter
Check whether a tag has a given attribute with has_attr():
if info.has_attr('property') and not info.has_attr('content'):
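The same condition can be passed to `find_all()` as a filter function. This is a sketch under assumptions: the `<meta>` tags and the function name `property_without_content` are made up for illustration.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<meta property="og:title" content="Demo">'
    '<meta property="og:broken">'
    '<meta name="author" content="someone">'
)
soup = BeautifulSoup(html_doc, "html.parser")


def property_without_content(tag):
    """Return True for tags that have a 'property' attribute but no 'content'."""
    return tag.has_attr("property") and not tag.has_attr("content")


# find_all() calls the function once per tag and keeps the tags it accepts.
matches = soup.find_all(property_without_content)
matched_properties = [tag["property"] for tag in matches]
print(matched_properties)
```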
some caveats
NavigableString objects often get in the way. We can skip them like this:
from bs4 import BeautifulSoup, NavigableString, Tag
# minisite_list is assumed to be an iterable of nodes, e.g. some_tag.children
for _minisite in minisite_list:
    if isinstance(_minisite, NavigableString):
        continue
    if isinstance(_minisite, Tag):
        print(_minisite)
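To see why the check is needed, here is a self-contained sketch on a small list. The markup is made up, and the newlines between the `<li>` tags are what end up as NavigableString children.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html_doc = "<ul>\n<li>one</li>\n<li>two</li>\n</ul>"
soup = BeautifulSoup(html_doc, "html.parser")

# Direct children alternate between whitespace strings and <li> tags.
children = list(soup.ul.children)
string_count = sum(isinstance(child, NavigableString) for child in children)

# Keeping only Tag instances drops the whitespace nodes.
tags_only = [child for child in children if isinstance(child, Tag)]
items = [tag.get_text() for tag in tags_only]

# An alternative that never yields strings: search direct children by tag name.
same_items = [li.get_text() for li in soup.ul.find_all("li", recursive=False)]
print(len(children), items)
```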