1.Installation

pip install beautifulsoup4

or

python setup.py install (run inside the attached source package)

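2.Introduction

Parsers (markup is the HTML content to parse)

BeautifulSoup(markup, "html.parser") -> Python's built-in parser; use Python 2.7.3 or later

BeautifulSoup(markup, "lxml") -> fast

BeautifulSoup(markup, "html5lib") -> very tolerant of broken markup, but slow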
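Since lxml and html5lib are separate packages that may not be installed, one way to handle the choice is to try the parsers in order of preference and fall back to the built-in one. This is only a sketch: the helper name parse_with_first_available and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup, used only for this illustration.
markup = "<html><body><p>Hello, <b>soup</b></p></body></html>"


def parse_with_first_available(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Return (parser_name, soup) for the first parser that is installed."""
    for name in parsers:
        try:
            return name, BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Raised by Beautiful Soup when the requested parser is missing.
            continue
    raise RuntimeError("no usable parser found")


parser_name, soup = parse_with_first_available(markup)
print(parser_name, soup.b.string)
```

Because "html.parser" ships with Python, the loop always ends with a usable parser.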

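3.Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html holds the markup as a string

# soup = BeautifulSoup(open('index.html'), "html.parser")  # or parse an open file

print(soup.prettify())  # pretty-print the parsed document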
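The snippet above never defines html. A self-contained version might look like the sketch below; the markup is a stand-in assembled from the fragments that appear later in this post, not necessarily the exact document the examples were run against.

```python
from bs4 import BeautifulSoup

# Stand-in document (an assumption, pieced together from later examples).
html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# prettify() returns the tree as a string with one tag per line, indented.
pretty = soup.prettify()
print(pretty)
```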

4.Usage

1.TAG

html1 = "<head><title>The Dormouse's story</title></head>"

soup = BeautifulSoup(html1, "html.parser")

print(soup.title)

#<title>The Dormouse's story</title>  # only the first matching tag is returned

print(soup.head)

#<head><title>The Dormouse's story</title></head>

2.attrs

html2 = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'

soup = BeautifulSoup(html2, "html.parser")

print(soup.p.attrs)

#{'class': ['title'], 'name': 'dromouse'}

print(soup.p['class'])

#['title']

print(soup.p.get('class'))

#['title']
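The two access styles differ when an attribute is missing: indexing raises KeyError, while get() returns None. The sketch below also shows that tag.name is the tag's own name, which is easy to confuse with an attribute that happens to be called name. The markup is the same paragraph as above, rebuilt here so the block runs on its own.

```python
from bs4 import BeautifulSoup

html = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

classes = p["class"]        # class is multi-valued, so this is a list
name_attr = p["name"]       # the attribute called "name"
tag_name = p.name           # the tag's own name
missing = p.get("id")       # None, because <p> has no id attribute

try:
    p["id"]                 # indexing a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True

print(classes, name_attr, tag_name, missing, raised)
```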

3.string

print(soup.p.string)

#The Dormouse's story

4.Comment

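.string returns a comment's text without the <!-- --> markers, so check the type before printing

import bs4

html3 = '<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>'

soup = BeautifulSoup(html3, "html.parser")

if type(soup.a.string) == bs4.element.Comment:

    print(soup.a.string)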
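A slightly more idiomatic check uses isinstance. The sketch below assumes the link body holds an HTML comment, as the heading of this part suggests.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Assumption: the <a> tag contains only an HTML comment.
html = '<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>'
soup = BeautifulSoup(html, "html.parser")

node = soup.a.string
is_comment = isinstance(node, Comment)

# The comment text keeps its inner spaces but not the <!-- --> markers.
text = node.strip() if is_comment else None
print(is_comment, text)
```

Comment is a subclass of NavigableString, so without this check a comment would be printed as if it were ordinary page text.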

5.contents

print(soup.head.contents)  # the direct children, returned as a list

#[<title>The Dormouse's story</title>]

6.children

for child in soup.body.children:  # iterator over the direct children (not a list)

    print(child)

7.descendants

for child in soup.descendants:  # recursively visits every descendant node

    print(child)
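To see how contents, children, and descendants differ, the sketch below runs all three on the same small head element. The markup is a cut-down stand-in for the document used in this post.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")
head = soup.head

contents = head.contents              # list of direct children
children = list(head.children)        # same nodes, produced lazily
descendants = list(head.descendants)  # children, grandchildren, and so on

print(len(contents), len(children), len(descendants))
```

Here head has one child (the title tag) but two descendants, because the string inside title counts as a node too.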

8.stripped_strings

for string in soup.stripped_strings:  # every string under the node, with surrounding whitespace stripped and blank strings skipped

    print(repr(string))

# u"The Dormouse's story"

# u'Once upon a time there were three little sisters
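For comparison, the sketch below puts strings and stripped_strings side by side on a small hypothetical fragment that contains whitespace-only text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with extra whitespace, for illustration only.
html = "<p>\n  The Dormouse's story \n</p>\n<p>   </p><p> Once upon a time </p>"
soup = BeautifulSoup(html, "html.parser")

raw = list(soup.strings)             # includes whitespace-only strings
clean = list(soup.stripped_strings)  # stripped, blank strings dropped

print(raw)
print(clean)
```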

9.parent parents

p = soup.p  # take the first <p>, then look up its parent node

print(p.parent.name)

#body
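The plural form, parents, walks all the way up to the document root. A minimal sketch on a stand-in document:

```python
from bs4 import BeautifulSoup

html = "<html><body><p class='title'><b>The Dormouse's story</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

b = soup.b
direct_parent = b.parent.name

# .parents yields each ancestor in turn, ending with the BeautifulSoup
# object itself, whose name is "[document]".
ancestor_names = [parent.name for parent in b.parents]

print(direct_parent, ancestor_names)
```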

10.find_all( name , attrs , recursive , text , **kwargs )

1)name: which tags to match

A.Pass a string

soup.find_all('b')

# [<b>The Dormouse's story</b>]

B.Pass a regular expression

import re

for tag in soup.find_all(re.compile("^b")):

    print(tag.name)

# body

# b

C.Pass a list

soup.find_all(["a","b"])

# [<b>The Dormouse's story</b>,

#<a href="http://example.com/elsie" id="link1">Elsie</a>,

#<a href="http://example.com/lacie" id="link2">Lacie</a>]

D.Pass True

True matches every tag, but string nodes are not returned

for tag in soup.find_all(True):

    print(tag.name)

# html

# head

# title

# body

# p

E.Pass a function

def has_class_but_no_id(tag):

    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)

# [<p class="title"><b>The Dormouse's story</b></p>,

# <p class="story">Once upon a time there were...</p>,

# <p class="story">...</p>]
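The five kinds of name filter can be tried together on one small document. The markup below is a shortened stand-in (an assumption, not the post's full document), and only tag names are collected to keep the results easy to read.

```python
import re

from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story"><a class="sister" href="http://example.com/elsie" '
    'id="link1">Elsie</a></p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    """Filter function: keep tags that define class but not id."""
    return tag.has_attr("class") and not tag.has_attr("id")


by_string = [t.name for t in soup.find_all("b")]
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]
by_list = [t.name for t in soup.find_all(["a", "b"])]
by_true = [t.name for t in soup.find_all(True)]
by_func = [t.name for t in soup.find_all(has_class_but_no_id)]

for result in (by_string, by_regex, by_list, by_true, by_func):
    print(result)
```

Results come back in document order, which is why the list filter returns b before a here.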

2)keyword arguments

soup.find_all(id='link2')

# [<a href="http://example.com/lacie" id="link2">Lacie</a>]

soup.find_all(href=re.compile("elsie"))

# [<a href="http://example.com/elsie" id="link1">Elsie</a>]

soup.find_all(href=re.compile("elsie"), id='link1')

# [<a href="http://example.com/elsie" id="link1">Elsie</a>]

class is a reserved word in Python, so the keyword argument is written class_ (with a trailing underscore)

soup.find_all("a", class_="sister")

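Some attribute names cannot be used as keyword arguments, such as the HTML5 data-* attributes; pass those in an attrs dictionary instead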

data_soup.find_all(attrs={"data-foo": "value"})

# [<div data-foo="value">foo!</div>]
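The keyword forms above can be combined in one runnable sketch. The markup is a small stand-in that includes the data-foo div, since data_soup is not defined anywhere in this post.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    '<div data-foo="value">foo!</div>'
)
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="link2")
by_href = soup.find_all(href=re.compile("elsie"))
by_both = soup.find_all(href=re.compile("elsie"), id="link1")
by_class = soup.find_all("a", class_="sister")

# data-foo is not a valid Python identifier, so it goes through attrs.
by_data = soup.find_all(attrs={"data-foo": "value"})

print(by_id, by_href, by_both, by_class, by_data, sep="\n")
```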

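3)text argument

The text argument searches the string content of the document. Like name, it accepts a string, a regular expression, a list, or True. (Beautiful Soup 4.4.0 and later prefer the name string for this argument; text still works.)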

soup.find_all(text="Elsie")

# [u'Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])

# [u'Elsie', u'Lacie', u'Tillie']

soup.find_all(text=re.compile("Dormouse"))

# [u"The Dormouse's story", u"The Dormouse's story"]
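A runnable version of these searches is sketched below on a stand-in document. It uses string=, the name that newer Beautiful Soup releases prefer for this argument; on an older release, substitute text=.

```python
import re

from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    "<p><b>The Dormouse's story</b></p>"
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a></p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Elsie")
any_of = soup.find_all(string=["Tillie", "Elsie", "Lacie"])
pattern = soup.find_all(string=re.compile("Dormouse"))

print(exact, any_of, pattern, sep="\n")
```

The results are string nodes rather than tags, and they are returned in document order regardless of the order of the list that was passed in.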

4)limit argument

Once the number of matches reaches limit, the search stops and the results found so far are returned.

soup.find_all("a", limit=2)

# [<a href="http://example.com/elsie" id="link1">Elsie</a>,

# <a href="http://example.com/lacie" id="link2">Lacie</a>]

5)recursive argument

When find_all() is called on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.
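Both arguments are easy to check on a small document. In the sketch below (stand-in markup), title is a grandchild of html, so it is only found when the search is recursive.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# limit: stop after the first two matches.
first_two = soup.find_all("a", limit=2)

# recursive: the default search covers every descendant of <html>...
deep = soup.html.find_all("title")
# ...while recursive=False only looks at its direct children (head, body).
shallow = soup.html.find_all("title", recursive=False)

print(first_two, deep, shallow, sep="\n")
```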

11.find( name , attrs , recursive , text , **kwargs )

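find() behaves like find_all() with limit=1 and differs only in what it returns: find_all() returns a list (here with at most one element), whereas find() returns the matching tag itself, or None when nothing matches.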
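The difference in return values shows up most clearly when nothing matches. A minimal sketch, where nosuchtag is just a made-up tag name that does not occur in the document:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

as_list = soup.find_all("title", limit=1)  # a list holding one tag
as_tag = soup.find("title")                # the tag itself

no_list = soup.find_all("nosuchtag")       # empty list
no_tag = soup.find("nosuchtag")            # None

print(as_list, as_tag, no_list, no_tag, sep="\n")
```

Because find() can return None, chaining such as soup.find("nosuchtag").string raises AttributeError, so it is worth checking the result first.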

12.find_parents() find_parent()
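These two methods search upward through a node's ancestors and take the same filters as find_all() and find(). A minimal sketch on stand-in markup, starting from a string node:

```python
from bs4 import BeautifulSoup

html = (
    '<html><body><p class="story">'
    '<a class="sister" id="link2">Lacie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

start = soup.find(string="Lacie")  # begin at the text node inside <a>

a_parents = start.find_parents("a")                     # every <a> ancestor
p_parent = start.find_parent("p")                       # nearest <p> ancestor
no_match = start.find_parents("p", class_="title")      # no such ancestor

print(a_parents, p_parent, no_match, sep="\n")
```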

13.CSS selectors: select()

soup.select() returns a list

(1)Find by tag name

print(soup.select('title'))

#[<title>The Dormouse's story</title>]

print(soup.select('a'))

(2)Find by class name

print(soup.select('.sister'))

#[<a class="sister" href="http://example.com/elsie" id="link1"></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(3)Find by id

print(soup.select('#link1'))

#[<a class="sister" href="http://example.com/elsie" id="link1"></a>]

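(4)Combined selectors

Selectors combine exactly as they do in a CSS file, mixing tag names, class names, and ids. For example, to find the element whose id is link1 inside a p tag, write the two parts separated by a space.

print(soup.select('p #link1'))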

#[<a class="sister" href="http://example.com/elsie" id="link1"></a>]

(5)Direct child selectors

print(soup.select("head > title"))

#[<title>The Dormouse's story</title>]

(6)Find by attribute

print(soup.select('a[class="sister"]'))

(7)Wildcards

print(soup.select('a[class*="sister"]'))  # class contains "sister"

print(soup.select('a[class$="sister"]'))  # class ends with "sister"

print(soup.select('a[class^="sister"]'))  # class starts with "sister"
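The selector forms above can be checked together on a stand-in document. The helper ids is hypothetical and only collects id attributes so the results are short. The wildcard examples match on href instead of class so that each one selects a different link.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    '<p class="story">'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def ids(selector):
    """Hypothetical helper: id attributes of the tags a selector matches."""
    return [tag.get("id") for tag in soup.select(selector)]


by_class = ids(".sister")
by_id = ids("#link1")
nested = ids("p #link1")
child_title = soup.select("head > title")[0].string
by_attr = ids('a[class="sister"]')
contains = ids('a[href*="lacie"]')
ends_with = ids('a[href$="tillie"]')
starts_with = ids('a[href^="http://example.com/"]')

print(by_class, by_id, nested, child_title, sep="\n")
print(by_attr, contains, ends_with, starts_with, sep="\n")
```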
