Notes on Beautiful Soup select() syntax

Suppose we have the following HTML document:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
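"""

from bs4 import BeautifulSoup

# Parse the sample document; the select() examples on this document query this soup object.
soup = BeautifulSoup(html, "html.parser")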

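In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used here to filter elements, through soup.select(), which returns a list of the matching tags.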
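As a quick orientation, the three basic selector forms can be sketched on a tiny document. The markup and the variable names below are made up for this illustration and are not part of the sample above.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened document used only for this illustration.
sample = '<p class="title" id="intro"><b>Hello</b></p><p class="story">World</p>'
soup = BeautifulSoup(sample, "html.parser")

by_tag = soup.select("p")          # bare tag name
by_class = soup.select(".story")   # class name, prefixed with a dot
by_id = soup.select("#intro")      # id, prefixed with #
no_match = soup.select("table")    # nothing matches: an empty list, not None

print(isinstance(by_tag, list), len(by_tag))
print(len(by_class), len(by_id), len(no_match))
```

Because the result is always list-like, it can be iterated or indexed directly, but indexing an empty result raises IndexError, so it is worth checking the length first.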
(1) Find by tag name

print(soup.select('title'))
#[<title>The Dormouse's story</title>]

print(soup.select('a'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('b'))
#[<b>The Dormouse's story</b>]

(2) Find by class name

print(soup.select('.sister'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(3) Find by id

print(soup.select('#link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
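Since an id normally identifies a single element, it can be more convenient to get that element directly rather than a one-item list. The sketch below uses select_one() for this; the two-link document is a made-up stand-in for the sample.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link document used only for this illustration.
sample = '<a id="link1" href="/elsie">Elsie</a><a id="link2" href="/lacie">Lacie</a>'
soup = BeautifulSoup(sample, "html.parser")

first = soup.select_one("#link2")    # the first matching Tag, not a list
missing = soup.select_one("#nope")   # None when nothing matches

print(first["href"], first.get_text())
print(missing)
```

select_one() returns None on a miss, so the result should be checked before accessing attributes on it.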

(4) Find child and descendant tags

1. Direct children

print(soup.select("head > title"))
#[<title>The Dormouse's story</title>]
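To see what "direct" means here, the following sketch contrasts a parent that is one level above the target with one that is two levels above. The document is a minimal made-up page.

```python
from bs4 import BeautifulSoup

# Minimal hypothetical page: title is a child of head and a grandchild of html.
sample = "<html><head><title>T</title></head><body></body></html>"
soup = BeautifulSoup(sample, "html.parser")

child = soup.select("head > title")       # head is the immediate parent: match
grandchild = soup.select("html > title")  # html is two levels up: no match
anywhere = soup.select("html title")      # a space allows any depth: match

print(len(child), len(grandchild), len(anywhere))
```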

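2. Indirect descendants

For descendants at any depth, drop the > symbol and separate the selectors with a space.

For example, given this markup: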

<div class="right-content">
    <ul class="news-1" data-sudaclick="news_p">
        <li><a href="http://news.sina.com.cn/c/2018-09-22/doc-ihkhfqnt7246449.shtml" target="_blank">瑞典电视台播辱华节目 我大使馆：挑战人性良知</a></li>
        <li><a href="http://news.sina.com.cn/c/2018-09-22/doc-ihkmwytn5895667.shtml" target="_blank">日本侦察机四天五次绕飞雪龙号 所为何事？</a></li>
        <li><a href="http://news.sina.com.cn/c/2018-09-22/doc-ifxeuwwr7253238.shtml" target="_blank">山西“吕梁头号官霸”敛财10亿 离任时有人送花圈</a></li>
        <li><a href="http://news.sina.com.cn/c/2018-09-22/doc-ihkmwytn5836797.shtml" target="_blank">“中国人素质全球倒数”谣言又来 联合国亲自打脸</a></li>
    </ul>

Running this code:

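import requests
from bs4 import BeautifulSoup

# Fetch the page and decode it as UTF-8.
res = requests.get("https://news.sina.com.cn/china/")
res.encoding = "utf-8"
soup = BeautifulSoup(res.text, "html.parser")

# ">" requires li to be a direct child of .right-content.
for newslink in soup.select(".right-content>li a"):
    print("新闻链接", newslink["href"])  # label means "news link"
    print("新闻标题", newslink.text)     # label means "news title"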

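produces no output: > matches only direct children, and here the li elements sit inside a ul rather than directly under .right-content.

Whereas this code: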

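import requests
from bs4 import BeautifulSoup

# Fetch the page and decode it as UTF-8.
res = requests.get("https://news.sina.com.cn/china/")
res.encoding = "utf-8"
soup = BeautifulSoup(res.text, "html.parser")

# A space matches li at any depth below .right-content.
for newslink in soup.select(".right-content li a"):
    print("新闻链接", newslink["href"])  # label means "news link"
    print("新闻标题", newslink.text)     # label means "news title"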

prints:

新闻链接 http://news.sina.com.cn/c/2018-09-22/doc-ihkhfqnt7246449.shtml
新闻标题 瑞典电视台播辱华节目 我大使馆：挑战人性良知
新闻链接 http://news.sina.com.cn/c/2018-09-22/doc-ihkmwytn5895667.shtml
新闻标题 日本侦察机四天五次绕飞雪龙号 所为何事？
新闻链接 http://news.sina.com.cn/c/2018-09-22/doc-ifxeuwwr7253238.shtml
新闻标题 山西“吕梁头号官霸”敛财10亿 离任时有人送花圈
新闻链接 http://news.sina.com.cn/c/2018-09-22/doc-ihkmwytn5836797.shtml
新闻标题 “中国人素质全球倒数”谣言又来 联合国亲自打脸
新闻链接 http://news.sina.com.cn/c/2018-09-24/doc-ihkmwytn7611790.shtml
新闻标题 美媒：用AI和无人机修复 中国长城还能存在几百年
新闻链接 http://news.sina.com.cn/c/2018-09-24/doc-ihkmwytn7608791.shtml
新闻标题 中国民用航空飞行学院:没有学姐为学妹立规矩之说
新闻链接 http://news.sina.com.cn/c/2018-09-23/doc-ihkmwytn7553959.shtml
新闻标题 人民日报评瑞典播放辱华节目：辱华者须付出代价
新闻链接 http://news.sina.com.cn/c/2018-09-23/doc-ifxeuwwr7543016.shtml
新闻标题 香港各界对高锟逝世深表哀悼：为人为学皆为楷模
新闻链接 http://news.sina.com.cn/c/2018-09-23/doc-ihkmwytn7518406.shtml
新闻标题 瑞典电视台就辱华节目不道歉 我们该置多少气？
新闻链接 http://news.sina.com.cn/c/2018-09-23/doc-ihkmwytn7508396.shtml
新闻标题 这个省会市长空缺近8个月后 迎来新任市委副书记
新闻链接 http://news.sina.com.cn/s/2018-09-23/doc-ihkmwytn7488367.shtml
新闻标题 家长晒官职全网疯传背后 是一件大事已发生质变
新闻链接 http://news.sina.com.cn/c/2018-09-23/doc-ifxeuwwr7531642.shtml
新闻标题 民调称柯文哲满意度6成 竞选对手：政治骗子
新闻链接 http://news.sina.com.cn/c/2018-09-23/doc-ihkmwytn7446955.shtml
新闻标题 辽宁铁岭调兵山市发生2.9级地震 震源深度0千米
新闻链接 http://news.sina.com.cn/c/2018-09-23/doc-ihkmwytn7436720.shtml
新闻标题 环球谈瑞典辱华节目:认为"洋大人"没错
新闻链接 http://news.sina.com.cn/o/2018-09-23/doc-ihkmwytn7417416.shtml
新闻标题 越南将为已故国家主席陈大光举行国葬 降半旗致哀

A single character makes all the difference.
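The same behaviour can be reproduced without a network request. The sketch below uses a made-up stand-in for the div/ul/li/a nesting shown earlier (the URLs and headlines are placeholders) and also shows one way to keep the > combinator by spelling out every level.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page structure; no network access is needed.
sample = """
<div class="right-content">
    <ul class="news-1">
        <li><a href="http://example.com/a">First headline</a></li>
        <li><a href="http://example.com/b">Second headline</a></li>
    </ul>
</div>
"""
soup = BeautifulSoup(sample, "html.parser")

# li would have to be a direct child of the div, but it is inside the ul.
direct = soup.select(".right-content > li a")
# A space lets li sit at any depth below the div.
descendant = soup.select(".right-content li a")
# Every step is a direct child, so this chain matches as well.
chained = soup.select(".right-content > ul > li > a")

print(len(direct), len(descendant), len(chained))
for link in descendant:
    print(link["href"], link.get_text())
```

The chained form is stricter: it stops matching if the page later wraps the list in another element, whereas the descendant form keeps working.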

(5) Combined selectors

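Combined selectors work on the same principle as combining tag names with class names and ids when writing CSS. For example, to find the element whose id is link1 inside a p tag, separate the two parts with a space, because they refer to different nodes: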

print(soup.select('p #link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
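The role of the space can be sketched as follows: parts written together must all hold for the same node, while a space moves on to a descendant node. The short document here is a made-up variant of the sample.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened variant of the sample document.
sample = (
    '<p class="story">'
    '<a class="sister" id="link1" href="/e">Elsie</a>'
    '<a class="sister" id="link2" href="/l">Lacie</a>'
    '</p>'
    '<p class="title">Title</p>'
)
soup = BeautifulSoup(sample, "html.parser")

same_node = soup.select("p.story")           # a p that itself has class "story"
same_node_id = soup.select("a.sister#link2") # an a with that class and that id
inside_p = soup.select("p #link1")           # the #link1 element somewhere inside a p
wrong = soup.select("p#link1")               # a p whose own id is link1: none exists

print(len(same_node), len(same_node_id), len(inside_p), len(wrong))
```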

(6) Find by attribute

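A selector can also test attributes. The attribute goes in square brackets. Note that the attribute and the tag belong to the same node, so there must be no space between them; otherwise nothing will match.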

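print(soup.select('a[class="sister"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))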
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
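A few common attribute tests, and the effect of an accidental space, can be sketched like this. The three-link document is made up for the illustration; the third link deliberately has no href.

```python
from bs4 import BeautifulSoup

# Hypothetical links used only for this illustration.
sample = (
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
    '<a id="anchor">no href</a>'
)
soup = BeautifulSoup(sample, "html.parser")

has_href = soup.select("a[href]")                              # attribute is present
exact = soup.select('a[href="http://example.com/elsie"]')      # exact value
suffix = soup.select('a[href$="lacie"]')                       # value ends with "lacie"
with_space = soup.select("a [href]")  # looks for an element with href INSIDE an a

print(len(has_href), len(exact), len(suffix), len(with_space))
```

With the space, the selector describes a descendant of a, and since these links contain only text, it finds nothing.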

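Attribute tests can likewise be combined with the selectors above: parts that refer to different nodes are separated by a space, and parts that refer to the same node are written without one.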
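As a rough sketch of such a combination, the example below restricts an attribute test to links inside a p. The markup is made up: the same link appears once inside a p and once inside a div.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same link inside a p and inside a div.
sample = (
    '<p class="story"><a class="sister" href="http://example.com/elsie">Elsie</a></p>'
    '<div><a class="sister" href="http://example.com/elsie">Elsie copy</a></div>'
)
soup = BeautifulSoup(sample, "html.parser")

everywhere = soup.select('a[href="http://example.com/elsie"]')  # both links
in_p = soup.select('p a[href="http://example.com/elsie"]')      # only the one inside p
strict = soup.select("p.story > a.sister[href]")                # tag, class and attribute on one node

print(len(everywhere), len(in_p), len(strict))
print(in_p[0].get_text())
```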
