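Copyright notice: unless a reference link is given, all content is original. If anything infringes your rights, please contact the author! https://blog.csdn.net/qq_29647709/article/details/81910328
Many excellent articles appear every day, and some excellent older ones lie buried far back in the archives; if you never go looking for them, you miss them. With a web crawler you can fetch blog articles and pick out the ones you like to read at your leisure.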
0X00 BeautifulSoup Basics
1. Parsing content:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
2. Browsing data:
soup.title
soup.title.string
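The two steps above can be sketched as a minimal, self-contained script. The html_doc string here is a made-up sample, and the built-in 'html.parser' is used in place of 'lxml' so that no extra parser has to be installed; both are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, not taken from any real site.
html_doc = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# 'html.parser' ships with Python; swap in 'lxml' if it is installed.
soup = BeautifulSoup(html_doc, "html.parser")

title_tag = soup.title          # the whole <title> Tag object
title_text = soup.title.string  # only the text inside the tag

print(title_tag)
print(title_text)
```

soup.title returns the tag itself, while soup.title.string returns just its text, which is usually what you want to store.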
3. Using regular expressions:
import re
soup.find_all(name='x', attrs={'xx': re.compile('xxx')})
eg1 (matching on a key attribute):
Match as follows:
bbs_news = soup.find_all(name='a', attrs={'class': 'ui_colorG'})
for news in bbs_news:
    print(news.string)  # prints the text of each matched link
eg2 (regular expression):
bbs_news = soup.find_all(name='a', attrs={'href': re.compile(r'thread-\d*?-1-1\.html')})
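To see both kinds of match side by side, here is a small runnable sketch. The HTML fragment, the link texts, and the thread numbers are invented for illustration; only the class name ui_colorG and the thread-...-1-1.html pattern come from the examples above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical forum listing; the real page layout is an assumption.
html_doc = """
<a class="ui_colorG" href="thread-1024-1-1.html">First post</a>
<a class="ui_colorG" href="thread-2048-1-1.html">Second post</a>
<a class="other" href="forum-2-1.html">Board index</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# eg1: match on an exact attribute value.
by_class = soup.find_all(name="a", attrs={"class": "ui_colorG"})
class_titles = [a.string for a in by_class]

# eg2: match on a compiled pattern; \d matches a digit and \. a literal dot.
pattern = re.compile(r"thread-\d+-1-1\.html")
by_href = soup.find_all(name="a", attrs={"href": pattern})
thread_links = [a["href"] for a in by_href]

print(class_titles)
print(thread_links)
```

Note that the digit class is written \d with a backslash; /d would only match the literal characters "/d".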
4. Using select:
Sample code:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')  # parse the sample so the select calls below have a soup to work on
(1) Find by tag name
print(soup.select('title'))
#[<title>The Dormouse's story</title>]
print(soup.select('a'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select('b'))
#[<b>The Dormouse's story</b>]
(2) Find by class name
print(soup.select('.sister'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
(3) Find by id
print(soup.select('#link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
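(4) Combined lookup
Combined lookup follows the same rules as combining tag names, class names, and ids when writing a CSS file. For example, to find the element whose id is link1 inside a p tag, separate the two parts with a space:
print(soup.select('p #link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]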
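The role of the space is easiest to see by comparing selectors with and without it. The sketch below uses a shortened version of the sample HTML and the built-in 'html.parser'; the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample document (an assumption for brevity).
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# With a space: any element with id link1 located inside a <p>.
descendant = soup.select("p #link1")

# Without a space: an <a> that itself carries id link1.
same_node = soup.select("a#link1")

# Without a space on the wrong tag: a <p> that itself has id link1 (none here).
no_match = soup.select("p#link1")

print(descendant)
print(same_node)
print(no_match)
```

The first two selectors find the same link by different routes, while the third returns an empty list because no p tag has that id.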
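(5) Attribute lookup
A selector can also include an attribute, which must be wrapped in square brackets. Because the attribute and the tag belong to the same node, there must be no space between them; otherwise nothing will match.
# '>' selects a direct child: a title directly inside head
print(soup.select("head > title"))
#[<title>The Dormouse's story</title>]
print(soup.select('a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]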
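As a rough illustration of the no-space rule, and of how to pull usable data out of the list that select returns, consider the sketch below. It again uses a shortened sample and 'html.parser'; the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample document (an assumption for brevity).
html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

# No space: the <a> itself must carry the href attribute.
exact = soup.select('a[href="http://example.com/lacie"]')

# With a space the selector looks for a descendant of <a> with that href,
# and there is none, so the result is empty.
spaced = soup.select('a [href="http://example.com/lacie"]')

# select returns a list of Tag objects; read text and attributes from each.
links = [(a.get_text(), a["href"]) for a in soup.select("p.story a[href]")]

print(exact)
print(spaced)
print(links)
```

Collecting (title, link) pairs like this is the typical shape of the data an article crawler ends up storing.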
0X01 Crawler Examples
Crawling 90BLog articles and links
Crawling the Xianzhi forum (先知论坛)