In the previous chapter we covered the basic syntax of bs4; in this one we put it to work parsing a real page.
Getting the blog article summaries
In the xpath chapter we showed how to scrape the article summaries from my blog. Here we fetch the same content with Beautiful Soup so the two approaches can be compared; the xpath version is at https://blog.csdn.net/lovemenghaibin/article/details/82898280.
The same parsing with Beautiful Soup looks like this:
from bs4 import BeautifulSoup
import requests

url = "https://blog.csdn.net/lovemenghaibin"
header = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"
}
response = requests.get(url, headers=header)
soup = BeautifulSoup(response.text, "lxml")

# Each article lives in a div with class "article-item-box".
articles = soup.select(".article-item-box")
infos = []
for article in articles:
    # The first <a> holds the title; strip the "原" (original) badge text.
    title = article.select_one("a").get_text().replace("原", "").strip()
    href_url = article.select_one("a")["href"]
    # The summary sits in <p class="content"><a>…</a></p>.
    content = article.select_one("p[class=content] > a").get_text().strip()
    # The info box lists date, read count, and comment count in document order.
    article_date = article.select_one(".info-box > p").get_text().strip()
    article_read = article.select(".info-box > p")[1].get_text().replace("阅读数:", "").strip()
    article_comment = article.select(".info-box > p")[2].get_text().replace("评论数:", "").strip()
    info = {
        "title": title,
        "href": href_url,
        "content": content,
        "article_date": article_date,
        "article_read": article_read,
        "article_comment": article_comment,
    }
    infos.append(info)
print(infos)
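To see the two syntaxes side by side without hitting the network, here is a minimal sketch on an invented HTML fragment (the class names mirror those used above, but the snippet itself is made up for illustration) that extracts the same title once with a Beautiful Soup CSS selector and once with an lxml xpath:

```python
from bs4 import BeautifulSoup
from lxml import etree

# Made-up fragment mimicking the structure of the article list.
html = """
<div class="article-item-box">
  <a href="/article/1">First post</a>
  <p class="content"><a>A short summary</a></p>
</div>
"""

# CSS-selector style with Beautiful Soup:
soup = BeautifulSoup(html, "lxml")
title_css = soup.select_one(".article-item-box > a").get_text().strip()

# The equivalent xpath with lxml:
tree = etree.HTML(html)
title_xpath = tree.xpath('//div[@class="article-item-box"]/a/text()')[0].strip()

print(title_css)    # First post
print(title_xpath)  # First post
```

Both lines locate the same node; which one reads better is largely a matter of taste, though CSS selectors tend to be shorter when you are matching on classes and ids.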
Getting the profile information
from bs4 import BeautifulSoup
import requests

url = "https://blog.csdn.net/lovemenghaibin"
header = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"
}
response = requests.get(url, headers=header)
soup = BeautifulSoup(response.text, "lxml")

# The element with id "uid" carries the author's name and home link.
home_url = soup.select_one("#uid")["href"]
name = soup.select_one("#uid").get_text()
profile = soup.select_one("#asideProfile")
# The data-info <dl> entries appear in a fixed order:
# articles, fans, followed, comments.
article_count = soup.select("div .data-info dl")[0].select_one(".count").get_text()
# Fans can also be reached by id, which is more robust than indexing.
fans = profile.select_one("#fanBox dd").get_text()
attentions = soup.select("div .data-info dl")[2].select_one("dd").get_text()
comment_count = soup.select("div .data-info dl")[3].select_one("dd").get_text()
# The grade-box <dl> entries: level, total reads, points, rank.
blog_level = soup.select("div .grade-box dl")[0].select_one("a")["title"].split(",")[0]
read_total_count = soup.select("div .grade-box dl")[1].select_one("dd").get_text().strip()
point_count = soup.select("div .grade-box dl")[2].select_one("dd").get_text().strip()
rank = soup.select("div .grade-box dl")[3]["title"]
info = {
    "name": name,
    "home_url": home_url,
    "article_count": article_count,
    "fans": fans,
    "attentions": attentions,
    "comment_count": comment_count,
    "blog_level": blog_level,
    "read_total_count": read_total_count,
    "point_count": point_count,
    "rank": rank,
}
print(info)
Summary
Comparing this chapter with the previous one, you can see that in Beautiful Soup plain CSS selectors are enough to cover all of these basic operations, and quite simply. Rather than hunting for patterns in the text itself, prefer selecting by attributes (ids and classes), which depends less on the document's structure than selecting by position. Only when an element has no identifying attribute should we fall back to selecting the whole list with CSS and indexing the n-th tag.
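To make that last point concrete, here is a small sketch on an invented fragment (the structure loosely imitates the info box above, but the ids and values are made up): one element has a distinguishing id and is selected directly; its siblings have nothing to distinguish them, so we take the list and index it.

```python
from bs4 import BeautifulSoup

# Invented fragment: only the first <p> has an identifying attribute.
html = """
<div class="info-box">
  <p id="date">2018-10-01</p>
  <p>阅读数:100</p>
  <p>评论数:3</p>
</div>
"""
soup = BeautifulSoup(html, "lxml")

# Preferred: select by a stable attribute when one exists.
date = soup.select_one("#date").get_text()

# Fallback: when tags are indistinguishable, take the list and index it.
reads = soup.select(".info-box > p")[1].get_text()

print(date)   # 2018-10-01
print(reads)  # 阅读数:100
```

The id-based lookup keeps working even if the page reorders the paragraphs; the index-based one breaks as soon as the order changes, which is why it should be the last resort.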