Copyright notice: this is an original article by the author and may not be reproduced without permission. https://blog.csdn.net/csdnlinyongsheng/article/details/85045133
Preparation
The Douban Books URL is: https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=880&type=T
The items marked with red arrows are the information we want to extract. Once we know our target, we can locate it in the page source and parse the source to pull out the data. But how do we fetch the source in the first place? The requests library solves this. The code is as follows:
import requests
resp = requests.get('https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=0&type=T')
print(resp.text)
Running this gives us the HTML. The next question: now that we have the HTML, how do we extract the information we actually want?
Open the browser, press F12, and locate the target information in the page source, as shown in the figure:
You can see that the book title sits in the a tag inside the h2 tag of the div with class='info'. Having found the target's location, we can use BeautifulSoup to build a parse-tree object that renders the HTML with standard indentation. The code:
# If bs4 and lxml are not yet installed in your Python environment, install them first: pip install bs4 and pip install lxml
from bs4 import BeautifulSoup
soup = BeautifulSoup(resp.text, 'lxml')
I recommend the lxml parser because it is fast. If you would rather avoid the extra install, Python's built-in html.parser works just as well; it makes no difference to our result.
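To see what these selectors return before running the full crawler, here is a minimal sketch run against a hand-written HTML fragment that imitates Douban's book-list markup (the fragment and its values are made up for illustration; html.parser is used so no extra install is needed):

```python
from bs4 import BeautifulSoup

# A made-up fragment imitating Douban's book-list markup
html = '''
<div class="info">
  <h2><a href="https://book.douban.com/subject/1" title="活着">活着</a></h2>
  <div class="pub">余华 / 作家出版社 / 2012-8</div>
  <span class="rating_nums">9.4</span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
info = soup.find('div', class_='info')
title = info.find('a')['title']                         # the a tag inside the info div
pub = soup.find(class_='pub').get_text().strip()        # author / publisher line
rating = soup.find(class_='rating_nums').get_text().strip()
print(title, pub, rating)
```

The same find_all / select calls used below behave identically on the real pages, just returning one element per book instead of one.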
The full Douban Books scraper:
import requests
from bs4 import BeautifulSoup

def get_html(url):
    # Send a browser-like User-Agent so Douban does not reject the request
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    resp = requests.get(url, headers=headers).text
    return resp

def html_parser():
    for url in all_page():
        soup = BeautifulSoup(get_html(url), 'lxml')
        # Book titles
        allDiv = soup.find_all('div', class_='info')
        names = [a.find('a')['title'] for a in allDiv]
        # Author / publisher line
        pubs = soup.find_all(class_='pub')
        versions = [i.get_text().strip() for i in pubs]
        # Ratings
        ratingNums = soup.find_all(class_='rating_nums')
        ratings = [i.get_text().strip() for i in ratingNums]
        # Short descriptions
        allDiv2 = soup.select('.info p')
        jianjie = [i.get_text().strip() for i in allDiv2]
        for name, version, rating, p in zip(names, versions, ratings, jianjie):
            name = "Title: " + str(name) + "\n"
            version = "Author: " + str(version) + "\n"
            rating = "Rating: " + str(rating) + "\n"
            p = "Description: " + str(p) + "\n"
            data = name + version + rating + p
            f.writelines(data + "==================" + "\n")

def all_page():
    # Build every page URL: Douban shows 20 books per page
    base_url = "https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start="
    urlList = []
    for page in range(0, 900, 20):
        allurl = base_url + str(page)
        urlList.append(allurl)
    return urlList

filename = "豆瓣读书.txt"
f = open(filename, 'w', encoding="utf-8")
html_parser()
f.close()
print("Saved successfully")
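The record-assembly step inside html_parser can be checked on its own with dummy lists, no network needed (the sample values below are invented):

```python
# Invented sample data standing in for the scraped lists
names = ["Book A", "Book B"]
versions = ["Author A / Press A / 2001", "Author B / Press B / 2002"]
ratings = ["8.8", "9.1"]
intros = ["Intro A", "Intro B"]

# Same zip-based pairing as in html_parser: one record per book,
# separated by a line of '=' characters
records = []
for name, version, rating, p in zip(names, versions, ratings, intros):
    data = ("Title: " + name + "\n"
            "Author: " + version + "\n"
            "Rating: " + rating + "\n"
            "Description: " + p + "\n")
    records.append(data + "==================\n")

print(len(records))  # one record per book
```

Note that zip stops at the shortest list, so if one selector misses an element on a page, the remaining fields silently shift out of alignment; that is a known limitation of this pairing approach.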