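The high-rated e-books shared on 懒人盘 were found by batch-querying Douban with Python.
The regular Douban API can no longer be called, but after some searching I found this endpoint: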
https://book.douban.com/j/subject_suggest?q=书名
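With this endpoint (replace 书名 with the book title) you can get the book's URL on Douban.
Function for getting the URL of a single book: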
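def get_book(title):
    # Query the suggest endpoint; it returns a JSON list of candidate books.
    url = "https://book.douban.com/j/subject_suggest?q=%s" % title
    # get_headers() is a helper not shown in this post; it should return request
    # headers such as the User-Agent dict used in the full code at the end.
    rsp = requests.get(url, headers=get_headers())
    rs_dict = json.loads(rsp.text)
    # Take the detail-page URL of the first candidate.
    url_ = rs_dict[0]['url']
    print(url_)
    return get_detail(url_)

get_book("红楼梦")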
Running this function prints the URL:
https://book.douban.com/subject/1007305/
As you can see, this is the book's Douban detail page.
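The lookup step can also be written a little more defensively. The sketch below is not the author's code: `build_suggest_url` and `first_subject_url` are hypothetical helper names, and it assumes the endpoint returns a JSON list of objects with a `url` key, as the indexing `rs_dict[0]['url']` above suggests. It percent-encodes the title and returns `None` instead of raising when the list is empty.

```python
from urllib.parse import urlencode

SUGGEST_ENDPOINT = "https://book.douban.com/j/subject_suggest"


def build_suggest_url(title):
    """Return the suggest URL with the title percent-encoded as UTF-8."""
    return SUGGEST_ENDPOINT + "?" + urlencode({"q": title})


def first_subject_url(suggestions):
    """Return the 'url' of the first suggestion, or None if there is none."""
    if not suggestions:
        return None
    return suggestions[0].get("url")


# Example with an already-parsed response (no network access needed).
print(build_suggest_url("红楼梦"))
print(first_subject_url([{"url": "https://book.douban.com/subject/1007305/"}]))
print(first_subject_url([]))
```

Returning `None` lets the caller decide what to do with a title that has no match, rather than stopping the whole batch with an `IndexError`.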
Next, scrape the rating from the detail page:
def get_detail(url):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
    # url = "https://book.douban.com/subject/34869428/"
    web_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(web_data.text, 'lxml')
    # The rating is the <strong> element inside the #interest_sectl block.
    rank = soup.select('#interest_sectl > div > div.rating_self.clearfix > strong')[0].get_text().strip()
    print(rank)
    return rank

get_detail("https://book.douban.com/subject/34869428/")
Running the function above prints the rating: 9.6
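The selector returns the rating as text, so it has to be converted before it can be compared with a threshold. A minimal sketch, assuming the text may be empty or non-numeric on some pages (for example a book without enough ratings); `parse_rating` is a hypothetical helper, not part of the original code.

```python
def parse_rating(text):
    """Convert rating text such as ' 9.6 ' to a float, or None if it is not a number."""
    text = (text or "").strip()
    try:
        return float(text)
    except ValueError:
        # Empty or non-numeric text: treat the book as unrated.
        return None


print(parse_rating(" 9.6 "))
print(parse_rating(""))
```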
Full code
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests, time, json

# Browser-like User-Agent shared by every request.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}


def get_detail(url):
    # Fetch the detail page and read the rating from the #interest_sectl block.
    # url = "https://book.douban.com/subject/34869428/"
    web_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(web_data.text, 'lxml')
    rank = soup.select('#interest_sectl > div > div.rating_self.clearfix > strong')[0].get_text().strip()
    # print(rank)
    return rank


def get_book(title):
    # Look the title up through the suggest endpoint and take the first match.
    url = "https://book.douban.com/j/subject_suggest?q=%s" % title
    rsp = requests.get(url, headers=headers)
    rs_dict = json.loads(rsp.text)
    # print(rs_dict)
    url_ = rs_dict[0]['url']
    # print(url_)
    return url_, get_detail(url_)


if __name__ == '__main__':
    book_list = ["红楼梦", "三国演义", "水浒传", "西游记"]
    for i in book_list:
        url, rank = get_book(i)
        print(i, rank, url)
Running the code above directly produces:
红楼梦 9.6 https://book.douban.com/subject/1007305/
三国演义 9.2 https://book.douban.com/subject/1019568/
水浒传 8.6 https://book.douban.com/subject/1008357/
西游记 8.9 https://book.douban.com/subject/1029553/
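To keep things easy to follow, the example uses a plain list containing the Four Great Classical Novels. For large-scale crawling, I recommend storing the results in a database and adding a delay between requests to avoid being blocked. You can also add proxies, cookies, multithreading and so on, which I won't go into here.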
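The database and delay suggestions could look roughly like the sketch below. It is an illustration under assumptions rather than the author's setup: SQLite is only one possible database, `crawl_to_db`, the `books` table, and the 2-second default delay are all made up for this example, and `fetch` stands for any function that takes a title and returns `(url, rating)`, such as `get_book` from the full code.

```python
import sqlite3
import time


def crawl_to_db(titles, fetch, conn, delay=2.0):
    """Fetch each title, pausing between requests, and store the results in SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS books "
        "(title TEXT PRIMARY KEY, rating TEXT, url TEXT)"
    )
    for index, title in enumerate(titles):
        if index:
            # Pause before every request except the first one.
            time.sleep(delay)
        try:
            url, rating = fetch(title)
        except Exception as exc:
            # One failed lookup should not stop the whole batch.
            print("skipped", title, exc)
            continue
        conn.execute(
            "INSERT OR REPLACE INTO books (title, rating, url) VALUES (?, ?, ?)",
            (title, rating, url),
        )
        conn.commit()
    return conn


# Usage with a stand-in fetch function, so the example runs without network access.
def fake_fetch(title):
    return "https://example.invalid/" + title, "9.0"


demo = crawl_to_db(["a", "b"], fake_fetch, sqlite3.connect(":memory:"), delay=0)
print(demo.execute("SELECT title, rating, url FROM books").fetchall())
```

Committing after each row means an interrupted run keeps what it has already collected, and `INSERT OR REPLACE` makes it safe to run the same list again.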
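One drawback of the https://book.douban.com/j/subject_suggest?q=书名 endpoint is that some books cannot be found through it. The workaround is to use selenium to simulate a manual search (selenium can handle almost anything; its downside is that it is rather slow). Still, the endpoint already covers the vast majority of book lookups, and filtering its results by rating is how the high-rated books in the article above were picked. Since 20 books have already been selected for you, there was no need to crawl again with selenium.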
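The rating filter mentioned here can be sketched as follows. The cutoff of 9.0 and the name `filter_high_rated` are assumptions made for illustration, since the post does not state the threshold it used. The sample rows are the four results printed earlier.

```python
def filter_high_rated(results, threshold=9.0):
    """Keep (title, rating, url) rows rated at or above threshold, highest first."""
    kept = [row for row in results if float(row[1]) >= threshold]
    return sorted(kept, key=lambda row: float(row[1]), reverse=True)


results = [
    ("红楼梦", "9.6", "https://book.douban.com/subject/1007305/"),
    ("三国演义", "9.2", "https://book.douban.com/subject/1019568/"),
    ("水浒传", "8.6", "https://book.douban.com/subject/1008357/"),
    ("西游记", "8.9", "https://book.douban.com/subject/1029553/"),
]

for title, rating, url in filter_high_rated(results):
    print(title, rating, url)
```

With a threshold of 9.0, only 红楼梦 and 三国演义 remain from this sample.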