Python Web Scraping: Downloading Daomu Biji (盗墓笔记) by Parsing HTML with bs4

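Preface:

A recent assignment needed a web scraper. The site I scraped then was Lagou (拉勾网), which returns JSON, so I simply read the data as dictionaries.

This time I also wanted to get familiar with using bs4 to parse responses that come back as HTML.

I scraped a simple site: http://www.seputu.com

I read through https://www.cnblogs.com/insane-Mr-Li/p/9117005.html and then started writing my own version; the basic idea is much the same.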

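Notes on the main usage:

Inspecting the page elements shows that the link and name of every chapter are stored inside <li></li> tags.

So the first step is to find these <li></li> tags with bs4.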

import requests
from bs4 import BeautifulSoup

url = 'http://www.seputu.com'
response = requests.get(url)
req_parser = BeautifulSoup(response.text, features="html.parser")  # <class 'bs4.BeautifulSoup'>
li = req_parser.find_all('li')  # <class 'bs4.element.ResultSet'>
# li = req_parser.findAll('li')  # equivalent to the previous line (older alias)
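To try the same call without touching the network, here is a minimal sketch that runs find_all on a small inline HTML string. The markup and the example.com links are made up for illustration; they are not the real structure of seputu.com.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only loosely modeled on a chapter list
SAMPLE_HTML = """
<ul>
  <li><a href="http://example.com/1.html">Chapter 1</a></li>
  <li><a href="http://example.com/2.html">Chapter 2</a></li>
  <li>No link here</li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, features="html.parser")
items = soup.find_all('li')  # a ResultSet, which behaves like a list of Tag objects

print(type(items).__name__)     # class name of the returned collection
print(len(items))               # number of <li> tags found
print(type(items[0]).__name__)  # class name of a single element
```

Because the result behaves like a list, it can be indexed, sliced, and looped over like any other Python list.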

Next, extract the links and names. There are two ways to do this, and they differ only slightly:


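1. Use the find method. li is a <class 'bs4.element.ResultSet'>, which has no find or find_all method of its own, while each element i is a <class 'bs4.element.Tag'>, so find is called on every i in turn.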

name_list = []
href_list = []
for i in li:
    a_tag = i.find('a')
    # Skip <li> items that contain no link or whose <a> has no href
    if a_tag is None or not a_tag.has_attr('href'):
        continue
    name_list.append(a_tag.text)
    href_list.append(a_tag['href'])

2. Convert li into a <class 'bs4.BeautifulSoup'> object and keep using find_all to search within the li results.

# Re-parse the string form of the ResultSet so that find_all can be called on the result
temp = BeautifulSoup(str(li), features="html.parser")
a = temp.find_all('a')
name_list = []
href_list = []
for i in a:
    name = i.string
    href = i['href']
    name_list.append(name)
    href_list.append(href)
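The two approaches can also be combined without re-parsing, since every Tag supports find_all directly. The sketch below assumes made-up markup and example.com links; the href=True filter keeps only <a> tags that actually carry an href attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration, not the real page
SAMPLE_HTML = """
<ul>
  <li><a href="http://example.com/1.html">Chapter 1</a></li>
  <li><a href="http://example.com/2.html">Chapter 2</a></li>
  <li><a name="anchor-only">No href</a></li>
  <li>No link here</li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, features="html.parser")

name_list = []
href_list = []
for item in soup.find_all('li'):
    # find_all is available on each Tag, so no second parse is needed
    for link in item.find_all('a', href=True):
        name_list.append(link.get_text(strip=True))
        href_list.append(link['href'])

print(name_list)
print(href_list)
```

Filtering on href=True avoids the KeyError that link['href'] would raise for an anchor without that attribute.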

Here the content between <a> and </a> is read through the text or string attribute.

It can also be read with the findChildren method:

i.find('a').findChildren(text=True)[0]
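The text and string attributes agree only when the tag holds a single piece of text. A small sketch with made-up markup shows where they differ: string is None once the tag has more than one child, while text joins all nested text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first link contains a nested <b> tag
soup = BeautifulSoup(
    '<a href="#x">Chapter <b>1</b></a><a href="#y">Chapter 2</a>',
    features="html.parser",
)
nested, plain = soup.find_all('a')

print(plain.string)   # single text child, so string and text match
print(plain.text)
print(nested.string)  # two children (text plus <b>), so string is None
print(nested.text)    # text concatenates every nested string

# string=True collects all text nodes under the tag
pieces = list(nested.find_all(string=True))
print(pieces)
```

This is why the loop that uses i.string needs a check for None, whereas a loop based on text does not.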

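With the names and links in hand, the next step is to pull the text out of each link:

Inspecting the page again shows that the novel text sits in <p></p> tags inside <div class="content-body">.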

# page is the index of the chapter to fetch
response = requests.get(href_list[page])
req_parser = BeautifulSoup(response.content.decode('utf-8'), features="html.parser")
div = req_parser.find_all('div', class_="content-body")
# div = req_parser.find_all('div', {"class": "content-body"})  # equivalent to the previous line

After that, the <p> tags are looked up inside div in the same way as before, so I will not repeat the details.

Complete code:
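For reference, that step can be sketched on an inline page. The markup below is invented for illustration and only assumes the <div class="content-body"> wrapper described above.

```python
from bs4 import BeautifulSoup

# Hypothetical chapter page; only the content-body class comes from the article
SAMPLE_PAGE = """
<div class="content-body">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <p></p>
</div>
<div class="footer"><p>Not part of the chapter.</p></div>
"""

soup = BeautifulSoup(SAMPLE_PAGE, features="html.parser")

paragraphs = []
for div in soup.find_all('div', class_="content-body"):
    for p in div.find_all('p'):
        text = p.get_text(strip=True)
        if text:  # skip empty <p> tags
            paragraphs.append(text)

print(paragraphs)
```

Searching inside each matching div keeps paragraphs from other parts of the page, such as the footer here, out of the result.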

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

url = 'http://www.seputu.com'
response = requests.get(url)
req_parser = BeautifulSoup(response.content.decode('utf-8'), features="html.parser")
li = req_parser.find_all('li')
# Re-parse the string form of the ResultSet so that find_all can be called on the result
temp = BeautifulSoup(str(li), features="html.parser")
a = temp.find_all('a')
name_list = []
href_list = []
for i in a:
    name = i.string
    href = i['href']
    name_list.append(name)
    href_list.append(href)


def download(page):
    """Fetch chapter number `page` and append its title and text to novel.txt."""
    response = requests.get(href_list[page])
    req_parser = BeautifulSoup(response.content.decode('utf-8'), features="html.parser")
    div = req_parser.find_all('div', class_="content-body")
    temp = BeautifulSoup(str(div), features="html.parser")
    paragraphs = temp.find_all('p')
    text = []
    for p in paragraphs:
        line = p.string
        if line is not None:
            # Drop characters a GBK console cannot display before printing
            print(line.encode('gbk', 'ignore').decode('gbk', 'ignore'))
            text.append(line)
    with open('novel.txt', 'a+', encoding='utf-8') as f:
        f.write(name_list[page])
        f.write('\n')
        for line in text:
            f.write(line)
            f.write('\n')


for i in range(len(href_list)):
    try:
        download(i)
    except Exception as e:
        # Report the failure instead of hiding it, then move on to the next chapter
        print('%d failed: %s' % (i, e))
    print('%d is over' % i)

The txt file produced by the scrape ends up with more than 9,000 lines.
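A run over this many pages can stall on a single slow request, so one possible hardening is sketched below. The helper names fetch_html and save_chapter and the 10-second timeout are my own assumptions, not part of the original script; the UTF-8 encoding follows the decoding used in the complete code.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn HTTP 4xx/5xx responses into exceptions
    except requests.RequestException as exc:
        print('failed to fetch %s: %s' % (url, exc))
        return None
    response.encoding = 'utf-8'  # assumption: pages are UTF-8, as decoded above
    return response.text


def save_chapter(path, title, paragraphs):
    """Append one chapter (title line, then one paragraph per line) to path."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(title + '\n')
        for line in paragraphs:
            f.write(line + '\n')
```

A caller would check for None from fetch_html before parsing, which makes failed chapters visible in the log rather than silently missing from the output file.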

fff_zrx
