Python Crawler: Scraping 古诗文网 (gushiwen.cn) - 代码天地


Mobile Development 2023-05-05 01:15:01

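This post uses regular expressions and a minimal amount of code to scrape poem information from the site. It is mainly a regular-expression exercise, so many follow-up features are not implemented.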

Let's look at the result first: [image: screenshot of the output]. Note: only part of the result is shown.

Here is the code:

import re
import requests

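def parse_url(url):
    # Fetch one listing page and print the poems found on it.
    headers = {
        # Placeholder: replace with the User-Agent string of your own browser.
        'User-Agent': 'replace-with-your-own-User-Agent'
    }
    response = requests.get(url, headers=headers)
    text = response.text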
    # print(text)
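    # re.DOTALL lets '.' match newlines, so a pattern can span several lines of HTML.
    # Title: the first <b>...</b> after each <div class="cont">.
    titles = re.findall(r'<div\sclass="cont">.*?<b>(.*?)</b>', text, re.DOTALL)
    # Author: the text of the first <a> following <p class="source">.
    authors = re.findall(r'<p\sclass="source">.*?<a .*?>(.*?)</a>', text, re.DOTALL)
    # Dynasty: the text of the second <a> following <p class="source">.
    dynasties = re.findall(r'<p\sclass="source">.*?<a .*?>.*?<a .*?>(.*?)</a>', text, re.DOTALL)
    # Body: everything inside <div class="contson" ...>, still containing HTML tags.
    contents = re.findall(r'<div\sclass="contson".*?>(.*?)</div>', text, re.DOTALL)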
    all_contents = []  # cleaned poem bodies
    for raw in contents:
        # Replace every remaining HTML tag (such as <br />) with a space.
        content = re.sub(r'<.*?>', " ", raw)
        all_contents.append(content.strip())
    poems = []
    # zip pairs up the i-th item of each of the four lists.
    for title, author, dynasty, content in zip(titles, authors, dynasties, all_contents):
        poem = {
            'title': title,
            'author': author,
            'dynasty': dynasty,
            'content': content
        }
        poems.append(poem)  # add this poem to the result list
    print(poems)


def main():
    # range(1, 5) yields 1..4, so this crawls listing pages 1 through 4.
    for i in range(1, 5):
        url = 'https://www.gushiwen.cn/default_{}.aspx'.format(i)
        parse_url(url)


if __name__ == '__main__':
    main()
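
To see what the four patterns capture without sending any requests, the sketch below runs them against a small hand-written HTML fragment. The fragment `SAMPLE_HTML` and the helper name `extract_poems` are assumptions made for illustration; the real markup of gushiwen.cn may differ or change over time, in which case the patterns would need adjusting.

```python
import re

# Hand-written fragment that imitates the structure the patterns expect.
# It is an assumption for illustration, not a copy of the real page.
SAMPLE_HTML = """
<div class="cont">
  <p><a href="/shiwen_1.aspx"><b>静夜思</b></a></p>
  <p class="source"><a href="/author_1.aspx">李白</a><a href="/dynasty_1.aspx">〔唐代〕</a></p>
  <div class="contson" id="contson1">
    床前明月光，<br />疑是地上霜。
  </div>
</div>
<div class="cont">
  <p><a href="/shiwen_2.aspx"><b>春晓</b></a></p>
  <p class="source"><a href="/author_2.aspx">孟浩然</a><a href="/dynasty_2.aspx">〔唐代〕</a></p>
  <div class="contson" id="contson2">
    春眠不觉晓，<br />处处闻啼鸟。
  </div>
</div>
"""


def extract_poems(text):
    """Apply the same four patterns as parse_url and return a list of dicts."""
    titles = re.findall(r'<div\sclass="cont">.*?<b>(.*?)</b>', text, re.DOTALL)
    authors = re.findall(r'<p\sclass="source">.*?<a .*?>(.*?)</a>', text, re.DOTALL)
    dynasties = re.findall(
        r'<p\sclass="source">.*?<a .*?>.*?<a .*?>(.*?)</a>', text, re.DOTALL)
    contents = re.findall(r'<div\sclass="contson".*?>(.*?)</div>', text, re.DOTALL)
    # Strip leftover tags such as <br /> and trim surrounding whitespace.
    cleaned = [re.sub(r'<.*?>', " ", raw).strip() for raw in contents]
    return [
        {'title': t, 'author': a, 'dynasty': d, 'content': c}
        for t, a, d, c in zip(titles, authors, dynasties, cleaned)
    ]


if __name__ == '__main__':
    for poem in extract_poems(SAMPLE_HTML):
        print(poem)
```

The non-greedy `.*?` is what keeps each match inside a single poem: a greedy `.*` combined with `re.DOTALL` would run on to the last closing tag in the whole page.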

Note: the main code in this post works as is!
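
One detail of the script is worth keeping in mind: `zip` stops at the shortest input, so if one pattern fails to match for some poem, the remaining fields can pair up with the wrong entries without raising any error. A minimal sketch of a length check follows; the helper name `combine_fields` is hypothetical and not part of the original code.

```python
def combine_fields(titles, authors, dynasties, contents):
    """Zip the four lists into poem dicts, refusing lists of unequal length."""
    lengths = [len(titles), len(authors), len(dynasties), len(contents)]
    if len(set(lengths)) != 1:
        # A mismatch means at least one pattern missed or over-matched.
        raise ValueError(
            'field lists differ in length (titles, authors, dynasties, '
            'contents): {}'.format(lengths))
    return [
        {'title': t, 'author': a, 'dynasty': d, 'content': c}
        for t, a, d, c in zip(titles, authors, dynasties, contents)
    ]


if __name__ == '__main__':
    # Plain zip silently drops the unmatched second title.
    print(list(zip(['t1', 't2'], ['a1'])))
    try:
        combine_fields(['t1', 't2'], ['a1'], ['d1', 'd2'], ['c1', 'c2'])
    except ValueError as exc:
        print('rejected:', exc)
```

Raising early makes a change in the page markup visible right away instead of producing poems with mismatched authors or dynasties.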


Reposted from blog.csdn.net/m0_48915964/article/details/115575182
