Scraping Static Web Pages with requests and BeautifulSoup


Category: Other | Posted: 2021-03-19 21:25:21


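1. Overview

This example uses requests and BeautifulSoup to scrape the title, date, publisher, and content of the news items on the first 2 pages of the "经院要闻" (campus headlines) section of the Hubei University of Economics news site.
2. Approach
First open the listing page (http://news.hbue.edu.cn/jyyw/list.htm) in a browser, right-click, and choose "Inspect" to open the developer tools.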

Each listing page has a URL of the form http://news.hbue.edu.cn/jyyw/list + number + .htm, so the individual listing pages can be fetched by varying that number.
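As a small illustration of the URL pattern described above, the listing-page addresses can be generated before any request is made. The helper name `page_urls` is hypothetical and is not part of the original post; the sketch assumes page numbers start at 1, as the code in section 3 does.

```python
# URL template taken from the pattern described in the text.
LIST_TEMPLATE = "http://news.hbue.edu.cn/jyyw/list{}.htm"


def page_urls(first, last):
    """Return the listing-page URLs for pages first..last (inclusive).

    Hypothetical helper for illustration only.
    """
    return [LIST_TEMPLATE.format(n) for n in range(first, last + 1)]


urls = page_urls(1, 2)
for u in urls:
    print(u)
```

Building the list up front keeps the page range in one place, which makes it easy to change how many pages are scraped.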

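On each listing page, the link to every article sits inside a span with class="Article_Title", so the article URLs can be extracted from those elements and each article page fetched in turn.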
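The link extraction described above can be tried offline on a small HTML fragment. This is only a sketch: the markup in `sample_html` is made up to imitate the structure named in the text and the real page may differ, the helper name `article_links` is hypothetical, and it uses Python's built-in `html.parser` instead of `lxml` so that no extra parser is needed. It also uses `urllib.parse.urljoin`, which handles both relative and already-absolute links, as an alternative to checking the prefix by hand.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

SITE = "http://news.hbue.edu.cn"

# Made-up fragment imitating the listing page; the hrefs are placeholders.
sample_html = """
<ul>
  <li><span class="Article_Title"><a href="/a/page1.htm">Story A</a></span></li>
  <li><span class="Article_Title"><a href="http://news.hbue.edu.cn/b/page2.htm">Story B</a></span></li>
</ul>
"""


def article_links(html, base=SITE):
    """Collect absolute article URLs from span.Article_Title elements."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for span in soup.find_all("span", class_="Article_Title"):
        a = span.find("a", href=True)  # skip spans that contain no link
        if a is not None:
            # urljoin leaves absolute URLs untouched and resolves relative ones
            links.append(urljoin(base, a["href"]))
    return links


links = article_links(sample_html)
for link in links:
    print(link)
```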
3. Code

import re

import requests
from bs4 import BeautifulSoup

BASE_URL = 'http://news.hbue.edu.cn'


def getnews(newurl):
    # Download one article page and print its title, publisher, and date
    html = requests.get(newurl)
    bs = BeautifulSoup(html.content, 'lxml')
    the_title = bs.find(name='h1', class_='arti_title')
    # Remove spaces from the title with a regular expression
    title = re.sub(' ', '', the_title.string)
    publisher = bs.find(name='span', attrs={'class': 'arti_publisher'})
    date = bs.find(class_='arti_update')
    # .string gives the text content of a matched node
    print(title)
    print(publisher.string)
    print(date.string)


for page in range(1, 3):
    url = BASE_URL + '/jyyw/list' + str(page) + '.htm'
    # Fetch the listing page with the requests get method
    html = requests.get(url)
    # Pass the raw bytes (.content) rather than .text so that
    # BeautifulSoup can detect the page encoding itself
    bs = BeautifulSoup(html.content, 'lxml')
    newurlset = bs.find_all(name='span', attrs={'class': 'Article_Title'})
    # find_all returns a list-like collection of Tag objects, so loop over it
    for item in newurlset:
        # item.a is the first <a> tag inside the span
        href = item.a.attrs['href']
        # One article link is already absolute, unlike the others,
        # so only prepend the site prefix when it is missing
        if BASE_URL in href:
            newurl = href
        else:
            newurl = BASE_URL + href
        getnews(newurl)
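A note on the `.string` attribute used in the code above: it only returns text when a tag has a single text child, and it is `None` when the tag contains nested tags. The sketch below shows the difference on a made-up fragment (not taken from the real site) and a hypothetical helper, `text_of`, that falls back to `get_text()`. It uses the built-in `html.parser` so it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the h1 holds a nested <b>, the span holds plain text.
html = (
    '<h1 class="arti_title"> Campus <b>news</b> today </h1>'
    '<span class="arti_update">2020-10-27</span>'
)
soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1", class_="arti_title")
date = soup.find("span", class_="arti_update")


def text_of(tag):
    """Return a tag's visible text, or '' if the tag was not found.

    Hypothetical helper for illustration only.
    """
    if tag is None:
        return ""
    # Strip each text piece and join the pieces with single spaces
    return tag.get_text(" ", strip=True)


print(title.string)    # None, because the h1 has more than one child
print(text_of(title))  # text of the h1 including the nested tag
print(date.string)     # plain text of the span
```

Using `get_text()` also avoids a `TypeError` in `re.sub` when `.string` turns out to be `None` on a page whose title contains markup.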


Reposted from blog.csdn.net/sgsdsdd/article/details/109325059
