urllib in Practice 4--News Crawler (020)

Other 2018-05-07 10:18:17

1: Requirements and approach

Requirement: download every news article linked from the Sina News home page (http://news.sina.com.cn/) to local disk.

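Approach: fetch the home page first, extract all news links from it with a regular expression, then fetch each article in turn and save it locally.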
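The link-extraction step can be tried offline before touching the network. The sketch below is only an illustration: `SAMPLE_HTML`, `LINK_PATTERN`, and `extract_news_links` are hypothetical names, the sample markup is made up, and the de-duplication step is an assumption that is not part of the original program. The pattern is the same as the one used in the full code, except that the dots in the host name are escaped.

```python
import re

# Hypothetical stand-in for the home-page HTML; the real page is far larger.
SAMPLE_HTML = '''
<a href="http://news.sina.com.cn/c/2018-05-07/doc-a.shtml">A</a>
<a href="http://sports.sina.com.cn/x.shtml">B</a>
<a href="http://news.sina.com.cn/w/2018-05-07/doc-b.shtml">C</a>
<a href="http://news.sina.com.cn/c/2018-05-07/doc-a.shtml">A again</a>
'''

# Non-greedy group: capture everything up to the closing quote of the href.
LINK_PATTERN = re.compile(r'href="(http://news\.sina\.com\.cn/.*?)"')


def extract_news_links(html):
    """Return news links in page order, dropping repeated URLs."""
    seen = set()
    links = []
    for url in LINK_PATTERN.findall(html):
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


links = extract_news_links(SAMPLE_HTML)
print(links)
```

Only the two distinct news.sina.com.cn links survive; the sports link does not match the pattern, and the repeated link is skipped.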

2: Hands-on

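Run the program, then check the crawl results: each fetched page is saved as a numbered .html file in the output directory.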

3: Complete code:

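from urllib import request
import urllib.error  # needed for the except clause below
import re

# Fetch the home page and decode it, ignoring undecodable bytes.
data = request.urlopen("http://news.sina.com.cn/").read()
data2 = data.decode("utf-8", "ignore")

# Capture every href that points at a news.sina.com.cn page.
pat = 'href="(http://news.sina.com.cn/.*?)"'
allurl = re.compile(pat).findall(data2)

for i in range(0, len(allurl)):
    try:
        print("Crawl #" + str(i))
        thisurl = allurl[i]
        # Output directory from the original post; it must already exist.
        file = "G:/BaiduDownload/python网络爬虫/sinanews/" + str(i) + ".html"
        request.urlretrieve(thisurl, file)
        print("Success")
    except urllib.error.URLError as e:
        # HTTPError carries a status code; a plain URLError only has a reason.
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)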
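One thing the program relies on is that the output directory already exists: `urlretrieve` opens the target file for writing, and a missing directory raises `FileNotFoundError`, which the `URLError` handler does not catch. A minimal sketch of one way to guard against that is shown below. The helper name `build_output_path` is hypothetical, and a temporary directory stands in for the G: path so the example runs anywhere.

```python
import os
import tempfile


def build_output_path(out_dir, index):
    """Create out_dir if needed and return the path for page number `index`."""
    os.makedirs(out_dir, exist_ok=True)  # no error if it already exists
    return os.path.join(out_dir, str(index) + ".html")


# Temporary directory used purely for demonstration.
demo_dir = os.path.join(tempfile.mkdtemp(), "sinanews")
path = build_output_path(demo_dir, 0)
print(path)
```

In the loop, the result of such a helper could be passed as the second argument to `request.urlretrieve` in place of the hand-built string.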

Reposted from blog.csdn.net/weixin_41167340/article/details/79779154
