Python Web Crawling and Information Extraction (2): The Robots Protocol

Category: Other · 2020-04-22 10:39:28

Preface

The previous section covered the requests library; this section covers the robots protocol.

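Ways to restrict crawlers

Source review: the site inspects where each request comes from (for example, its User-Agent header) and rejects crawlers. This requires the site's maintainer to have some programming ability.
Publishing a robots protocol: the site declares its crawling rules in a robots.txt file, normally placed in the site's root directory.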
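To make the source-review approach concrete, here is a minimal server-side sketch. It assumes the check is done on the User-Agent header. The keyword list, the function name `is_request_allowed`, and the decision to reject requests with no User-Agent are all illustrative assumptions, not something the original post specifies.

```python
# Hypothetical blocklist of User-Agent keywords (an assumption for illustration).
# "python-requests" is the prefix the requests library sends by default.
BLOCKED_AGENT_KEYWORDS = ("etaospider", "python-requests")


def is_request_allowed(headers):
    """Return True if the request's User-Agent matches no blocked keyword."""
    user_agent = headers.get("User-Agent", "").lower()
    if not user_agent:
        # Assumption: a request that sends no User-Agent at all is rejected.
        return False
    return not any(keyword in user_agent for keyword in BLOCKED_AGENT_KEYWORDS)


if __name__ == "__main__":
    print(is_request_allowed({"User-Agent": "Mozilla/5.0"}))
    print(is_request_allowed({"User-Agent": "python-requests/2.31.0"}))
```

A check like this is easy to bypass, since a client can send any User-Agent it likes, which is one reason sites also publish a robots protocol.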

JD.com's robots protocol

JD.com robots link

User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
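Rules like these can be checked programmatically. The sketch below uses Python's standard `urllib.robotparser` on an abridged copy of the listing, so it runs without network access. The crawler name `MyCrawler`, the helper `may_crawl`, and the URL `https://www.jd.com/` are assumptions for illustration.

```python
import urllib.robotparser

# Abridged copy of the rules listed above (only a few groups are kept for brevity).
JD_ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /?*",
    "User-agent: EtaoSpider",
    "Disallow: /",
    "User-agent: HuihuiSpider",
    "Disallow: /",
]

parser = urllib.robotparser.RobotFileParser()
parser.parse(JD_ROBOTS_LINES)  # parse in-memory lines instead of fetching over HTTP


def may_crawl(user_agent, url):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # EtaoSpider has its own group that disallows the whole site.
    print("EtaoSpider:", may_crawl("EtaoSpider", "https://www.jd.com/"))
    # MyCrawler (hypothetical) falls under the "User-agent: *" group.
    print("MyCrawler:", may_crawl("MyCrawler", "https://www.jd.com/"))
```

One caveat: the standard-library parser matches rule paths as plain prefixes and does not expand `*` inside a path, so wildcard rules such as `/?*` are not interpreted the way a wildcard-aware parser would interpret them.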

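Other robots protocols

Baidu robots protocol
Sina robots protocol
QQ robots protocol
QQ News robots protocol
The Ministry of Education site has no robots protocol
If a site provides no robots protocol, all of its content may be crawled freely.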
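The "no rules means no restrictions" default can be seen in Python's standard `urllib.robotparser`. The sketch below simulates a site without any rules by parsing an empty list. The crawler name and URL are placeholders chosen for illustration.

```python
import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.parse([])  # simulate a site that publishes no rules at all

# With no matching rule, the parser falls back to allowing the request.
allowed = parser.can_fetch("MyCrawler", "https://example.com/any/page.html")
print(allowed)
```

When the file is fetched over HTTP with `RobotFileParser.read()`, a missing robots.txt (a 404 response) is, as far as I know, likewise treated as "allow everything", while 401 and 403 responses are treated as "disallow everything".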

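Using the robots protocol

Any crawler should locate and interpret the site's robots.txt automatically before it starts crawling.
Ignoring the robots protocol may carry legal risk.
Low-frequency access, comparable to a human visitor, is generally tolerated, but the data collected should not be used commercially.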
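The points above can be combined into one small sketch: check the rules first, then keep the request rate low. The class name `PoliteCrawler`, its parameters, and the injected `fetch` callable are assumptions for illustration. In real use `fetch` might wrap `requests.get`, and the rules would come from the site's own robots.txt.

```python
import time
import urllib.robotparser


class PoliteCrawler:
    """Checks robots.txt rules and spaces requests out before fetching."""

    def __init__(self, robots_lines, user_agent, fetch, min_interval=1.0):
        self.parser = urllib.robotparser.RobotFileParser()
        self.parser.parse(robots_lines)
        self.user_agent = user_agent
        self.fetch = fetch                # callable that takes a URL and returns its body
        self.min_interval = min_interval  # minimum number of seconds between requests
        self._last_request = None

    def get(self, url):
        """Fetch url if the rules allow it; return None when it is disallowed."""
        if not self.parser.can_fetch(self.user_agent, url):
            return None
        if self._last_request is not None:
            wait = self.min_interval - (time.monotonic() - self._last_request)
            if wait > 0:
                time.sleep(wait)  # throttle so the access frequency stays low
        self._last_request = time.monotonic()
        return self.fetch(url)


if __name__ == "__main__":
    rules = ["User-agent: *", "Disallow: /private/"]
    # Stand-in fetch function so the example runs without network access.
    crawler = PoliteCrawler(rules, "MyCrawler", fetch=lambda url: "body of " + url)
    print(crawler.get("https://example.com/private/data.html"))
    print(crawler.get("https://example.com/index.html"))
```

Passing `fetch` in as a parameter keeps the rule check and the throttling separate from the HTTP library, so either can be replaced without touching the other.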

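Summary

Notes on the robots protocol:
User-agent: * means the rules that follow apply to all crawlers
Disallow: / means no directory on the site may be crawled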

Author: 只会git clone的程序员


Reposted from blog.csdn.net/qq_37668436/article/details/105557588
