Python Web Crawler Notes 2: Requests Library in Practice

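This series of Python web crawler notes records what the author learned from instructor Song Tian's course "Python Web Crawling and Information Extraction" (《Python网络爬虫与信息提取》) and from the author's own crawling practice.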

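Course link: Python Web Crawling and Information Extraction
Reference documentation:
Requests official documentation (English)
Requests official documentation (Chinese)
Beautiful Soup official documentation
re official documentation
Scrapy official documentation (English)
Scrapy official documentation (Chinese)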

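1. Robots Protocol

Purpose: a website uses it to tell web crawlers which pages may be crawled and which may not.

Form: a robots.txt file in the website's root directory.

Basic syntax of the Robots protocol:

User-agent: the crawler the rules apply to; * means all crawlers
Disallow: a path that must not be crawled; / means the root directory, i.e. the whole site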

# JD's Robots protocol: https://www.jd.com/robots.txt

User-agent: * 
Disallow: /?* 
Disallow: /pop/*.html 
Disallow: /pinpai/*.html?* 
User-agent: EtaoSpider 
Disallow: / 
User-agent: HuihuiSpider 
Disallow: / 
User-agent: GwdangSpider 
Disallow: / 
User-agent: WochachaSpider 
Disallow: /
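
The rules above can also be checked programmatically before crawling. The following is a minimal sketch using the standard library's urllib.robotparser, not part of the original notes. It parses a simplified rule set held in memory: the EtaoSpider block comes from the listing above, while the /private/ rule and the crawler name MyCrawler are hypothetical, chosen because urllib.robotparser matches plain path prefixes and does not expand * wildcards inside paths.

```python
from urllib.robotparser import RobotFileParser

# Simplified rules for illustration. The EtaoSpider block is taken from the
# JD listing; the /private/ rule is a hypothetical stand-in, because
# urllib.robotparser matches path prefixes and does not expand * wildcards
# inside paths.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: EtaoSpider
Disallow: /
"""


def build_parser(robots_text):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# 'MyCrawler' is a hypothetical crawler name that falls under "User-agent: *".
print(parser.can_fetch('MyCrawler', 'https://www.jd.com/index.html'))
print(parser.can_fetch('MyCrawler', 'https://www.jd.com/private/a.html'))
# EtaoSpider is barred from the whole site by its own "Disallow: /" rule.
print(parser.can_fetch('EtaoSpider', 'https://www.jd.com/index.html'))
```

To check a live site instead, RobotFileParser also offers set_url() and read(), which download the robots.txt file before can_fetch() is called.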

2. Crawling a JD Product Page

Visit the JD website and get the URL of the product to crawl
Use the general-purpose page-fetching framework

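import requests

def get_html_text(url):
    """
    General-purpose framework for fetching a web page
    """
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # use the encoding guessed from the content
        return r.text
    except requests.RequestException:
        return 'An exception occurred'


def jd_goods():
    """
    Fetch a product page on JD, using the Huawei Mate 20 as an example
    """
    url = 'https://item.jd.com/100000822981.html'
    print(get_html_text(url))


if __name__ == '__main__':
    jd_goods()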

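3. Crawling an Amazon Product Page

Visit the Amazon website and get the URL of the product to crawl
Amazon rejects requests that do not appear to come from a browser, so set the User-Agent field in the HTTP request headers to make the request look like a browser's
Adapt the general-purpose page-fetching framework accordingly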

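import requests

def amazon_goods():
    """
    Fetch a product page on Amazon, using the Kindle as an example
    """
    url = 'https://www.amazon.cn/gp/product/B07746N2J9'
    try:
        hd = {'user-agent': 'Mozilla/5.0'}  # present the request as coming from a browser
        r = requests.get(url, headers=hd, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text[0:1000])  # print only the first 1000 characters
    except requests.RequestException:
        print('Crawl failed')


if __name__ == '__main__':
    amazon_goods()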
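
A site can tell a script from a browser because HTTP clients announce themselves in the User-Agent header; requests, for example, sends python-requests/<version> by default. The sketch below illustrates the same idea offline with the standard library's urllib.request instead of requests (a substitution made so the example runs without any third-party package or network access): it reads the default User-Agent that urllib would send and shows the overridden value on a request object.

```python
import urllib.request

url = 'https://www.amazon.cn/gp/product/B07746N2J9'

# Default headers that a plain urllib opener attaches to every request.
default_headers = dict(urllib.request.build_opener().addheaders)
default_ua = default_headers['User-agent']
print(default_ua)  # a value starting with "Python-urllib/"

# Override the header on the request object; nothing is sent over the network.
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
# urllib stores header names capitalized as "User-agent".
print(req.get_header('User-agent'))
```

With requests, passing headers=hd to requests.get(), as in the listing above, has the same effect of replacing the library's default User-Agent value.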

4. Submitting Search Keywords to Baidu/360

Baidu's keyword interface: http://www.baidu.com/s?wd=keyword
360's keyword interface: http://www.so.com/s?q=keyword

import requests

def baidu_search():
    """
    Submit a keyword query to the Baidu search engine
    """
    url = 'http://www.baidu.com/s'
    try:
        kv = {'wd': 'python'}  # sent as the query string ?wd=python
        r = requests.get(url, params=kv, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text)
    except requests.RequestException:
        print('Crawl failed')

if __name__ == '__main__':
    baidu_search()
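
To make the keyword interfaces concrete, the sketch below shows how a dictionary of parameters turns into the query string of the final URL. It uses the standard library's urllib.parse.urlencode so that it runs offline; requests performs this kind of encoding itself when given params. The helper name build_search_url is hypothetical and exists only for this illustration.

```python
from urllib.parse import urlencode


def build_search_url(base_url, params):
    """Append URL-encoded query parameters to base_url."""
    return base_url + '?' + urlencode(params)


# Baidu expects the keyword under "wd", 360 under "q".
print(build_search_url('http://www.baidu.com/s', {'wd': 'python'}))
print(build_search_url('http://www.so.com/s', {'q': 'python'}))
# Characters such as spaces are escaped so the URL stays valid.
print(build_search_url('http://www.baidu.com/s', {'wd': 'web crawler'}))
```

After a real request made with requests, the URL that was actually sent can be inspected through r.request.url.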

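5. Downloading and Saving Images from the Web

Get the URL of the image
Set the path where the image will be saved
Download the image and save it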

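import requests
import os

def download_image():
    """
    Download an image, using the Baidu logo as an example
    """
    url = 'https://www.baidu.com/img/bd_logo1.png'
    root = 'E:/pics/'
    path = root + url.split('/')[-1]  # name the file after the last URL segment
    try:
        if not os.path.exists(root):
            os.makedirs(root)
        if not os.path.exists(path):
            r = requests.get(url, timeout=30)
            r.raise_for_status()  # do not save an error page as an image
            with open(path, 'wb') as f:  # the with block closes the file
                f.write(r.content)
            print('Image saved')
        else:
            print('Image already exists')
    except (requests.RequestException, OSError):
        print('Download failed')

if __name__ == '__main__':
    download_image()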
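
The file-handling part of the three steps above can be tried on its own, without downloading anything. The following sketch separates it into two hypothetical helpers, local_path_for and save_bytes, and writes placeholder bytes into a temporary directory instead of E:/pics/; in real use the bytes would come from r.content.

```python
import os
import tempfile


def local_path_for(url, root):
    """Name the local file after the last segment of the URL."""
    return os.path.join(root, url.split('/')[-1])


def save_bytes(path, data):
    """Write data to path unless the file already exists; return True if written."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if os.path.exists(path):
        return False
    with open(path, 'wb') as f:
        f.write(data)
    return True


# Demonstration with placeholder bytes in a temporary directory, so nothing
# is downloaded and no real image data is involved.
demo_root = os.path.join(tempfile.mkdtemp(), 'pics')
demo_path = local_path_for('https://www.baidu.com/img/bd_logo1.png', demo_root)
print(os.path.basename(demo_path))
print(save_bytes(demo_path, b'placeholder'))  # first call writes the file
print(save_bytes(demo_path, b'placeholder'))  # second call skips the existing file
```

Using os.makedirs with exist_ok=True creates any missing parent directories and avoids a separate existence check for the folder.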

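6. Looking Up the Location of an IP Address

ip138 IP address location lookup interface: http://m.ip138.com/ip.asp?ip=ipaddress
To find the IP address of the Baidu website, enter the following in cmd: nslookup www.baidu.com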

import requests

def ip_attribution():
    """
    Look up the location of an IP address, using the ip138 interface to query one of Baidu's IPs
    """
    url = 'http://m.ip138.com/ip.asp?ip='
    ip = '14.215.177.39'
    try:
        r = requests.get(url + ip, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text)
    except requests.RequestException:
        print('Lookup failed')

if __name__ == '__main__':
    ip_attribution()
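
The nslookup step can also be done from Python, and the address can be validated before it is appended to the query URL. This is a minimal sketch using only the standard library; the helper names resolve_host and build_query_url are hypothetical, and the lookup itself is not executed here.

```python
import ipaddress
import socket


def resolve_host(hostname):
    """Return one IPv4 address for hostname, similar in spirit to nslookup."""
    return socket.gethostbyname(hostname)


def build_query_url(ip):
    """Validate ip and append it to the query interface URL from this section."""
    ipaddress.ip_address(ip)  # raises ValueError for a malformed address
    return 'http://m.ip138.com/ip.asp?ip=' + ip


print(build_query_url('14.215.177.39'))
```

Calling resolve_host('www.baidu.com') needs a working DNS connection, and the address it returns can differ between networks and over time.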
