Web Crawling and Information Extraction

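0x00 The Requests Library

Main methods of the Requests library

Method	Description
requests.request()	Builds and sends a request (the base method behind all the others)
requests.get()	Fetches an HTML page (HTTP GET)
requests.head()	Fetches the headers of a page (HTTP HEAD)
requests.post()	Submits a POST request to a page
requests.put()	Submits a PUT request to a page
requests.patch()	Submits a partial-modification (PATCH) request to a page
requests.delete()	Submits a DELETE request to a page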

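Attributes of the Response object

Attribute	Description
r.status_code	Status code returned for the HTTP request (200 means success)
r.text	Response body as a string, decoded using r.encoding
r.encoding	Encoding of the response body, guessed from the HTTP headers
r.apparent_encoding	Encoding of the response body, inferred from the content itself
r.content	Response body in binary (bytes) form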
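
The gap between r.encoding and r.apparent_encoding matters because r.text is simply r.content decoded with r.encoding. The following is a minimal sketch that uses only the standard library and no network access. It assumes a page whose bytes are UTF-8 while the header-based guess is ISO-8859-1 (the fallback Requests applies to text responses that declare no charset). The sample string is made up for illustration.

```python
# Hypothetical response body: UTF-8 bytes, as r.content would hold them.
content = "网络爬虫".encode("utf-8")

# Decoding with the wrong codec, which is what r.text does if r.encoding is wrong.
garbled = content.decode("ISO-8859-1")

# Decoding with the codec that matches the content itself.
readable = content.decode("utf-8")

print(repr(garbled))
print(readable)
```

This is why the examples in this post set r.encoding = r.apparent_encoding before reading r.text.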

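Exceptions in the Requests library

Exception	Description
requests.ConnectionError	Network connection error
requests.HTTPError	HTTP error
requests.URLRequired	A URL is missing
requests.TooManyRedirects	Too many redirects
requests.ConnectTimeout	Timed out while connecting to the remote server
requests.Timeout	The request for the URL timed out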

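A general framework for fetching a page

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:  # base class of all Requests exceptions
        return "Something went wrong!"

if __name__ == "__main__":
    url = input("Enter the URL to visit: ")
    print(getHTMLText(url))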

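0x01 The Robots Protocol

Robots Exclusion Standard: the standard by which websites exclude web crawlers

Purpose

A website tells web crawlers which pages may be crawled and which may not.

Form

A robots.txt file in the root directory of the website.

Basic syntax

User-agent: *
Disallow: /

How to comply

Web crawlers

Identify robots.txt automatically or manually, then crawl the content.

Binding force

The Robots protocol is advisory rather than binding. A crawler may ignore it, but doing so carries legal risk.
Access that resembles human browsing behavior may skip consulting the Robots protocol.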
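
Checking robots.txt automatically can be sketched with the standard library's urllib.robotparser. The rules, the crawler name "my-crawler", and the example.com URLs below are all made up for illustration. The rules are passed in as lines, so nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied as lines so no network access is needed.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(useragent, url) reports whether the rules allow the URL.
allowed = parser.can_fetch("my-crawler", "https://example.com/index.html")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data.html")
print(allowed, blocked)
```

For a live site, a crawler would typically call set_url() with the site's robots.txt address and then read() before asking can_fetch().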

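0x02 Requests Library Examples

1. Fetching a JD.com product page

import requests

url = "https://item.jd.com/100006536488.html"
try:
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])  # first 1000 characters only
except requests.RequestException:
    print("Crawl failed!")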

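2. Fetching an Amazon product page

import requests

url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
try:
    # Send a browser-like User-Agent instead of the default python-requests one
    kv = {'user-agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=kv, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text)
except requests.RequestException:
    print("Crawl failed!")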

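3. Submitting a keyword to 360 Search

import requests

keyword = 'python'
try:
    kv = {'q': keyword}  # sent as the query string ?q=python
    r = requests.get("https://www.so.com/s", params=kv, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text)
except requests.RequestException:
    print("Crawl failed!")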

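4. Downloading and saving an image

import requests
import os

url = "http://image.nationalgeographic.com.cn/2017/0211/20170211061910157.jpg"
root = "E://911208//网易云音乐//image//"
path = root + url.split('/')[-1]  # name the file after the last URL segment
try:
    if not os.path.exists(root):
        os.makedirs(root)  # also creates missing parent directories
    if not os.path.exists(path):
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        with open(path, 'wb') as f:  # the with block closes the file
            f.write(r.content)
        print("File saved!")
    else:
        print("File already exists!")
except (requests.RequestException, OSError):
    print("Crawl failed")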

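5. Looking up the location of an IP address

import requests

url = "http://m.ip138.com/ip.asp?ip="
try:
    r = requests.get(url + '202.204.80.112', timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])  # last 500 characters only
except requests.RequestException:
    print("Crawl failed!")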

0x03 The Beautiful Soup Library

Parsers supported by Beautiful Soup

Parser	Usage	Requirement
HTML parser used by bs4 (Python's built-in html.parser)	BeautifulSoup(mk, 'html.parser')	install the bs4 library
lxml's HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml's XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib's parser	BeautifulSoup(mk, 'html5lib')	pip install html5lib

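Basic elements of the Beautiful Soup classes

Element	Description
Tag	A tag
Name	Name of the tag
Attributes	Attributes of the tag
NavigableString	Non-attribute string inside the tag
Comment	Comment portion of the string inside the tag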

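Traversing down the tag tree

Attribute	Description
.contents	List of all direct children of the tag
.children	Like .contents, but an iterator for looping over the direct children
.descendants	Iterator over all descendant nodes, for looping over them

Traversing up the tag tree

Attribute	Description
.parent	Parent tag of the node
.parents	Iterator over the node's ancestor tags, for looping over them

Traversing sideways in the tag tree

Attribute	Description
.next_sibling	Next sibling node
.previous_sibling	Previous sibling node
.next_siblings	Iterator over all following sibling nodes
.previous_siblings	Iterator over all preceding sibling nodes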

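General approaches to information extraction

Parse the markup completely, then extract the key information
Ignore the markup and search for the key information directly
Combine structural parsing with searching to extract the key information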

0x04 Beautiful Soup Example

Chinese university rankings

import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # page failed to load or contains no table
        return
    for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):  # skip the strings between rows
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[2].string])

def printUnivList(ulist, num):
    # chr(12288) is the full-width space; it pads Chinese names to an even width
    tplt = '{0:^10}\t{1:{3}^10}\t{2:^10}'
    print(tplt.format('Rank', 'University', 'Location', chr(12288)))
    for u in ulist[:num]:  # slicing avoids IndexError when fewer than num rows exist
        print(tplt.format(u[0], u[1], u[2], chr(12288)))

def main():
    uinfo = []
    url = "http://zuihaodaxue.cn/zuihaodaxuepaiming2019.html"
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 300)

main()
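
The format template in printUnivList relies on a nested field: in '{1:{3}^10}', argument 3 supplies the fill character for argument 1. Below is a small standalone sketch of that behavior. The sample row is made up for illustration and is not taken from the ranking page.

```python
# Same template as printUnivList: field 1 is centered in 10 columns and
# padded with the character passed as argument 3.
tplt = '{0:^10}\t{1:{3}^10}\t{2:^10}'

fullwidth_space = chr(12288)  # U+3000, as wide as one Chinese character
row = tplt.format('1', '清华大学', '北京', fullwidth_space)

rank, name, place = row.split('\t')
print(row)
print(len(name), name.count(fullwidth_space))
```

Padding Chinese text with the full-width space keeps the columns lined up, because an ASCII space is narrower than a Chinese character in most fonts.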

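0x05 Regular Expressions

A regular expression is an expression that concisely describes a set of strings.

A general framework for expressing strings
An expression that concisely describes a set of strings
A tool for expressing the concise, characteristic features of strings
A way to decide whether a string has a given set of features

Text processing:

Describing the features of a type of text
Finding or replacing a set of strings at once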

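Common regular expression operators

Operator	Description	Example
.	Any single character (except a newline, by default)
[ ]	Character set: the allowed values for a single character	[abc] means a, b, or c; [a-z] means a single character from a to z
[^ ]	Negated character set: the excluded values for a single character	[^abc] means a single character other than a, b, or c
*	Zero or more repetitions of the preceding character	abc* means ab, abc, abcc, abccc, and so on
+	One or more repetitions of the preceding character	abc+ means abc, abcc, abccc, and so on
?	Zero or one repetition of the preceding character	abc? means ab or abc
|	Either the left or the right expression	abc|def means abc or def
{m}	Exactly m repetitions of the preceding character	ab{2}c means abbc
{m,n}	Between m and n repetitions (inclusive) of the preceding character	ab{1,2}c means abc or abbc
^	Matches the start of the string	^abc means abc at the start of a string
$	Matches the end of the string	abc$ means abc at the end of a string
( )	Grouping; the | operator can be used inside	(abc|def) means abc or def
\d	A digit, equivalent to [0-9] for ASCII text
\w	A word character, equivalent to [A-Za-z0-9_] for ASCII text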
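
The examples in the table can be checked mechanically. The sketch below uses re.fullmatch, which succeeds only when the entire string matches, to confirm several of the rows. The list name cases and the chosen counter-examples are illustrative.

```python
import re

# Each pair is (pattern, strings the table says it should match in full).
cases = [
    (r'abc*', ['ab', 'abc', 'abcc']),
    (r'abc+', ['abc', 'abcc']),
    (r'abc?', ['ab', 'abc']),
    (r'abc|def', ['abc', 'def']),
    (r'ab{2}c', ['abbc']),
    (r'ab{1,2}c', ['abc', 'abbc']),
]

for pattern, samples in cases:
    for s in samples:
        # fullmatch succeeds only if the whole string matches the pattern.
        assert re.fullmatch(pattern, s), (pattern, s)

# Counter-examples: strings outside the described sets return None.
rejected = [re.fullmatch(r'abc+', 'ab'), re.fullmatch(r'ab{2}c', 'abc')]
print(rejected)
```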

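Classic regular expression examples

^[A-Za-z]+$				// a string made up only of the 26 letters
^[A-Za-z0-9]+$			// a string made up only of letters and digits
^-?\d+$ 		    	// a string in integer form
^[0-9]*[1-9][0-9]*$	     // a string in positive-integer form
[1-9]\d{5}			    // a postal code in mainland China, 6 digits
[\u4e00-\u9fa5]			// matches Chinese characters
\d{3}-\d{8}|\d{4}-\d{7}  // a phone number in mainland China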
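
A short sketch of how some of these patterns behave in Python's re module. The phone numbers and the mixed-language sample string are made up for illustration.

```python
import re

postcode = re.compile(r'[1-9]\d{5}')
phone = re.compile(r'\d{3}-\d{8}|\d{4}-\d{7}')
integer = re.compile(r'^-?\d+$')
chinese = re.compile(r'[\u4e00-\u9fa5]')

code = postcode.search('BIT 100081').group(0)
phones = phone.findall('010-12345678, 0755-1234567')
is_int = (bool(integer.match('-42')), bool(integer.match('4.2')))
hanzi = chinese.findall('Python爬虫')

print(code, phones, is_int, hanzi, sep='\n')
```

Note that the postal code and phone patterns carry no ^ or $ anchors, so they also match inside longer strings.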

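0x06 The Re Library

Re is part of Python's standard library and is mainly used for string matching. It matches greedily by default, that is, it returns the longest matching substring.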


For a minimal match, append '?' to the operator (for example *?, +?, ??, {m,n}?) so that the shortest match is returned.
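
A minimal sketch contrasting the two modes on the same input. The sample text is arbitrary.

```python
import re

text = 'PYANBNCNDN'

greedy = re.match(r'PY.*N', text).group(0)    # longest possible match
minimal = re.match(r'PY.*?N', text).group(0)  # shortest possible match

print(greedy)
print(minimal)
```

The greedy pattern runs to the last N in the string, while the minimal one stops at the first N it can reach.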

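Ways to write a regular expression

raw string type (backslashes are kept as written, e.g. r'\d')
string type (each backslash must itself be escaped with '\', e.g. '\\d')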
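
The two notations describe the same pattern. A small sketch showing that they compare equal and match identically:

```python
import re

raw_pattern = r'[1-9]\d{5}'     # raw string: the backslash is kept as written
plain_pattern = '[1-9]\\d{5}'   # ordinary string: the backslash must be doubled

same = raw_pattern == plain_pattern
match = re.search(plain_pattern, 'BIT 100081').group(0)
print(same, match)
```

Raw strings are usually the more readable choice once a pattern contains several backslashes.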

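Two ways to use the Re library

Functional usage: a one-off operation

rst = re.search(r'[1-9]\d{5}', 'BIT 100081')

Object-oriented usage: compile once, then reuse for many operations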

pat = re.compile(r'[1-9]\d{5}')
rst = pat.search('BIT 100081')

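Main functions of the Re library

Function	Description
re.search()	Searches a string for the first position where the regular expression matches; returns a match object
re.match()	Matches the regular expression from the start of a string; returns a match object
re.findall()	Searches a string and returns all matching substrings as a list
re.split()	Splits a string at each match of the regular expression; returns a list
re.finditer()	Searches a string and returns an iterator of results, each element being a match object
re.sub()	Replaces every substring that matches the regular expression; returns the resulting string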
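
The six functions can be compared side by side on one input. This is a small sketch; the sample text and the replacement string ':zipcode' are arbitrary.

```python
import re

pattern = r'[1-9]\d{5}'
text = 'BIT100081 TSU100084'

first = re.search(pattern, text).group(0)
at_start = re.match(pattern, text)  # None: the text does not start with a match
all_codes = re.findall(pattern, text)
pieces = re.split(pattern, text)
spans = [m.span() for m in re.finditer(pattern, text)]
masked = re.sub(pattern, ':zipcode', text)

print(first, at_start, all_codes, pieces, spans, masked, sep='\n')
```

re.search() and re.match() return None when nothing matches, so real code should check the result before calling .group().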

0x07 Re Library Examples

1. Taobao price-comparison crawler

import requests
import re

def getHTMLText(url):
    try:
        header = {
            'User-Agent': 'Mozilla/5.0',
            'cookie': 'thw=cn; isg=BMHBPkhhe1tnBZdn4oW3NuPq0w3b7jXgVDwiVyMWnkgnCuHcaz6vsM8I6P5MGc0Y; cna=ERS7FiBJnl0CAXWWtw3DaILO; t=d20c989f94e109d35897baef41022d70; l=cBgQ3TRRQYql2zNCBOCNVuI8as79OIOYYuPRwN0Xi_5Nt6L1f97Oo5xcQFp6csWd9g8B4zotpap9-etkiDt6Qt--g3fP.; uc3=id2=VyyVyIaXGLNIDQ%3D%3D&nk2=F5RDKXHyOAaXshI%3D&lg2=W5iHLLyFOGW7aA%3D%3D&vt3=F8dBxdzwHz9KDFZYuFA%3D; lgc=tb654825781; uc4=nk4=0%40FY4I6gVSHObwqYGig5lEYrmSBfzs0A%3D%3D&id4=0%40VXtaRp7R7jWArSR8hervDYzL5XaI; tracknick=tb654825781; _cc_=UIHiLt3xSw%3D%3D; tg=0; mt=ci=7_1; enc=rZfi7zAhVsMoQi2YeTWL7H7gE%2BxulQNht0LC6l6eJwZuKa345jaCDE64p7WhiEQQsOVT7LZl4S4CCMxMEWeAMw%3D%3D; hng=CN%7Czh-CN%7CCNY%7C156; cookie2=157366bac664d23b6a09667d9ec41da5; v=0; _tb_token_=53b53bb70a5e7; uc1=cookie14=UoTUOLPGs7ssvw%3D%3D; JSESSIONID=4E2E3B16DCC8C679A015D932188B50BD; alitrackid=www.taobao.com; lastalitrackid=www.taobao.com'}
        r = requests.get(url, headers=header, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def parsPage(ilt, html):
    # Capture the values directly instead of calling eval() on scraped text
    plt = re.findall(r'"view_price":"([\d.]*)"', html)
    tlt = re.findall(r'"raw_title":"(.*?)"', html)
    for price, title in zip(plt, tlt):
        ilt.append([price, title])

def printGoodList(ilt):
    tplt = '{:4}\t{:8}\t{:16}'
    print(tplt.format('No.', 'Price', 'Product name'))
    for count, item in enumerate(ilt, start=1):
        print(tplt.format(count, item[0], item[1]))

def main():
    goods = '书包'  # search keyword ("backpack")
    start_url = 'https://s.taobao.com/search?q=' + goods
    depth = 2  # number of result pages to fetch
    infolist = []
    for i in range(depth):
        url = start_url + '&s=' + str(44*i)  # each page starts 44 items further on
        html = getHTMLText(url)  # returns "" on failure, which yields no matches
        parsPage(infolist, html)
    printGoodList(infolist)

main()
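
The paging scheme in main() (one URL per result page, with the s parameter advancing by 44) can be isolated into a helper. This is a sketch only: build_page_urls is a hypothetical name, and it builds the query string with urllib.parse.urlencode, which also percent-encodes the Chinese keyword, instead of concatenating strings as the example does.

```python
from urllib.parse import urlencode

def build_page_urls(keyword, depth, page_size=44):
    """Return one search URL per result page (hypothetical helper)."""
    base = 'https://s.taobao.com/search'
    urls = []
    for page in range(depth):
        # q is the keyword, s is the offset of the first item on the page.
        query = urlencode({'q': keyword, 's': page_size * page})
        urls.append(base + '?' + query)
    return urls

for url in build_page_urls('书包', 2):
    print(url)
```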

2. Stock data crawler

import requests
from bs4 import BeautifulSoup
import traceback
import re

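def getHTMLText(url, code="utf-8"):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = code  # set the encoding directly; skips the slower apparent_encoding detection
        return r.text
    except requests.RequestException:
        return ""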

def getStockList(lst, stockURL):
    html = getHTMLText(stockURL, "GB2312")
    soup = BeautifulSoup(html, 'html.parser') 
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except (KeyError, IndexError):  # no href attribute, or no stock code in it
            continue

def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        count = count + 1
        if html:
            try:
                infoDict = {}
                soup = BeautifulSoup(html, 'html.parser')
                stockInfo = soup.find('div', attrs={'class': 'stock-bets'})

                name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
                infoDict.update({'股票名称': name.text.split()[0]})  # key means "stock name"

                keyList = stockInfo.find_all('dt')
                valueList = stockInfo.find_all('dd')
                for i in range(len(keyList)):
                    infoDict[keyList[i].text] = valueList[i].text

                with open(fpath, 'a', encoding='utf-8') as f:
                    f.write(str(infoDict) + '\n')
            except (AttributeError, IndexError):
                pass  # page layout not as expected; skip this stock
        # Update the progress for every stock, including empty or failed pages
        print("\rProgress: {:.2f}%".format(count*100/len(lst)), end="")

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'E:/BaiduStockInfo.txt'
    slist=[]
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()
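
The extraction step in getStockList depends on one pattern: "s", then "h" or "z", then six digits. A minimal offline sketch of that step follows. The href values are made up for illustration, and the emptiness check replaces the try/except used in the example.

```python
import re

# Hypothetical href values of the kind found on a stock list page.
hrefs = [
    'http://quote.eastmoney.com/sh600000.html',
    'http://quote.eastmoney.com/sz000001.html',
    'http://quote.eastmoney.com/center/index.html',
]

codes = []
for href in hrefs:
    found = re.findall(r's[hz]\d{6}', href)
    if found:  # skip links that carry no stock code
        codes.append(found[0])

print(codes)
```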

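0x08 The Scrapy Framework

Common Scrapy commands

Command	Description
scrapy startproject	Creates a new project
scrapy genspider	Creates a spider
scrapy settings	Shows the crawler's configuration
scrapy crawl	Runs a spider
scrapy list	Lists all spiders in the project
scrapy shell	Starts an interactive shell for debugging a URL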

Steps for using Scrapy

Create a project and a Spider template
Write the Spider
Write the Item Pipeline
Tune the configuration

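Data types in Scrapy

The Request class

A Request object represents an HTTP request. It is generated by the Spider and executed by the Downloader.

Attribute or method	Description
.url	URL of the request
.method	HTTP method of the request
.headers	Request headers, in a dictionary-like form
.body	Body of the request, as bytes
.meta	Extra information added by the user, used to pass data between Scrapy's internal modules
.copy()	Copies the request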

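The Response class

A Response object represents an HTTP response. It is generated by the Downloader and processed by the Spider.

Attribute or method	Description
.url	URL of the response
.status	HTTP status code, 200 by default
.headers	Headers of the response
.body	Content of the response, as bytes
.flags	A set of flags
.request	The Request object that produced this Response
.copy()	Copies the response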

The Item class

An Item object represents information extracted from an HTML page. It is generated by the Spider and processed by the Item Pipeline.

0x09 Scrapy Example: Stock Data

stocks.py

# -*- coding: utf-8 -*-
import scrapy
import re


class StocksSpider(scrapy.Spider):
    name = "stocks"
    start_urls = ['http://quote.eastmoney.com/stocklist.html']

    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.findall(r"[s][hz]\d{6}", href)[0]
                url = 'https://gupiao.baidu.com/stock/' + stock + '.html'
                yield scrapy.Request(url, callback=self.parse_stock)
            except IndexError:  # this link contains no stock code
                continue

    def parse_stock(self, response):
        infoDict = {}
        stockInfo = response.css('.stock-bets')
        name = stockInfo.css('.bets-name').extract()[0]
        keyList = stockInfo.css('dt').extract()
        valueList = stockInfo.css('dd').extract()
        for i in range(len(keyList)):
            key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
            try:
                val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
            except IndexError:  # no number in this <dd>
                val = '--'
            infoDict[key] = val

        # Raw strings keep the regex backslashes intact
        infoDict.update(
            {'股票名称': re.findall(r'\s.*\(', name)[0].split()[0] +
             re.findall(r'>.*<', name)[0][1:-1]})
        yield infoDict
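
The slicing in parse_stock is easier to follow on concrete input. The sketch below applies the same two patterns to made-up HTML fragments shaped like the strings that .extract() returns. The fragments and their values are assumptions for illustration.

```python
import re

# Hypothetical fragments, shaped like the strings returned by .extract().
dt_html = '<dt>今开</dt>'
dd_html = '<dd class="s-up">12.34</dd>'
dd_empty = '<dd>--</dd>'

key = re.findall(r'>.*</dt>', dt_html)[0][1:-5]        # drop '>' and '</dt>'
val = re.findall(r'\d+\.?.*</dd>', dd_html)[0][0:-5]   # drop '</dd>'
no_number = re.findall(r'\d+\.?.*</dd>', dd_empty)     # empty list: the spider falls back to '--'

print(key, val, no_number)
```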

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class BaidustocksPipeline(object):
    def process_item(self, item, spider):
        return item

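class BaidustocksInfoPipeline(object):
    def open_spider(self, spider):
        # utf-8 so that the Chinese field names are written correctly on any platform
        self.f = open('BaiduStockInfo.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except (TypeError, ValueError, OSError):
            pass  # skip items that cannot be converted or written
        return item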
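
The three hooks of the pipeline are called in a fixed order: open_spider once, process_item once per item, and close_spider once. The sketch below drives a pipeline of the same shape by hand, without Scrapy, to make that order visible. The class name InfoPipeline, the path argument, and the sample items are assumptions added for the demo.

```python
import os
import tempfile

class InfoPipeline:
    """Same three hooks as BaidustocksInfoPipeline; the path argument is added for the demo."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        self.f = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        self.f.write(str(dict(item)) + '\n')
        return item

# Drive the hooks by hand in the order Scrapy would: open, one call per item, close.
path = os.path.join(tempfile.mkdtemp(), 'BaiduStockInfo.txt')
pipeline = InfoPipeline(path)
pipeline.open_spider(spider=None)
for item in [{'股票名称': 'demo-a'}, {'股票名称': 'demo-b'}]:  # made-up items
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```

In a real project Scrapy makes these calls itself, and only for pipelines listed in ITEM_PIPELINES.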

settings.py

BOT_NAME = 'BaiduStocks'

SPIDER_MODULES = ['BaiduStocks.spiders']
NEWSPIDER_MODULE = 'BaiduStocks.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'BaiduStocks.pipelines.BaidustocksInfoPipeline': 300,
}

0xdawn
