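Web Crawler Series, Part 1

I. The requests library

1.1. The requests library
1.2. requests functions
1.3. The Response object
1.4 The Request object (the outgoing request)
1.4. A general-purpose framework
1.5. The HTTP protocol

II. Web crawlers

2.1. Crawler scale
2.2. Restrictions on crawlers
2.3. The Robots protocol (Robots Exclusion Standard)
2.4. How to comply with the Robots protocol
2.5. Exercises

III. Getting started with Beautiful Soup

3.1. Basic elements of the BeautifulSoup library

Basic elements of the BeautifulSoup class

3.2. Traversing HTML content with bs4

Downward traversal of the tag tree
Upward traversal of the tag tree
Sibling traversal of the tag tree

3.3. Formatted HTML output

IV. Marking up information

4.1. General methods for extracting information

V. Exercise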
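All of this is groundwork for data mining.
These notes follow the courseware of the free web crawler course taught by Song Tian (嵩天) of Beijing Institute of Technology on the China University MOOC site (中国大学MOOC). The course is easy to follow; if you are interested, look up his crawler course there. I will take these notes down on request if they infringe any rights.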

I. The requests library

1.1. The requests library

>>> import requests
>>> r = requests.get('http://www.baidu.com')
>>> r.status_code
200
>>> r.encoding='utf-8'
>>> type(r)
<class 'requests.models.Response'>
>>> r.text

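1.2. requests functions

requests.request(method, url, **kwargs)
- method: GET/PUT/POST/HEAD/PATCH/DELETE/OPTIONS
- Keyword arguments:
  - params: a dict or byte sequence, appended to the URL as query parameters
```
url?key1=value1&key2=value2
```
  - data: a dict, byte sequence, or file object, sent as the request body
  - json: JSON-serializable data, sent as the request body
  - headers: a dict of HTTP header fields
    {'user-agent': 'Chrome/10'}
  - cookies: a dict or CookieJar
  - auth: credentials for HTTP authentication
  - files: a dict of files to upload
```
fs = {'file': open('filename', 'rb')}
r = requests.request('POST', url, files=fs)
```
  - timeout: the timeout, in seconds
  - proxies: a dict that sets the proxy servers used for the request and may include login credentials. Routing through a proxy hides the crawler's source IP address and makes it harder to trace.
```
pxs = {'http': '', 'https': ''}
```

All the other methods are thin wrappers around requests.request.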
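
To see what params and headers actually do without touching the network, a request can be prepared but not sent. This is only a small sketch: the query parameters wd and pn and their values are made up for illustration, and nothing here contacts Baidu.

```python
import requests

# Build the request locally; nothing is sent over the network.
req = requests.Request(
    "GET",
    "http://www.baidu.com/s",
    params={"wd": "python", "pn": "10"},   # hypothetical query parameters
    headers={"user-agent": "Chrome/10"},
)
prepared = req.prepare()

print(prepared.method)                 # GET
print(prepared.url)                    # http://www.baidu.com/s?wd=python&pn=10
print(prepared.headers["user-agent"])  # Chrome/10
```

Preparing a request this way is a handy check that a dict of parameters ends up in the URL in the form you expect before any real traffic is generated.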

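requests.get(url, params=None, **kwargs) returns a Response object.
Internally it builds a Request object that is sent to the server.
head() fetches only the header information of a page, so it uses very little traffic.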

>>> r = requests.head('http://www.baidu.com')
>>> r.headers
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'Keep-Alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Wed, 27 Mar 2019 03:37:14 GMT', 'Last-Modified': 'Mon, 13 Jun 2016 02:50:04 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18'}
>>> r.text
''

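post() submits new data to the resource at the URL:

data = {'aaa': 'haha'}
r = requests.post(url, data=data)

put() submits data that overwrites the existing resource at the URL.
patch() submits a partial modification to the resource at the URL.
delete() submits a request to delete the resource at the URL.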
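
The difference between data= and json= is easiest to see by looking at the prepared request body. The sketch below assumes a placeholder URL (http://example.com/post) that is never contacted; only the local encoding step is shown.

```python
import json
import requests

url = "http://example.com/post"  # placeholder URL, never contacted

# data= with a dict is form-encoded into the request body.
form = requests.Request("POST", url, data={"aaa": "haha"}).prepare()
print(form.body)                      # aaa=haha
print(form.headers["Content-Type"])   # application/x-www-form-urlencoded

# json= serializes the same dict as JSON instead.
as_json = requests.Request("POST", url, json={"aaa": "haha"}).prepare()
print(as_json.headers["Content-Type"])  # application/json
print(json.loads(as_json.body))         # {'aaa': 'haha'}
```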

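1.3. The Response object

Attributes:

r.status_code: the HTTP status code of the response; 200 means success, 404 means the resource was not found
r.text: the response body as a string
r.encoding: the encoding of the response body as guessed from the HTTP headers. If the headers contain no charset, the encoding defaults to ISO-8859-1
r.apparent_encoding: the encoding of the response body as inferred from the content itself
r.content: the response body as bytes
r.headers: the response headers returned by the server
r.request: the request object that produced this response (for example, r.request.headers)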
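
The reason the encoding attributes matter can be shown with plain bytes and no network at all. The sample string below is made up; the point is only that decoding UTF-8 bytes as ISO-8859-1 never fails but produces garbage, which is roughly what r.text looks like when r.encoding has fallen back to ISO-8859-1 for a Chinese page.

```python
# Bytes as a server might send them: UTF-8 encoded Chinese text.
raw = "中文网页".encode("utf-8")

# Decoding with the ISO-8859-1 fallback always succeeds, but yields mojibake.
wrong = raw.decode("iso-8859-1")

# Decoding with the encoding the content was written in restores the text.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)   # 中文网页
```

Setting r.encoding = r.apparent_encoding before reading r.text is the requests-level version of picking the second decode.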

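1.4 The Request object (the outgoing request)

The headers attribute holds the header fields sent with the request, such as User-Agent.

1.4. A general-purpose framework

Network connections can fail, so exceptions must be handled.

requests.ConnectionError: network connection error, such as a failed DNS lookup or a refused connection
requests.HTTPError: HTTP error
requests.URLRequired: the URL is missing
requests.TooManyRedirects: the maximum number of redirects was exceeded; typically seen with complex links
requests.ConnectTimeout: timed out while connecting to the remote server (covers the connection phase only)
requests.Timeout: the request timed out (covers the whole process, from sending the request to receiving the response)

Method:

r.raise_for_status(): raises requests.HTTPError if the status code indicates an error (4xx or 5xx)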

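import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError for an error status
        r.encoding = r.apparent_encoding  # use the encoding detected from the content
        return r.text
    except requests.RequestException:
        return "An exception occurred"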
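
Which status codes actually make raise_for_status() raise can be checked offline by building a Response object by hand. This is a sketch rather than how responses are normally obtained; the helper name is_error is made up for the example.

```python
import requests

def is_error(status_code):
    """Return True if raise_for_status() raises for this status code."""
    r = requests.Response()       # an empty Response, built locally
    r.status_code = status_code
    try:
        r.raise_for_status()
        return False
    except requests.HTTPError:
        return True

for code in (200, 302, 404, 500):
    print(code, is_error(code))
```

Running it shows that only the 4xx and 5xx codes raise, so a redirect or other non-200 success code passes through the framework above without an exception.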

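1.5. The HTTP protocol

HTTP, the Hypertext Transfer Protocol, is a stateless application-layer protocol based on a request-response model. Stateless means that one request has no connection to the next. Resources are identified by URLs.

URL format: http://host[:port][path]
- host: a valid Internet host name or IP address
- port: the port number, 80 by default
- path: the internal path of the resource on the host
  http://220.181.111.188/duty
  http://www.baidu.com
HTTP operations on resources
- GET: request the resource at the URL
- HEAD: request the response headers for the resource at the URL, that is, only its header information
- POST: request that new data be appended to the resource at the URL, without changing what is already there
- PUT: request that a resource be stored at the URL, overwriting the existing one
- PATCH: request a partial update of the resource at the URL, changing only part of its content, which saves bandwidth
- DELETE: request that the resource stored at the URL be deleted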
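
The host, port, and path parts of a URL can be pulled apart with the standard library. The helper name describe is made up for this sketch, and the third URL (with an explicit port) is a hypothetical example added to show the non-default case.

```python
from urllib.parse import urlsplit

def describe(url):
    """Split an HTTP URL into (host, port, path), filling in the defaults."""
    parts = urlsplit(url)
    port = parts.port or 80        # HTTP defaults to port 80
    path = parts.path or "/"       # an empty path means the site root
    return parts.hostname, port, path

print(describe("http://220.181.111.188/duty"))   # ('220.181.111.188', 80, '/duty')
print(describe("http://www.baidu.com"))          # ('www.baidu.com', 80, '/')
print(describe("http://www.baidu.com:8080/s"))   # ('www.baidu.com', 8080, '/s')
```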

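II. Web crawlers

2.1. Crawler scale

Small scale: small data volume, crawl speed is not critical; the requests library is enough
Medium scale: larger data volume, crawl speed matters; the Scrapy library; crawls a site or a series of sites
Large scale: search engines, crawl speed is critical; custom development; crawls the whole web

2.2. Restrictions on crawlers

Source review: access is restricted based on the User-Agent HTTP header
The server checks the User-Agent field of incoming HTTP request headers and responds only to browsers and well-behaved crawlers
Public notice: the Robots protocol
The site announces its crawling policy to all crawlers and asks them to comply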

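2.3. The Robots protocol (Robots Exclusion Standard)

Purpose: the site tells crawlers which pages may be crawled and which may not
Form: a robots.txt file in the site's root directory
If a site has no robots.txt, it places no restrictions on crawlers

# "*" means all crawlers; "/" is the site root
User-agent: *    # applies to every crawler
Disallow: /?*    # paths starting with /? must not be crawled

2.4. How to comply with the Robots protocol

Usage: read robots.txt, automatically or by hand, before crawling any content
Binding force: the protocol is a recommendation rather than a hard constraint, but ignoring it can carry legal risk.
When a crawler behaves like a human user (few requests, over a short time), it may be acceptable not to consult it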
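
Checking robots.txt automatically can be sketched with the standard library's urllib.robotparser. The rules and the crawler name MyCrawler below are hypothetical and are parsed from a string, so nothing is downloaded; a real crawler would fetch the file from the site's root first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the root of the site it is about to crawl.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

allowed = rp.can_fetch("MyCrawler", "http://example.com/index.html")
blocked = rp.can_fetch("MyCrawler", "http://example.com/private/data.html")
print(allowed)   # True
print(blocked)   # False
```

Note that this parser matches rules by path prefix, so it is shown here with a plain directory rule rather than a wildcard pattern.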

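2.5. Exercises

When access to a page is restricted, the crawler may be violating the Robots protocol, or the site may be reviewing the source of the request and rejecting it based on the User-Agent in r.request.headers.

kv = {'user-agent': 'Mozilla/5.0'}  # pretend to be a browser
r = requests.get(url, headers=kv)

Submitting a keyword to the Baidu or 360 search engine
- Baidu keyword search: http://www.baidu.com/s?wd=keyword
- 360 keyword search: http://www.so.com/s?q=keyword

kv = {'wd': '苹果的品种'}  # search keyword: "varieties of apple"
r = requests.get('http://www.baidu.com/s', params=kv)
r.status_code
r.request.url
len(r.text)

Downloading and saving an image from the web
Image links have the form url/picture.jpg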

import requests
import os

url = 'https://ss2.baidu.com/-vo3dSag_xI4khGko9WTAnF6hhy/image/h%3D300/sign=8ad822e9df00baa1a52c41bb7711b9b1/0b55b319ebc4b745564c5813c1fc1e178b8215de.jpg'
root = 'C:/Users/lenovo/Desktop'
path = os.path.join(root, url.split('/')[-1])  # keep the file name from the URL
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()  # do not save an error page as an image
        with open(path, 'wb') as f:  # the with block closes the file
            f.write(r.content)
        print('File saved')
    else:
        print('File already exists')
except (requests.RequestException, OSError):
    print('Download failed')
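
Building the local path is the step that is easiest to get wrong, because gluing the directory and file name together by hand can drop the separator or keep a query string in the name. A minimal sketch with the standard library follows; the helper name local_path, the directory pics, and the example URL are all made up.

```python
import os
from urllib.parse import urlsplit

def local_path(root, url):
    """Build the local file path for a downloaded image URL."""
    filename = os.path.basename(urlsplit(url).path)  # drop any ?query part
    return os.path.join(root, filename)

print(local_path("pics", "http://example.com/img/picture.jpg?size=large"))
```

os.path.join inserts the right separator for the current operating system, so the same code works on Windows and elsewhere.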

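Automatic lookup of an IP address's location
www.ip138.com provides the lookup service

url = 'http://m.ip138.com/ip.asp?ip='
r = requests.get(url + '202.204.80.112')

When inspecting r.text, it is best to limit the output length, for example r.text[:500]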

III. Getting started with Beautiful Soup

Beautiful Soup parses HTML and XML content.

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('https://python123.io/ws/demo.html')
>>> r.status_code
200
>>> demo = r.text
>>> soup = BeautifulSoup(demo,"html.parser")
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

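3.1. Basic elements of the BeautifulSoup library

Beautiful Soup is a library for parsing, traversing, and maintaining a "tag tree".

Importing

from bs4 import BeautifulSoup

Parsing a tag tree

soup = BeautifulSoup('<html>data</html>', 'html.parser')
soup = BeautifulSoup(open('filename.html'), 'html.parser')

Parsers
- 'html.parser': requires only the bs4 library
- 'lxml': requires the lxml library
- 'xml': requires the lxml library
- 'html5lib': requires the html5lib library

Basic elements of the BeautifulSoup class

Tag: a tag, the most basic unit of information, delimited by <> at its start and </> at its end

Getting a tag

# returns only the first <a> element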
>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

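Name: the name of a tag; the name of <p>…</p> is 'p'

soup.a.name
soup.a.parent.name  # name of the parent element

Attributes: a tag's attributes, organized as a dict; a tag with no attributes returns an empty dict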

>>> soup.a.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> type(soup.a)
<class 'bs4.element.Tag'>

NavigableString: the non-attribute string inside a tag, that is, the text between <> and </>; accessed with .string

>>> type(soup.p)
<class 'bs4.element.Tag'>
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>

Comment: the comment part of the string inside a tag; a special subtype of NavigableString

>>> newsoup = BeautifulSoup('<b><!--this is a comment--></b>','html.parser')
>>> newsoup.b
<b><!--this is a comment--></b>
>>> newsoup.b.string
'this is a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>

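Because .string returns the comment text without the <!-- --> markers, check the type of .string to tell whether it is a comment.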
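
The type check can be written with isinstance. The two-tag document and the helper name is_comment below are made up for this sketch.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup(
    "<b><!--this is a comment--></b><p>this is not a comment</p>",
    "html.parser",
)

def is_comment(tag):
    """True if the tag's only child is an HTML comment."""
    return isinstance(tag.string, Comment)

print(is_comment(soup.b))   # True
print(is_comment(soup.p))   # False
```

Filtering on this check keeps commented-out markup from leaking into extracted text.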

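3.2. Traversing HTML content with bs4

Downward traversal of the tag tree

.contents: a list of the child nodes
.children: an iterator over the child nodes
.descendants: an iterator over all descendant nodes

Upward traversal of the tag tree

.parent: the parent node
.parents: an iterator over the ancestor nodes

Sibling traversal of the tag tree

.next_sibling: the next sibling node
.previous_sibling: the previous sibling node
.next_siblings: an iterator over all following sibling nodes
.previous_siblings: an iterator over all preceding sibling nodes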
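
The three directions of traversal can be tried on a small inline document. The HTML below is a cut-down, made-up variant of the demo page, parsed offline. One thing worth noticing is that children and siblings include plain strings as well as tags, so the sketch filters on the Tag type where only tags are wanted.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    "<html><body><p>"
    "<a id='link1'>Basic Python</a> and <a id='link2'>Advanced Python</a>"
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Downward: <p> has three children, two <a> tags and the string " and ".
children = list(soup.p.children)
print(len(children))                                       # 3
print([c.name for c in children if isinstance(c, Tag)])    # ['a', 'a']

# Upward: every ancestor of the first <a>, ending at the document object.
ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)                    # ['p', 'body', 'html', '[document]']

# Sideways: tags that follow the first <a> under the same parent.
later = [s["id"] for s in soup.a.next_siblings if isinstance(s, Tag)]
print(later)                        # ['link2']
```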

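3.3. Formatted HTML output

The prettify() method adds a newline after each tag so the markup is easier to read.

soup.prettify()
soup.a.prettify()

The bs4 library converts any HTML input to Unicode internally and uses UTF-8 by default when encoding output.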

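IV. Marking up information

Marked-up information forms an organizational structure, which adds a dimension to the raw information.

XML
JSON
YAML: untyped key-value pairs; indentation expresses which key a value belongs to.
- | introduces a whole block of text
- : expresses a key-value relationship
- - expresses parallel items in a list
- # starts a comment
```
key : value
key : # comment
- value1
- value2
key :
    subkey : subvalue
```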
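
For comparison, the same three shapes (a single value, a list of parallel values, and a nested key) can be written as JSON with the standard library. The key names key1, key2, and key3 are placeholders chosen for this sketch.

```python
import json

# The same structure as the YAML sample: a value, a list, a nested mapping.
record = {
    "key1": "value",
    "key2": ["value1", "value2"],
    "key3": {"subkey": "subvalue"},
}

text = json.dumps(record, indent=2)   # Python objects -> JSON text
print(text)

restored = json.loads(text)           # JSON text -> Python objects
print(restored["key3"]["subkey"])     # subvalue
```

Unlike YAML, JSON keys must be quoted strings and JSON has no comment syntax, which is part of why it suits program-to-program exchange more than hand-edited configuration.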

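4.1. General methods for extracting information

soup.find_all(name, attrs, recursive, string)

for link in soup.find_all('a'):
    print(link.get('href'))

name: a filter on tag names

import re
soup.find_all('a')
soup.find_all(['a', 'b'])
soup.find_all(True)  # all tags
soup.find_all(re.compile('^b'))  # tag names starting with b

attrs: a filter on tag attribute values

soup.find_all(id='link')  # id exactly equal to 'link'
soup.find_all(id=re.compile('link'))  # id containing 'link'

recursive: whether to search all descendants; True by default. When set to False, only direct children are searched
string: a filter on the string content between <> and </>

soup.find_all(string=re.compile('python'))  # strings containing 'python'
soup.find_all(string='python')  # strings exactly equal to 'python'

Shorthand forms
<tag>(..) is equivalent to <tag>.find_all(..)
soup(..) is equivalent to soup.find_all(..)

<>.find() returns only the first result
find_parents() searches the ancestors and returns a list
find_parent() searches the ancestors and returns a single result
find_next_siblings() searches the following siblings and returns a list
find_next_sibling() searches the following siblings and returns a single result
find_previous_siblings() searches the preceding siblings and returns a list
find_previous_sibling() searches the preceding siblings and returns a single result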
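
How string and regular-expression filters differ is easiest to see side by side. The document below is made up and parsed offline. According to the Beautiful Soup documentation, a compiled pattern is applied with its search() method, so an unanchored pattern matches anywhere in the tag name or attribute value.

```python
import re
from bs4 import BeautifulSoup

# A small made-up document, parsed offline.
html = (
    "<html><body><table><tr><td><b>bold</b></td></tr></table>"
    "<a id='link1' href='/one'>one</a><a id='link2' href='/two'>two</a>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Unanchored: any tag name that contains "b". Anchored: names starting with "b".
contains_b = [t.name for t in soup.find_all(re.compile("b"))]
starts_with_b = [t.name for t in soup.find_all(re.compile("^b"))]
print(contains_b)       # ['body', 'table', 'b']
print(starts_with_b)    # ['body', 'b']

# A plain string must match the attribute exactly; a pattern may match part of it.
exact_id = soup.find_all(id="link")
pattern_id = soup.find_all(id=re.compile("link"))
print(len(exact_id), len(pattern_id))   # 0 2

# Calling the soup object directly is shorthand for find_all().
hrefs = [a.get("href") for a in soup("a")]
print(hrefs)            # ['/one', '/two']
```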

V. Exercise

Crawl the page "http://www.zuihaodaxue.cn/zuihaodaxuepaiming2018.html". First check the Robots protocol, that is, whether "http://www.zuihaodaxue.cn/robots.txt" exists.

import requests
import bs4  # needed for the bs4.element.Tag type check
from bs4 import BeautifulSoup

def getHTMLText(url):
    try:
        r= requests.get(url,timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""
    

def fillUnivList(ulist,html):
    soup = BeautifulSoup(html,'html.parser')
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):  # children may be <tr> tags or plain strings
            tds = tr('td')
            ulist.append([tds[0].string,tds[1].string,tds[3].string])
    

def printUnivList(ulist,num):
    tplt= '{0:^10}\t{1:{3}^10}\t{2:^10}'
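    # Column headers: rank, university, score. chr(12288) is a full-width space used as padding so Chinese text lines up.
    print(tplt.format("排名","学校","打分",chr(12288)))
    for i in range(min(num, len(ulist))):  # do not run past the end of the list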
        u = ulist[i]
        print(tplt.format(u[0],u[1],u[2],chr(12288)))

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo,html)
    printUnivList(uinfo,20)
    
main()
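
The chr(12288) trick in printUnivList can be isolated in a few lines. The university name below is a placeholder, not real data. Both padded strings have the same number of characters, but only the one padded with full-width spaces has the same display width as other Chinese text in a terminal or monospaced font.

```python
FULLWIDTH_SPACE = chr(12288)   # U+3000, as wide as one Chinese character

name = "某某大学"   # placeholder university name

# Centre the name in a field of 10 characters, first with ASCII spaces,
# then with full-width spaces passed in as the fill character.
ascii_padded = "{0:^10}".format(name)
wide_padded = "{0:{1}^10}".format(name, FULLWIDTH_SPACE)

print("[" + ascii_padded + "]")
print("[" + wide_padded + "]")
```

The nested {1} inside the format spec is what lets the fill character be supplied as an argument, which is the same mechanism as {3} in the template used by printUnivList.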
