Data Engineering, Part 1: Data Collection

The overall steps of data engineering:

Step	Method
Collection	Web crawler
Storage	Database
Cleaning	Python
Analysis	Algorithms
Visualization	Web

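1. Data Collection

There are many ways to obtain data. This post focuses on one of them: using Python to fetch data from web pages, i.e. a web crawler. The main advantage of this approach is that it depends little on the surrounding environment.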

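2. Data Types and Structures

Data types:

Static data
Dynamic data (carries a timestamp)

Data structures:

TXT: plain text.
CSV: comma-separated values. Analogous to a Python list [].
JSON: key-value pairs. Analogous to a Python dict {}.
Database: structured data. WAMP (Windows, Apache, MySQL, PHP) is a bundle that provides MySQL on Windows.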

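These data structures are classified from the point of view of data found on the web, which is different from data structures in the traditional computer-science sense.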
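The list and dict analogies above can be made concrete with a small sketch. The sample CSV and JSON strings here are made up for illustration and do not come from any real data source.

```python
import csv
import io
import json

# Hypothetical sample data, invented for this illustration
csv_text = "name,score\nAlice,90\nBob,85\n"
json_text = '{"name": "Alice", "score": 90}'

# Each CSV row is parsed into a list of strings
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows[1])          # ['Alice', '90']

# A JSON object is parsed into a dict
record = json.loads(json_text)
print(record["score"])  # 90
```

Note that the CSV reader returns every field as a string, while JSON keeps numbers as numbers.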

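Some background on computer networking:

URL: Uniform Resource Locator, the address of a resource.
HTTP: currently the most widely used web protocol.
Types of content a URL can return:

HTML page
API response (typically structured data such as JSON)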
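To see what a URL is made of, the sketch below splits the URL used in the GET example of this post into its parts with the standard library. No request is sent; this only parses the string.

```python
from urllib.parse import urlparse, parse_qs

url = 'http://kaoshi.edu.sina.com.cn/college/scorelist?tab=batch&wl=1&local=2&batch=&syear=2013'

parts = urlparse(url)
print(parts.scheme)  # http
print(parts.netloc)  # kaoshi.edu.sina.com.cn
print(parts.path)    # /college/scorelist

# keep_blank_values=True keeps parameters with an empty value, such as "batch="
params = parse_qs(parts.query, keep_blank_values=True)
print(params['syear'])  # ['2013']
print(params['batch'])  # ['']
```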

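How a browser sends requests to a server:

GET: parameters are sent in the URL's query string.
POST: parameters are sent in the request body.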
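As a rough illustration of the two request methods, the sketch below builds one request of each kind with urllib without actually sending it. The endpoint http://example.com/search and its parameters are hypothetical.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and parameters; nothing is sent over the network
base_url = 'http://example.com/search'
params = urllib.parse.urlencode({'q': 'python', 'page': 1})

# GET: the parameters are appended to the URL as a query string
get_request = urllib.request.Request(base_url + '?' + params)

# POST: the parameters travel in the request body, which must be bytes
post_request = urllib.request.Request(base_url, data=params.encode())

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

urllib decides the method from the presence of data: a request without a body is a GET, and one with a body is a POST.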

Example 1: sending a GET request

import json
import sys
# A bare "import urllib" does not load the request submodule
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

# The link to visit, stored in a string variable
url = 'http://kaoshi.edu.sina.com.cn/college/scorelist?tab=batch&wl=1&local=2&batch=&syear=2013'

# Build the request
request = urllib.request.Request(url=url)
# Open the connection; time out if there is no response within 20 seconds
response = urllib.request.urlopen(request, timeout=20)
# Read the response body (bytes)
result = response.read()

print(result.decode())
print("type before decode is", type(result))
print("type after decode is", type(result.decode()))

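Note that in Python 3 a str holds Unicode text. The encode method turns a str into a bytes object, and the decode method turns bytes back into a str (both default to UTF-8). Decoding result in the code above yields the HTML source of the page as returned by the server; anything the browser would generate afterwards by running JavaScript is not included.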
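The round trip between str and bytes can be checked with a minimal sketch. The sample string is arbitrary.

```python
text = '数据采集'                # a str holding four Chinese characters

raw = text.encode('utf-8')       # str -> bytes
print(type(raw), len(raw))       # <class 'bytes'> 12 (three bytes per character)

back = raw.decode('utf-8')       # bytes -> str
print(type(back), back == text)  # <class 'str'> True
```

If a page is served in another encoding (GBK is common on older Chinese sites), decode has to be given that encoding name instead of relying on the UTF-8 default.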

Example 2: sending a POST request

import json
import urllib.parse
import urllib.request

url = 'https://shuju.wdzj.com/plat-info-target.html'

# Organize the parameters as a dict and URL-encode them
data = urllib.parse.urlencode({
    'target1': 1,
    'target2': 0,
    'type': 1,
    'wdzjPlatId': 59
})

# POST data must be bytes, not str
data = data.encode()

# Build the request
request = urllib.request.Request(url)
# Build an opener that handles cookies
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())
# Open the connection; passing data makes this a POST request
response = opener.open(request, data)
# Read the response body (bytes)
result = response.read()

print(result.decode())
print("type before decode is", type(result))
print("type after decode is", type(result.decode()))

# Convert the JSON string into a dict and print its keys
for key in json.loads(result.decode()).keys():
    print(key)

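For a POST request, the parameter names and values that go into data have to be looked up from the page being scraped, for example by inspecting the request that the page itself sends.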
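To compare against what the page sends, it can help to look at the exact body that the parameters of Example 2 turn into. The sketch below only encodes and decodes the form data and sends nothing.

```python
import urllib.parse

# The same parameters as in Example 2
form = {'target1': 1, 'target2': 0, 'type': 1, 'wdzjPlatId': 59}

encoded = urllib.parse.urlencode(form)
print(encoded)           # target1=1&target2=0&type=1&wdzjPlatId=59

body = encoded.encode()  # the bytes that would be sent as the request body
print(body)

# parse_qs reverses the encoding; every value comes back as a list of strings
decoded = urllib.parse.parse_qs(encoded)
print(decoded['wdzjPlatId'])  # ['59']
```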

Parsing HTML: the BeautifulSoup library
Parsing API responses: the json library
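A minimal sketch of both kinds of parsing follows. The HTML snippet and the API response are invented for illustration and are not the real output of the sites used in the examples; the sketch assumes the bs4 package is installed and uses Python's built-in html.parser backend.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical inputs, made up for this illustration
html = ('<table><tr><td>Beijing</td><td>550</td></tr>'
        '<tr><td>Tianjin</td><td>520</td></tr></table>')
api_response = '{"status": "ok", "data": [1, 2, 3]}'

# HTML: walk the table rows and collect the text of each cell
soup = BeautifulSoup(html, 'html.parser')
rows = [[td.get_text() for td in tr.find_all('td')]
        for tr in soup.find_all('tr')]
print(rows)             # [['Beijing', '550'], ['Tianjin', '520']]

# API: a JSON string becomes a dict
payload = json.loads(api_response)
print(payload['data'])  # [1, 2, 3]
```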

P.S. The material above comes from an Alibaba Tianchi tutorial (link).
