Python Web Crawlers: Day 01

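Web crawlers

Definition: a web crawler (also called a web spider or web robot) is a program that fetches data from the web
Summary: a Python program imitates a person visiting a website; the more realistic the imitation, the better
Purpose: collect a large amount of valid data to analyze market trends and support company decisions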

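How companies obtain data

Data the company already owns
Data purchased from third-party data platforms
  Datatang (数据堂), Guiyang Big Data Exchange (贵阳大数据交易所)
Data collected by crawlers
  When the data is not for sale or the price is too high, write a crawler to collect it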

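Advantages of Python for crawling

Python: rich, mature request and parsing modules
PHP: weak support for multithreading and asynchronous I/O
Java: heavyweight, verbose code
C/C++: efficient at run time, but slow to develop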

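Types of crawlers

General-purpose web crawlers (used by search engines; they must follow the robots protocol), for example: http://www.qq.com/robots.txt
1. How a search engine obtains the URL of a new website
  1. The website submits itself to the search engine (Baidu Webmaster Platform)
  2. Cooperation with DNS providers (such as Wanwang, 万网) so that new sites are indexed quickly
Focused web crawlers
Crawlers you write yourself: topic-oriented or requirement-oriented crawlers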
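
A minimal sketch of checking the robots protocol with the standard-library module urllib.robotparser. The robots.txt content, the crawler name "MyCrawler", and the example.com URLs are made up for illustration, and the rules are parsed from memory so the snippet runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not taken from any real site)
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_text):
    """Parse robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether a crawler with the given name may fetch each URL
allowed_home = parser.can_fetch("MyCrawler", "http://example.com/index.html")
allowed_private = parser.can_fetch("MyCrawler", "http://example.com/private/a.html")
print(allowed_home, allowed_private)
```

For a live site, the same parser can load the file itself: call set_url() with the robots.txt address, then read(), before calling can_fetch().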

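Steps for crawling data

1. Decide which URL to crawl
2. Fetch the HTML page of the response over HTTP/HTTPS
3. Extract the useful data from the HTML page
  1. The data you need: save it
  2. Other URLs found in the page: go back to step 2 for each of them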
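
The loop described by these steps can be sketched as follows. To keep the example runnable offline, the "website" is a hypothetical in-memory dict (FAKE_SITE) and the fetch function is simply its .get method; a real crawler would download each page over HTTP/HTTPS instead. The page title stands in for "the data you need".

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory site: URL -> HTML (stands in for real HTTP responses)
FAKE_SITE = {
    "http://example.com/": '<title>Home</title><a href="/a.html">A</a><a href="/b.html">B</a>',
    "http://example.com/a.html": '<title>Page A</title><a href="/b.html">B</a>',
    "http://example.com/b.html": '<title>Page B</title><a href="/">Home</a>',
}


class PageParser(HTMLParser):
    """Collects the page title (the wanted data) and every link target."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, fetch):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    seen = {start_url}
    queue = deque([start_url])
    results = {}
    while queue:
        url = queue.popleft()            # step 1: the URL to crawl
        html = fetch(url)                # step 2: get the HTML page
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)                # step 3: extract data
        results[url] = parser.title      # step 3.1: save the wanted data
        for href in parser.links:        # step 3.2: queue newly found URLs
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results


titles = crawl("http://example.com/", FAKE_SITE.get)
print(titles)
```

The seen set keeps the crawler from visiting the same page twice, which matters as soon as pages link back to each other.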

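Chrome browser extensions

Installing an extension
1. Top-right menu - More tools - Extensions
2. Turn on Developer mode
3. Drag the extension file onto the browser window
Extensions used
1. Proxy SwitchyOmega: switches between proxies
2. XPath Helper: helps parse data out of web pages
3. JSONView: displays JSON data in a readable, formatted way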

Fiddler packet capture tool

Capture setup
1. Configure the Fiddler capture tool
2. Configure the browser proxy

Anaconda and Spyder

Anaconda: an open-source Python distribution
Spyder: an integrated development environment (IDE)
Common Spyder shortcuts
1. Comment / uncomment: Ctrl + 1
2. Save: Ctrl + S
3. Run the program: F5

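WEB

HTTP and HTTPS
1. HTTP: port 80
2. HTTPS: port 443, the secure (encrypted) version of HTTP
GET and POST
1. GET: the query parameters are visible in the URL
2. POST: the query parameters and submitted data travel in the form (the request body) and do not appear in the URL
URL
http://   item.jd.com          :80    /26606127795.html   #detail
scheme    domain name or IP    port   resource path       anchor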
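
As a quick check of the URL parts listed here, the standard-library function urllib.parse.urlsplit breaks the same example URL into its components. This is only an illustrative sketch; nothing is requested from the network.

```python
from urllib.parse import urlsplit

# The example URL from the text, written without the spaces between parts
parts = urlsplit("http://item.jd.com:80/26606127795.html#detail")

print(parts.scheme)    # http
print(parts.hostname)  # item.jd.com
print(parts.port)      # 80
print(parts.path)      # /26606127795.html
print(parts.fragment)  # detail
```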

User-Agent

Purpose: identifies the user's browser, operating system, and so on, so that the server can return the HTML best suited to that client
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36

Explanation:

Name	Description
Mozilla	Firefox (Gecko engine)
IE	Trident (its own engine)
Linux	KHTML (like Gecko)
Apple	WebKit (like KHTML)
Google	Chrome (like WebKit)

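Request modules for crawlers

urllib.request
1. Versions
  1. Python 2: urllib and urllib2
  2. Python 3: the two were merged into the urllib package; requests are sent with urllib.request

Common methods

urllib.request.urlopen("URL")

Purpose:

    Sends a request to the website and returns the response
    urlopen() returns a response object; response.read() returns bytes
    response.read().decode("utf-8"): bytes -> str

Example: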

import urllib.request

url = "http://www.baidu.com/"

# Send the request and get the response object
response = urllib.request.urlopen(url)

# read() returns the response body as bytes;
# decode() converts bytes -> str
html = response.read().decode("utf-8")
print(html)

urllib.request.Request(url, headers={})

Replacing the default User-Agent is the first step in the contest between crawlers and anti-crawler measures

Example:

import urllib.request

url = "http://www.baidu.com/"
headers = {
    "User-Agent": "Mozilla/5.0 "
                  "(Windows; U; Windows NT 6.1; en-US) "
                  "AppleWebKit/534.16 (KHTML, like Gecko) "
                  "Chrome/10.0.648.133 Safari/534.16"
}

# 1. Build the request object
request = urllib.request.Request(url, headers=headers)

# 2. Get the response object
response = urllib.request.urlopen(request)

# 3. Read the content of the response
html = response.read().decode("utf-8")

# Get the HTTP status code
print(response.getcode())

# Get the response headers
print(response.info())

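Usage steps
1. Build the request object: request = Request()
2. Get the response object: response = urlopen(request)
3. Read the content from the response object: response.read().decode("utf-8")

Request object methods
1. add_header()
  Purpose: add or modify a header (such as User-Agent)
2. get_header("User-agent"): only the U is capitalized
  Purpose: get the value of an HTTP header that has already been set

Example: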

import urllib.request

url = "http://www.baidu.com/"
user_agent = ("Mozilla/5.0 "
              "(Windows; U; Windows NT 6.1; en-US) "
              "AppleWebKit/534.16 (KHTML, like Gecko) "
              "Chrome/10.0.648.133 Safari/534.16")

# 1. Build the request object
request = urllib.request.Request(url)

# add_header() takes the header name and its string value
# (passing a whole headers dict as the value is a bug)
request.add_header("User-Agent", user_agent)

# 2. Get the response object
response = urllib.request.urlopen(request)

# get_header() reads the User-Agent back; note the key is "User-agent"
print(request.get_header("User-agent"))
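
The rule that only the U is capitalized can be checked without sending a request. The sketch below relies on how CPython's urllib.request stores header names (the first letter upper-case, the rest lower-case); the short "Mozilla/5.0" value is just a placeholder.

```python
import urllib.request

# Building a Request does not contact the server
request = urllib.request.Request("http://www.baidu.com/")

# The header name may be written in any case when it is added
request.add_header("USER-AGENT", "Mozilla/5.0")

# It is stored as "User-agent", so lookups must use exactly that spelling
print(request.get_header("User-agent"))   # Mozilla/5.0
print(request.get_header("User-Agent"))   # None
print(request.has_header("User-agent"))   # True
print(request.header_items())             # [('User-agent', 'Mozilla/5.0')]
```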

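getcode()
1. Purpose: returns the HTTP status code of the response
  200: success
  4XX: client-side request error (for example, 404: page not found)
  5XX: server error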
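
A small sketch of grouping status codes into these classes, using the standard-library enum http.HTTPStatus for the reason phrase. The helper name describe_status is made up for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (category, reason phrase) for an HTTP status code."""
    if 100 <= code < 200:
        category = "informational"
    elif 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirection"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Code is not defined in the HTTPStatus enum
        phrase = "Unknown"
    return category, phrase


for code in (200, 404, 500):
    print(code, describe_status(code))
```

Note that urlopen() raises urllib.error.HTTPError for 4XX and 5XX responses by default, so in practice those codes are usually read from the exception's code attribute rather than from getcode().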

info()

Purpose: returns the headers of the server's response
Example:

import urllib.request

url = "http://www.baidu.com/"
headers = {
    "User-Agent": "Mozilla/5.0 "
                  "(Windows; U; Windows NT 6.1; en-US) "
                  "AppleWebKit/534.16 (KHTML, like Gecko) "
                  "Chrome/10.0.648.133 Safari/534.16"
}

# 1. Build the request object
request = urllib.request.Request(url, headers=headers)

# 2. Get the response object
response = urllib.request.urlopen(request)

# 3. Read the content of the response
html = response.read().decode("utf-8")

# Get the HTTP status code
print(response.getcode())

# Get the response headers
print(response.info())

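urllib.parse

quote("中文")

Percent-encodes a string containing non-ASCII characters (such as Chinese) so that it can be placed in a URL

Example: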

import urllib.request
import urllib.parse

headers = {
    "User-Agent": "Mozilla/5.0 "
                  "(Windows; U; Windows NT 6.1; en-US) "
                  "AppleWebKit/534.16 (KHTML, like Gecko) "
                  "Chrome/10.0.648.133 Safari/534.16"
}
url = "http://www.baidu.com/s?wd="
key = input("Enter the search keywords: ")

# Percent-encode the keywords and append them to the URL
key = urllib.parse.quote(key)
fullurl = url + key

# Build the request object
request = urllib.request.Request(fullurl, headers=headers)

# Get the response object
response = urllib.request.urlopen(request)

# Read the body and decode bytes -> str
html = response.read().decode("utf-8")
print(html)

urlencode(dict)

Introduction:

Query string to build: wd="美女"
d = {"wd": "美女"}
d = urllib.parse.urlencode(d)
print(d)
Result: wd=%E7%BE%8E%E5%A5%B3

Example:

import urllib.request
import urllib.parse

baseurl = "http://www.baidu.com/s?"
headers = {
    "User-Agent": "Mozilla/5.0 "
                  "(Windows; U; Windows NT 6.1; en-US) "
                  "AppleWebKit/534.16 (KHTML, like Gecko) "
                  "Chrome/10.0.648.133 Safari/534.16"
}
key = input("Enter the search keywords: ")

# urlencode() here takes a dict and returns "key=value" pairs
d = {"wd": key}
d = urllib.parse.urlencode(d)
url = baseurl + d

# Build the request object
request = urllib.request.Request(url, headers=headers)

# Get the response object
response = urllib.request.urlopen(request)

# Read the content
html = response.read().decode("utf-8")
print(html)
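
The same urlencode() output can serve both request styles described in the WEB section: appended to the URL for GET, or encoded to bytes and passed as data for POST. The sketch below only builds the request objects and sends nothing; the "pn" parameter and the example.com form address are assumptions made for illustration.

```python
import urllib.parse
import urllib.request

# "wd" comes from the examples above; "pn" is a hypothetical extra parameter
params = {"wd": "python", "pn": 10}
query = urllib.parse.urlencode(params)

# GET: the parameters are part of the URL
get_request = urllib.request.Request("http://www.baidu.com/s?" + query)

# POST: the parameters travel in the request body, which must be bytes
# (the form address below is hypothetical)
post_request = urllib.request.Request(
    "http://www.example.com/form",
    data=query.encode("utf-8"),
)

print(query)                       # wd=python&pn=10
print(get_request.get_method())    # GET
print(post_request.get_method())   # POST
```

urllib.request picks the method from the presence of data: a Request with data defaults to POST, one without it defaults to GET.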

unquote("encoded string")

import urllib.parse

# Encode, then decode back to the original string
key = urllib.parse.quote("你好")
print(key)
dekey = urllib.parse.unquote(key)
print(dekey)
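
One detail worth knowing when choosing between these functions: quote() leaves "/" untouched and encodes a space as %20, while quote_plus() (which urlencode() uses by default) also encodes "/" and turns a space into "+". A short sketch with a made-up sample string:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "你好 world/1"   # sample string chosen for illustration

path_style = quote(text)            # "/" is kept, space becomes %20
strict = quote(text, safe="")       # nothing is exempt, "/" becomes %2F
form_style = quote_plus(text)       # space becomes "+", "/" becomes %2F

print(path_style)   # %E4%BD%A0%E5%A5%BD%20world/1
print(strict)       # %E4%BD%A0%E5%A5%BD%20world%2F1
print(form_style)   # %E4%BD%A0%E5%A5%BD+world%2F1

# Each encoder has a matching decoder
print(unquote(path_style) == text)
print(unquote_plus(form_style) == text)
```

Decode with the function that matches the encoder: unquote() would leave the "+" produced by quote_plus() as a literal plus sign.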
