Building a Simple Web Crawler (Part 1)

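I'm posting my crawler-learning process step by step.
This is a simple crawler. Here is the diagram first:
[Figure: flowchart of the crawler]
Today I'm studying the web page downloader, urllib2. The author of the video I'm following uses Python 2, but I installed Python 3, where urllib2 no longer exists and its functionality lives in urllib.request. I took a few detours getting the import right. Here is the code: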

import urllib.request
from http import cookiejar

url = "http://www.baidu.com"

print("Method 1")
res = urllib.request.urlopen(url)      # simplest form: pass the URL straight to urlopen
print(res.getcode())      # print the HTTP status code
print(len(res.read()))    # length of the returned page content

print("Method 2")
request = urllib.request.Request(url)     # a Request object allows extra customization
request.add_header("User-Agent", "Mozilla/5.0")    # make the crawler identify itself as a browser
res2 = urllib.request.urlopen(request)
print(res2.getcode())
print(len(res2.read()))

print("Method 3")
cj = cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
urllib.request.install_opener(opener)      # install the opener so urlopen also handles cookies
res3 = urllib.request.urlopen(url)
print(res3.getcode())
print(len(res3.read()))
print("Cookie contents")
print(cj)

The page downloaded here is the Baidu home page.
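The second method above could be wrapped into a small reusable downloader. This is only a minimal sketch, not the code from the video: the names build_request and download, the default timeout of 10 seconds, and the choice to return None on failure are all my own assumptions.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries a browser-like User-Agent header."""
    # build_request is a hypothetical helper name, not part of urllib
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    return request


def download(url, timeout=10):
    """Return the response body as bytes, or None if the download fails."""
    if url is None:
        return None
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as res:
            return res.read()
    except urllib.error.URLError as exc:
        # HTTPError is a subclass of URLError, so HTTP error statuses land here too
        print("download failed:", exc)
        return None
```

With this in place, download("http://www.baidu.com") should return the page as bytes when the network is reachable, and None when the request fails.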

Below are the changes between Python 2 and Python 3, collected from the web.
Py2.x:

urllib library
urllib2 library
Py3.x:
urllib library
Changes:

Where Python 2.x uses import urllib2, Python 3.x uses import urllib.request, urllib.error.
Where Python 2.x uses import urllib, Python 3.x uses import urllib.request, urllib.error, urllib.parse.
Where Python 2.x uses cookielib.CookieJar, Python 3.x uses http.cookiejar.CookieJar.
Where Python 2.x uses import urlparse, Python 3.x uses import urllib.parse.
Where Python 2.x uses urllib2.urlopen, Python 3.x uses urllib.request.urlopen.
Where Python 2.x uses urllib.urlencode, Python 3.x uses urllib.parse.urlencode.
Where Python 2.x uses urllib.quote, Python 3.x uses urllib.parse.quote.
Where Python 2.x uses urllib2.Request, Python 3.x uses urllib.request.Request.
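To make the mapping concrete, here is a short sketch that uses the Python 3 names without touching the network. The query parameters wd and pn and the /s path are made-up values for illustration only.

```python
from http.cookiejar import CookieJar                 # Python 2: cookielib.CookieJar
from urllib.parse import quote, urlencode, urlparse  # Python 2: urllib.quote, urllib.urlencode, urlparse.urlparse
from urllib.request import Request                   # Python 2: urllib2.Request

# Build a query string from a dict (illustrative parameter names)
query = urlencode({"wd": "python", "pn": 10})
print(query)                    # wd=python&pn=10

# Percent-encode characters that are not safe in a URL
quoted = quote("a b")
print(quoted)                   # a%20b

# Split a URL into its components
parts = urlparse("http://www.baidu.com/s?" + query)
print(parts.netloc, parts.path, parts.query)

# Constructing a Request or a CookieJar does not send anything yet
request = Request(parts.geturl())
jar = CookieJar()
print(request.full_url)
print(len(jar))                 # 0, the jar is empty until a response sets cookies
```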
