Python Web Scraping Basics (Part 1)


I've been learning Python recently and exploring web scraping along the way; here are my notes on the basics (using Python 2.7):

Three ways to fetch web page data:

# encoding=utf-8

import urllib2

def download1(url):
    return urllib2.urlopen(url).read()
    # read() fetches the entire response by default
    # read(100) fetches only the first 100 bytes

def download2(url):
    return urllib2.urlopen(url).readlines()

def download3(url):
    response = urllib2.urlopen(url)
    while True:
        line = response.readline()
        if not line:
            break

        print line

url = "http://wwww.baidu.com"
print download3(url)

All three are based on urllib2 and are fairly straightforward.

Impersonating a Browser

Many sites now deploy anti-scraping measures to keep their data from being harvested. I've learned two ways to keep a scraper working in that situation: one is to add a randomized header, the other is to use a framework that simulates a real browser; the underlying idea is much the same.
Adding a randomized header:

import urllib2


def download(url):
    # header = {"User-Agent": "User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)"}
    header = {"User-Agent": "User-Agent: UCWEB7.0.2.37/28/999"}

    request = urllib2.Request(url=url, headers=header)
    # add another header
    request.add_header("name", "zhangsan")

    # open the request
    response = urllib2.urlopen(request)

    print "result:" + str(response.code)
    print response.read()


download("http://www.baidu.com")

In practice we can use a random number to pick a header; the code above simulates IE and the mobile UC browser respectively. There are plenty of User-Agent strings online to draw from; below is a list collected from the web (a random-selection sketch follows the lists):

pcUserAgent = {
"safari 5.1 – MAC":"User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"safari 5.1 – Windows":"User-Agent:Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"IE 9.0":"User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);",
"IE 8.0":"User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"IE 7.0":"User-Agent:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"IE 6.0":"User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Firefox 4.0.1 – MAC":"User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Firefox 4.0.1 – Windows":"User-Agent:Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Opera 11.11 – MAC":"User-Agent:Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
"Opera 11.11 – Windows":"User-Agent:Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
"Chrome 17.0 – MAC":"User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Maxthon":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
"Tencent TT":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
"The World 2.x":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"The World 3.x":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
"sogou 1.x":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"360":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Avant":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
"Green Browser":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"
}

mobileUserAgent = {
"iOS 4.33 – iPhone":"User-Agent:Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"iOS 4.33 – iPod Touch":"User-Agent:Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"iOS 4.33 – iPad":"User-Agent:Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Android N1":"User-Agent: Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Android QQ":"User-Agent: MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Android Opera ":"User-Agent: Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
"Android Pad Moto Xoom":"User-Agent: Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"BlackBerry":"User-Agent: Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
"WebOS HP Touchpad":"User-Agent: Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
"Nokia N97":"User-Agent: Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
"Windows Phone Mango":"User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
"UC":"User-Agent: UCWEB7.0.2.37/28/999",
"UC standard":"User-Agent: NOKIA5700/ UCWEB7.0.2.37/28/999",
"UCOpenwave":"User-Agent: Openwave/ UCWEB7.0.2.37/28/999",
"UC Opera":"User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999"
}
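
As a minimal sketch of the random selection mentioned above (assuming both dicts are in scope; note their values carry a leading "User-Agent:" label that has to be stripped before use):

# encoding=utf-8
import random
import urllib2

def random_header():
    # pool both lists and pick one entry at random
    ua = random.choice(pcUserAgent.values() + mobileUserAgent.values())
    ua = ua.split(":", 1)[1].strip()  # drop the leading "User-Agent:" label
    return {"User-Agent": ua}

request = urllib2.Request("http://www.baidu.com", headers=random_header())
print urllib2.urlopen(request).read(100)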

The second approach is to use the Selenium web-testing framework to make the request through a real browser; minimal code:

import selenium.webdriver   # drives a real browser

def get_page_source(url):
    driver = selenium.webdriver.Chrome()  # launch Chrome through its driver
    driver.get(url)                       # visit the target URL
    page_source = driver.page_source      # grab the rendered page HTML
    print page_source

get_page_source("http://www.baidu.com")

Selenium works by calling a browser driver installed on the OS; if the driver isn't set up, it raises an error like the one below:
[screenshot of the missing-driver error]
If you use webdriver.Chrome(), you need a chromedriver binary: download and unzip it, note its path, and change the code to:

driver = selenium.webdriver.Chrome(chrome_driver_path)  # point Selenium at the downloaded driver binary

Consistent Encoding

This mainly concerns transmitting Chinese text: if Chinese isn't URL-encoded in transit, the server receives garbage. Encoding works like this:

# encoding=utf-8
import urllib

words = {"name": "zhangsan", "address": "上海"}

print urllib.urlencode(words)  # URL-encode the parameters
print urllib.unquote(urllib.urlencode(words))  # encode, then decode back
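
For a single value rather than a whole dict, urllib.quote()/unquote() do the same job; a quick sketch:

# encoding=utf-8
import urllib

encoded = urllib.quote("上海")  # percent-encode one value
print encoded
print urllib.unquote(encoded)  # and decode it back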

GET/POST Requests

GET and POST differ mainly in how parameters are passed: GET appends them to the URL, while POST wraps them in the request body.
Use Python's Flask framework to create a simple server:

from flask import Flask, request

app = Flask(__name__)

@app.route('/')
def hello_world():
    return 'Hello World!'

@app.route("/login", methods=["POST"])
def login():
    name = request.form.to_dict().get("name", "")
    age = request.form.to_dict().get("age", "")
    return name + "-------" + age

@app.route("/query", methods=["GET"])
def query():
    age = request.args.get("age", "")
    return "this age is " + age


if __name__ == '__main__':
    app.run(
        "127.0.0.1",
        port=8090
    )

A GET request against it looks like:

import urllib
import urllib2

words = {"age": "23"}
request = urllib2.Request(url="http://127.0.0.1:8090/query?" + urllib.urlencode(words))
response = urllib2.urlopen(request)

print response.read()

And the POST request:

# encoding=utf-8
import urllib
import urllib2

info = {"name": "Tom张", "age": "20"}
info = urllib.urlencode(info)  # the POST body also needs URL-encoding

request = urllib2.Request("http://127.0.0.1:8090/login")
request.add_data(info)
response = urllib2.urlopen(request)

print response.read()
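
Note that attaching a body is what makes this a POST: once data is set via add_data() (or the data argument of urllib2.Request), urllib2 switches the request method from GET to POST.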

Downloading Images

import urllib

# urlretrieve(remote image URL, local save path); the URL here is a placeholder
urllib.urlretrieve("http://example.com/pic.jpg", "pic.jpg")

Proxies

When several scrapers share one IP and that IP gets banned, the scrapers stop working altogether, so a whole ecosystem has grown around this: search a keyword like "vps" on Taobao and you'll find all kinds of commercial proxy services, as shown below:
[screenshot: Taobao listings for proxy services]

Of course, we can also use free proxies (list checked on 2018-04-21 16:14):

https://www.kuaidaili.com/free/  ## 快代理
http://www.xicidaili.com/        ## 西刺代理


Using a proxy from Python:

import urllib2

http_proxy = urllib2.ProxyHandler({"http":"117.90.3.126:9000"})  # proxy IP and port
opener     = urllib2.build_opener(http_proxy)
request    = urllib2.Request("http://www.baidu.com")
response   = opener.open(request)

print response.read()
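
To spread requests across several addresses, here is a minimal sketch of random proxy rotation (the pool below is hypothetical; substitute live addresses from the lists above):

import random
import urllib2

proxy_pool = ["117.90.3.126:9000", "117.90.3.126:9001"]  # hypothetical pool

def open_with_random_proxy(url):
    proxy = random.choice(proxy_pool)  # pick a proxy per request
    opener = urllib2.build_opener(urllib2.ProxyHandler({"http": proxy}))
    return opener.open(url).read()

print open_with_random_proxy("http://www.baidu.com")[:100]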

Redirects

1. Check whether a URL was redirected:

import urllib2


# check whether a URL gets redirected
def url_is_redirect(url):
    response = urllib2.urlopen(url)
    return response.geturl() != url

print url_is_redirect("http://www.baidu.cn")

2. If it was redirected, we need to get the new address:

import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        res = urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)
        res.status = code             # status code that triggered the redirect
        res.newurl = res.geturl()     # the URL we were redirected to
        print res.newurl, res.status  # show the redirect target
        return res

opener = urllib2.build_opener(RedirectHandler)
opener.open("http://www.baidu.cn")
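
Note that build_opener() accepts either handler classes or instances; when given the class, as here, it instantiates it internally.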

Cookies

Fetching related pages within one session requires cookies.
1. Getting cookies:

# encoding=utf-8
import urllib2
import cookielib

# create a cookie jar
cookie = cookielib.CookieJar()
# handler that captures cookies from responses
handler = urllib2.HTTPCookieProcessor(cookie)
# opener that routes requests through the cookie handler
opener = urllib2.build_opener(handler)

response = opener.open("http://www.baidu.com")

for data in cookie:
    print data.name + "--" + data.value

The result:

BAIDUID--2643F48FC95482FF4ECAD2EBC7DBE11E:FG=1
BIDUPSID--2643F48FC95482FF4ECAD2EBC7DBE11E
H_PS_PSSID--1466_21088_18560_22158
PSTM--1524360190
BDSVRTM--0
BD_HOME--0

2. Saving cookies to a file:

# encoding=utf-8

import urllib2
import cookielib

file_path = "cookie.txt"
cookie = cookielib.LWPCookieJar(file_path)     # cookie jar backed by a file
handler = urllib2.HTTPCookieProcessor(cookie)  # handler that captures cookies
opener = urllib2.build_opener(handler)
response = opener.open("http://www.baidu.com")

cookie.save(ignore_expires=True, ignore_discard=True)  # write the cookies out to cookie.txt

After this runs, the cookies are written to cookie.txt.
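
To read the saved cookies back on a later run, here is a minimal sketch (assuming cookie.txt was written by the code above):

# encoding=utf-8
import urllib2
import cookielib

cookie = cookielib.LWPCookieJar()
cookie.load("cookie.txt", ignore_discard=True, ignore_expires=True)  # reload saved cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open("http://www.baidu.com")  # this request carries the loaded cookies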

That's about it for the basics; I'll come back and update with the rest over time.
