[Python][Crawler 05] Fiddler and HTTP Requests: Scraping Douban Ratings and Resource Links (Part 2)

>Scraping Resources

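The previous post gave the code for scraping movie information from Douban. This time we use that information to scrape download links from resource sites.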

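Take the site www.loldyttw.net as an example. Searching for “让子弹飞” (Let the Bullets Fly) in its search box submits the form as a POST request, which is then redirected to a new address: http://www.loldyttw.net/e/search/result/?searchid=274. The URL alone seems to contain nothing related to “让子弹飞”, so how do we simulate the HTTP request in Python to get the corresponding page?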
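
To see how little the result URL reveals, it can be taken apart with the standard library. This is only a minimal sketch (Python 3); the variable names are illustrative and not part of the crawler itself:

```python
from urllib.parse import urlparse, parse_qs

result_url = 'http://www.loldyttw.net/e/search/result/?searchid=274'

parts = urlparse(result_url)
query = parse_qs(parts.query)

print(parts.path)   # path of the result page
print(query)        # the only query parameter is an opaque search id
```

The keyword never appears in the URL, which suggests the server stores the search and hands back just an id. That is why the POST request itself has to be inspected.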

>Fiddler

Fiddler is a free packet-capture tool that is easy to find and download. We use it to capture the POST request sent when the form is submitted.

As the capture shows, the form submits five parameters in total: show, tbname, tempid, keyboard and submit.

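Here the value of keyboard is “让子弹飞” encoded as GB2312 and then URL-encoded, and the value of submit is “搜索一下” (the value attribute of the form's submit button in the HTML page). The only value we need to care about is keyboard.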
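
The encoding step described above can be reproduced in a few lines. This is a minimal sketch assuming Python 3, where `quote` accepts raw bytes; the variable names are illustrative:

```python
from urllib.parse import quote, unquote_to_bytes

keyword = '让子弹飞'

gb_bytes = keyword.encode('gb2312')   # encode the text as GB2312 bytes
encoded = quote(gb_bytes)             # percent-encode each byte
utf8_encoded = quote(keyword)         # for comparison: quote() defaults to UTF-8

print(encoded)
print(utf8_encoded)

# Round trip: undo the percent-encoding, then decode the GB2312 bytes
assert unquote_to_bytes(encoded).decode('gb2312') == keyword
```

The two printed strings differ, so percent-encoding the keyword with the default UTF-8 would not reproduce the value seen in Fiddler.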

Which endpoint does the form call? The request line in the captured request makes this easy to read: POST /e/search/index.php HTTP/1.1. So the endpoint is http://www.loldyttw.net/e/search/index.php.
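
The same reasoning can be written out as code: split the request line and join its path onto the host. A minimal sketch (Python 3), assuming the site is served over plain http as in the URLs above:

```python
from urllib.parse import urljoin

request_line = 'POST /e/search/index.php HTTP/1.1'   # as read in Fiddler
host = 'www.loldyttw.net'                            # host of the site being searched

# A request line has three space-separated parts: method, path, protocol version
method, path, version = request_line.split(' ')
search_api = urljoin('http://' + host, path)

print(method, search_api)
```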

>Forging the Request and Parsing the Page

Based on the analysis above, we need to forge a POST request to that endpoint. First, the request headers:

header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0',
        'Content-Type': 'application/x-www-form-urlencoded'  # form submission
    } # MyBlog @See http://blog.csdn.net/shenpibaipao

Next, the parameter table (my Python script file is encoded in UTF-8):

data = {
        'show': 'title,newstext',
        'tbname': 'download',
        'tempid': '1',
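        'keyboard': title.encode('gbk'),  # the text must go from UTF-8 to unicode before it can be encoded as GB2312; GB2312 cannot encode some rare characters, so GBK is used
        'submit': u'搜索一下'.encode('gb2312')  # the script file is UTF-8, so this literal is written explicitly as unicode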
    }
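
To check what actually goes over the wire, the form body can be built offline, without sending anything. A minimal sketch assuming Python 3; `build_search_body` is a hypothetical helper introduced here for illustration, not part of the original crawler:

```python
from urllib.parse import urlencode


def build_search_body(title):
    """Return the URL-encoded form body for a search (hypothetical helper)."""
    data = {
        'show': 'title,newstext',
        'tbname': 'download',
        'tempid': '1',
        'keyboard': title.encode('gbk'),        # bytes are percent-encoded as-is
        'submit': '搜索一下'.encode('gb2312'),
    }
    return urlencode(data)


body = build_search_body('让子弹飞')
print(body)

# A character outside GB2312 still encodes under GBK
rare = '镕'
try:
    rare.encode('gb2312')
    gb2312_ok = True
except UnicodeEncodeError:
    gb2312_ok = False
print(gb2312_ok, rare.encode('gbk'))
```

For titles made of common characters the GBK and GB2312 bytes are identical, so choosing GBK only changes the outcome for rare characters that GB2312 would reject.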

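The rest is straightforward. [Crawler 01] to [Crawler 03] already covered a lot of similar page analysis, so I won't repeat it here. As long as the Chinese encodings are handled carefully, the code is easy to write.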

>Code

# coding=utf-8
# Written for Python 3 (on Python 2, urlencode lives in urllib instead of urllib.parse)
from urllib.parse import urlencode

import requests
from bs4 import BeautifulSoup


def get_lol_movie_ftp_link(title):
    """
    Search the site for a movie and return its first resource link.
    The site uses GB2312 for URL encoding.
    :param title: movie title to search for (unicode string)
    :return: resource link, or None if nothing was found
    """
    search_api = 'http://www.loldyttw.net/e/search/index.php'  # search endpoint
    movie_root = 'http://www.loldyttw.net'  # root address for movie pages
    # Form parameters
    data = {
        'show': 'title,newstext',
        'tbname': 'download',
        'tempid': '1',
        # GB2312 cannot encode some rare characters, so GBK is used instead
        'keyboard': title.encode('gbk'),
        'submit': u'搜索一下'.encode('gb2312')
    }
    # Request headers
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0',
        'Content-Type': 'application/x-www-form-urlencoded'
    }
    response = requests.post(search_api, data=urlencode(data), headers=header)
    bs = BeautifulSoup(response.content, "lxml")
    # Parse the result page: follow only the first search hit
    first_hit = bs.select_one('ol label a')
    if first_hit is None or not first_hit.get('href'):
        return None  # no search results
    response = requests.get(movie_root + first_hit['href'])
    bs = BeautifulSoup(response.content, "lxml")
    # Parse the movie page
    for a in bs.select('.downurl a'):
        if a.get('href'):
            return a['href']  # for now, take only the first resource link
    return None
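
The function above builds the movie page address by concatenating the site root and the link's href, which works as long as every href is a root-relative path. If the site ever emitted absolute links, `urljoin` would handle both cases. A minimal sketch (Python 3); the href values below are hypothetical, for illustration only:

```python
from urllib.parse import urljoin

site_root = 'http://www.loldyttw.net'

# Hypothetical href values, not taken from the real site
relative_href = '/some/path/1.html'
absolute_href = 'http://www.loldyttw.net/some/path/1.html'

from_relative = urljoin(site_root, relative_href)   # root + path
from_absolute = urljoin(site_root, absolute_href)   # absolute URL is kept unchanged

print(from_relative)
print(from_absolute)
```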

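Pass a movie title to this function and it scrapes the resource link from the site. Combined with the Douban crawler from the previous post, this lets you scrape highly rated Douban movies and automatically get their resource links.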
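
One way the two pieces might be glued together is sketched below. It assumes the Douban crawler yields (title, rating) pairs, which is an assumption about its output shape; `attach_links` is a hypothetical helper, and the sample data and lookup are made-up stand-ins so the sketch runs offline:

```python
def attach_links(movies, find_link):
    """Pair each (title, rating) with the link returned by find_link(title)."""
    results = []
    for title, rating in movies:
        try:
            link = find_link(title)
        except Exception:   # a failed lookup should not stop the whole run
            link = None
        results.append((title, rating, link))
    return results


# Made-up values standing in for the Douban results and the site lookup
sample_movies = [('让子弹飞', 9.0), ('Unknown Title', 7.0)]
fake_links = {'让子弹飞': 'ftp://example.invalid/movie.mkv'}

for row in attach_links(sample_movies, fake_links.get):
    print(row)
```

In real use, `get_lol_movie_ftp_link` would be passed in place of `fake_links.get`. Titles with no resource simply come back with None as the link.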