colly Lesson 2: Analyzing the Visit Flow

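I. Review of the Previous Lesson

In the previous lesson we gave an overview of the colly scraping framework: what colly is, what it can do, and how to use it.
We also introduced the concept of callback functions
and walked through the example basic.go to learn colly's basic usage.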

The usage is as follows:

1. Create a Collector.

2. Set the Collector's properties.

3. Register callback functions such as OnHTML and OnRequest.

4. Call Visit to start scraping.
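The four steps can be sketched with a miniature collector built only on the standard library. This is not colly's code: miniCollector, its fields, and crawlOnce are hypothetical names invented for illustration, and the page is served by a local httptest server so the sketch runs without network access.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// miniCollector is a hypothetical, stripped-down stand-in for a collector.
// It only stores a property and lists of registered callbacks.
type miniCollector struct {
	UserAgent  string
	onRequest  []func(*http.Request)
	onResponse []func(status int, body []byte)
	onError    []func(error)
}

func (c *miniCollector) OnRequest(f func(*http.Request)) { c.onRequest = append(c.onRequest, f) }
func (c *miniCollector) OnResponse(f func(status int, body []byte)) {
	c.onResponse = append(c.onResponse, f)
}
func (c *miniCollector) OnError(f func(error)) { c.onError = append(c.onError, f) }

// fail runs the error callbacks and hands the error back to the caller.
func (c *miniCollector) fail(err error) error {
	for _, f := range c.onError {
		f(err)
	}
	return err
}

// Visit builds the request, then runs the callbacks around the HTTP round trip.
func (c *miniCollector) Visit(url string) error {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return c.fail(err)
	}
	req.Header.Set("User-Agent", c.UserAgent)
	for _, f := range c.onRequest {
		f(req)
	}
	res, err := http.DefaultClient.Do(req)
	if err != nil {
		return c.fail(err)
	}
	defer res.Body.Close()
	body, err := io.ReadAll(res.Body)
	if err != nil {
		return c.fail(err)
	}
	for _, f := range c.onResponse {
		f(res.StatusCode, body)
	}
	return nil
}

// crawlOnce performs the four steps against a local test server and
// returns the events in the order the callbacks fired.
func crawlOnce() ([]string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html><body>hello</body></html>")
	}))
	defer srv.Close()

	var events []string
	c := &miniCollector{}              // step 1: create the collector
	c.UserAgent = "mini-collector/0.1" // step 2: set its properties
	c.OnRequest(func(r *http.Request) { // step 3: register callbacks
		events = append(events, "request "+r.Method)
	})
	c.OnResponse(func(status int, body []byte) {
		events = append(events, fmt.Sprintf("response %d", status))
	})
	c.OnError(func(err error) { events = append(events, "error") })
	err := c.Visit(srv.URL) // step 4: start the visit
	return events, err
}

func main() {
	events, err := crawlOnce()
	fmt.Println(events, err)
}
```

The point of the sketch is the ordering: callbacks are only stored at registration time and run later, inside Visit, around the actual HTTP request.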

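II. This Lesson: Source Code Analysis of colly's Visit Function

2.1 Preparation before reading the code

1. Ask yourself questions

After the previous lesson you probably have some questions. How does colly actually fetch the data of a web page?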


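Answer: res, err := h.Client.Do(request)

When are the callbacks registered with c.OnRequest and c.OnHTML invoked?

Answer: in the fetch function.

How does the crawler handle the character encoding of a web page?

Answer: err = response.fixCharset(c.DetectCharset, request.ResponseCharacterEncoding)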

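2. Look for the answers in the source code with those questions in mind

3. After reading today's code, I will have my own thoughts and new questions

So the next time I read the code, I will know which answers to look for.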

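2.2 Key points for reading code

0. If possible, get the code running first so you understand what the program does. Useful methods are single-stepping in a debugger or adding log output.

1. Trace the main flow first; once you understand the overall flow, dig into the details.

If you try to grab the watermelon and the sesame seeds at the same time, you end up losing both. You have to give something up in order to gain something.

2. Always read the comments. A good open source project has thorough documentation and readable comments, and an active community matters too.

3. Running into unfamiliar topics while reading code is normal. Look them up, get a rough idea, and keep reading.

Do not spend too much time on them.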

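2.3 Source code walk-through

The HTTP HEAD method returns only the response headers, so it can be used to fetch a resource's attributes without downloading the body.

requestData carries the body of HTTP PUT and POST requests, which upload data such as login credentials or other user data.

For GET requests, requestData is usually empty (nil).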
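To make the difference between HEAD, GET, and POST concrete, here is a small standard-library sketch. The probe helper and the local httptest server are assumptions made for illustration and are not part of colly. It sends one request and reports what the server received and how many body bytes came back.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// probe sends one request to a local test server. An empty payload means
// requestData stays nil, as is usual for GET and HEAD.
func probe(method, payload string) (serverSaw string, replyBytes int, err error) {
	seen := make(chan string, 1)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		sent, _ := io.ReadAll(r.Body)
		seen <- fmt.Sprintf("%s body=%q", r.Method, sent)
		// For HEAD requests the server drops this body and sends headers only.
		fmt.Fprint(w, "page content")
	}))
	defer srv.Close()

	var requestData io.Reader
	if payload != "" {
		requestData = strings.NewReader(payload)
	}
	req, err := http.NewRequest(method, srv.URL, requestData)
	if err != nil {
		return "", 0, err
	}
	res, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", 0, err
	}
	defer res.Body.Close()
	reply, err := io.ReadAll(res.Body)
	return <-seen, len(reply), err
}

func main() {
	for _, tc := range []struct{ method, payload string }{
		{http.MethodHead, ""},
		{http.MethodGet, ""},
		{http.MethodPost, "user=alice"},
	} {
		saw, n, err := probe(tc.method, tc.payload)
		fmt.Println(saw, "reply bytes:", n, "err:", err)
	}
}
```

With HEAD the reply carries no body bytes even though the handler writes some, which is why it is a cheap way to look at a resource's headers first.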

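Domain blacklist and whitelist

The isDomainAllowed function performs the blacklist and whitelist check.

It decides whether the URL's domain may be visited, starting with whether the domain is on the blacklist.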
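A minimal sketch of such a check follows. It assumes the rule "blacklist wins, and an empty whitelist allows everything"; the domainFilter type is a hypothetical stand-in, and colly's real isDomainAllowed should be read in its source for the exact behavior.

```go
package main

import "fmt"

// domainFilter is a hypothetical holder for the two lists.
type domainFilter struct {
	AllowedDomains    []string // whitelist; empty means "allow all"
	DisallowedDomains []string // blacklist; always checked first
}

// isDomainAllowed reports whether domain passes both lists.
func (f domainFilter) isDomainAllowed(domain string) bool {
	for _, d := range f.DisallowedDomains {
		if d == domain {
			return false
		}
	}
	if len(f.AllowedDomains) == 0 {
		return true
	}
	for _, d := range f.AllowedDomains {
		if d == domain {
			return true
		}
	}
	return false
}

func main() {
	f := domainFilter{
		AllowedDomains:    []string{"example.com", "docs.example.com"},
		DisallowedDomains: []string{"ads.example.com"},
	}
	for _, d := range []string{"example.com", "ads.example.com", "other.org"} {
		fmt.Println(d, f.isDomainAllowed(d))
	}
}
```

Checking the blacklist first means a domain listed in both places is rejected, which is the safer default for a crawler.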

    if method != "HEAD" && !c.IgnoreRobotsTxt {
        if err = c.checkRobots(parsedURL); err != nil {
            return err
        }
    }

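What is this robots thing? I did not know, so let's look it up.

It turns out that robots refers to robots.txt, the Robots Exclusion Protocol, through which a site tells crawlers which paths they may visit. We do not need it here, so we skip it for now.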

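Analysis of the http.Header type (a map[string][]string, not a struct)

OnError

Called if an error occurs during the request.

OnResponse

Called after the response is received.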

    rc, ok := requestData.(io.ReadCloser)
    if !ok && requestData != nil {
        rc = ioutil.NopCloser(requestData)
    }

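rc, ok := requestData.(io.ReadCloser) is a type assertion.

It checks whether requestData can be converted to the io.ReadCloser type.

If the conversion succeeds, ok is true. Otherwise, when requestData is not nil, it is wrapped with ioutil.NopCloser, which adds a Close method that does nothing.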
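The same pattern can be tried in isolation. In this sketch, toReadCloser, describe, and demo are hypothetical helper names; io.NopCloser is used because ioutil.NopCloser is deprecated in favor of it since Go 1.16.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// toReadCloser reuses requestData if it already has a Close method,
// and otherwise wraps a non-nil reader with a no-op closer.
func toReadCloser(requestData io.Reader) (io.ReadCloser, bool) {
	rc, ok := requestData.(io.ReadCloser)
	if !ok && requestData != nil {
		rc = io.NopCloser(requestData)
	}
	return rc, ok
}

// describe names which of the three cases a given reader falls into.
func describe(requestData io.Reader) string {
	rc, ok := toReadCloser(requestData)
	switch {
	case rc == nil:
		return "nil body"
	case ok:
		return "already a ReadCloser"
	default:
		return "wrapped with NopCloser"
	}
}

// demo covers a nil body, a plain reader, and a reader that can be closed.
func demo() []string {
	return []string{
		describe(nil),
		describe(strings.NewReader("user=alice")),
		describe(io.NopCloser(strings.NewReader("user=alice"))),
	}
}

func main() {
	for _, line := range demo() {
		fmt.Println(line)
	}
}
```

A *strings.Reader has no Close method, so the assertion fails and the reader is wrapped; a nil requestData fails the assertion too, but the nil check keeps it from being wrapped.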

What are the use cases for context?

HTTP message format

    // For incoming requests, the Host header is promoted to the
    // Request.Host field and removed from the Header map.

The Host field of the HTTP headers is promoted to the Host field of the Request, that is, Request.Host.
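The quoted net/http comment can be observed with a local test server. In this sketch, hostSeenByServer and the host name example.test are assumptions chosen for illustration. The client sets Request.Host, and the handler reports both Request.Host and the Host entry of its Header map.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// hostSeenByServer sends one GET with a custom Host and returns what the
// server-side handler found in r.Host and in r.Header["Host"].
func hostSeenByServer() (hostField, hostHeader string, err error) {
	type seen struct{ field, header string }
	ch := make(chan seen, 1)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ch <- seen{field: r.Host, header: r.Header.Get("Host")}
	}))
	defer srv.Close()

	req, err := http.NewRequest(http.MethodGet, srv.URL, nil)
	if err != nil {
		return "", "", err
	}
	// On the client side, Request.Host overrides the Host header that is sent.
	req.Host = "example.test"
	res, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", "", err
	}
	res.Body.Close()

	s := <-ch
	return s.field, s.header, nil
}

func main() {
	field, header, err := hostSeenByServer()
	fmt.Printf("r.Host=%q r.Header.Get(\"Host\")=%q err=%v\n", field, header, err)
}
```

On the server side the value arrives in r.Host while the Header map no longer contains a Host entry, which is exactly what the comment describes.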

If the operation is asynchronous, a goroutine is started to do the fetching.
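The branching between synchronous and asynchronous fetching can be sketched as follows. The fetcher type and its methods are hypothetical stand-ins, and the fetch body only records the URL instead of doing a real HTTP request. The sketch assumes a sync.WaitGroup tracks in-flight work so that the caller can wait for all goroutines to finish.

```go
package main

import (
	"fmt"
	"sync"
)

// fetcher is a hypothetical stand-in for a collector with an Async switch.
type fetcher struct {
	Async bool

	wg   sync.WaitGroup
	mu   sync.Mutex
	done []string
}

// fetch stands in for the real work; here it only records the URL.
func (f *fetcher) fetch(url string) {
	defer f.wg.Done()
	f.mu.Lock()
	f.done = append(f.done, url)
	f.mu.Unlock()
}

// scrape runs fetch inline, or in a new goroutine when Async is set.
func (f *fetcher) scrape(url string) {
	f.wg.Add(1)
	if f.Async {
		go f.fetch(url)
		return
	}
	f.fetch(url)
}

// Wait blocks until every started fetch has finished.
func (f *fetcher) Wait() { f.wg.Wait() }

// Fetched returns how many URLs have been processed so far.
func (f *fetcher) Fetched() int {
	f.mu.Lock()
	defer f.mu.Unlock()
	return len(f.done)
}

func main() {
	f := &fetcher{Async: true}
	for _, u := range []string{"http://a.test/", "http://b.test/", "http://c.test/"} {
		f.scrape(u)
	}
	f.Wait() // without this, main could exit before the goroutines run
	fmt.Println("fetched:", f.Fetched())
}
```

In asynchronous mode scrape returns immediately, so the caller has to call Wait before reading results or exiting.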

colly's own Request struct

    // Request is the representation of a HTTP request made by a Collector
    type Request struct {
        // URL is the parsed URL of the HTTP request
        URL *url.URL
        // Headers contains the Request's HTTP headers
        Headers *http.Header
        // Ctx is a context between a Request and a Response
        Ctx *Context
        // Depth is the number of the parents of the request
        Depth int
        // Method is the HTTP method of the request
        Method string
        // Body is the request body which is used on POST/PUT requests
        Body io.Reader
        // ResponseCharacterEncoding is the character encoding of the response body.
        // Leave it blank to allow automatic character encoding of the response body.
        // It is empty by default and it can be set in OnRequest callback.
        ResponseCharacterEncoding string
        // ID is the Unique identifier of the request
        ID        uint32
        collector *Collector
        abort     bool
        baseURL   *url.URL
        // ProxyURL is the proxy address that handles the request
        ProxyURL string
    }

Inspecting a web page shows that ResponseCharacterEncoding stands for the page's character encoding,

for example utf-8 or gbk.
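One place where such an encoding label shows up is the charset parameter of the Content-Type response header. The sketch below only extracts that label with the standard library; charsetFromContentType is a hypothetical helper, and it does not reproduce what colly's fixCharset does, nor does it convert any bytes.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType returns the lower-cased charset parameter of a
// Content-Type value, or "" when it is missing or the value is malformed.
func charsetFromContentType(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

func main() {
	for _, ct := range []string{
		"text/html; charset=utf-8",
		"text/html; charset=GBK",
		"text/html",
	} {
		fmt.Printf("%q -> %q\n", ct, charsetFromContentType(ct))
	}
}
```

When the header carries no charset, a crawler has to fall back to other hints, such as a value set explicitly on the request or detection from the page content.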
