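3 Module Development: Data Collection

3.1 Requirements

Broadly speaking, the data collection requirement has two parts.

1) Capturing user access behavior on the page. The development work involves:

1. Developing the page tracking JavaScript that captures user access behavior

2. Having the backend receive the requests sent by that JavaScript and write them to a log

This part can also be classed as the "data source"; it is usually handled by the web development team.

2) Aggregating the logs from the web servers into HDFS. This is the data collection step of the analytics system itself and is handled by the team that builds the data analysis platform. It can be implemented in several ways: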

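• Shell script

Pros: lightweight and simple to develop

Cons: fault handling during collection is hard to control

• Java collection program

Pros: allows fine-grained control over the collection process

Cons: requires a lot of development work

• Flume log collection framework

A mature open-source log collection system that is itself part of the Hadoop ecosystem, so it integrates naturally with the other Hadoop components and is highly extensible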
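To make the trade-off of the shell-script option concrete, here is a minimal sketch of such a collector. It is not the approach this project uses. The directory defaults, the `.done` suffix, and the function names are assumptions chosen for illustration; only `hdfs dfs -put` is a real Hadoop command. Note that the only fault handling is the `if` around the upload, and anything beyond that (retries, partial uploads, duplicates) would have to be written by hand.

```shell
#!/bin/sh
# Hypothetical locations; adjust them to the real environment.
LOG_DIR="${LOG_DIR:-/var/log/httpd}"
HDFS_DIR="${HDFS_DIR:-/home/hadoop/weblogs}"

# Print the hourly log files in directory $1 that are safe to upload:
# every file except those of the current hour ($2, formatted YYYY-MM-DD-HH),
# which the web server may still be writing.
list_finished_logs() {
    dir=$1
    current=$2
    for f in "$dir"/access_log.*.log; do
        [ -e "$f" ] || continue
        case "$f" in
            *"access_log.$current-"*) ;;      # still being written, skip it
            *) printf '%s\n' "$f" ;;
        esac
    done
}

# Upload each finished file and mark it as done only if the upload succeeded.
upload_logs() {
    list_finished_logs "$LOG_DIR" "$(date +%Y-%m-%d-%H)" | while IFS= read -r f; do
        if hdfs dfs -put "$f" "$HDFS_DIR/"; then
            mv "$f" "$f.done"
        else
            echo "upload failed, will retry on the next run: $f" >&2
        fi
    done
}
```

A scheduler such as cron could call `upload_logs` once an hour. Files renamed to `.done` no longer match the `access_log.*.log` pattern, so they are not uploaded twice.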

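3.2 Technology Selection

In a clickstream log analysis scenario, the reliability and fault-tolerance requirements on data collection are usually not very strict, so the general-purpose Flume log collection framework is fully adequate.

This project therefore uses Flume for log collection.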

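3.3 Building the Flume Log Collection System

1. Data source

The data analyzed in this project is the traffic log generated by the nginx servers. It is stored on each nginx server, for example: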

/var/log/httpd/access_log.2015-11-10-13-00.log

/var/log/httpd/access_log.2015-11-10-14-00.log

/var/log/httpd/access_log.2015-11-10-15-00.log

/var/log/httpd/access_log.2015-11-10-16-00.log

2. Sample data

The exact content of the data does not matter much at the collection stage. A sample line:

58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"

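Field breakdown:

1. Visitor IP address: 58.215.204.118

2. Visitor user information: - -

3. Request time: [18/Sep/2013:06:51:35 +0000]

4. Request method: GET

5. Requested URL: /wp-includes/js/jquery/jquery.js?ver=1.10.2

6. Request protocol: HTTP/1.1

7. Response status code: 304

8. Bytes sent in the response: 0

9. Referrer URL: http://blog.fens.me/nodejs-socketio-chat/

10. Visitor's browser (user agent): Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0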
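The ten fields listed above can be pulled out of a line with standard tools. The following is a minimal sketch, not part of the collection pipeline. The function name `parse_access_line` and the `key=value` output format are assumptions for illustration. It splits the line on double quotes first and then on spaces, so it would misparse a line whose URL or user agent contains an escaped double quote.

```shell
#!/bin/sh
# Split one access-log line (passed as $1) into the ten fields described above
# and print them as key=value pairs, one per line.
parse_access_line() {
    printf '%s\n' "$1" | awk -F'"' '{
        split($1, head, " ")   # ip, ident, user, [time, zone]
        split($2, req, " ")    # method, url, protocol
        split($3, resp, " ")   # status, bytes
        print "ip=" head[1]
        print "user_info=" head[2] " " head[3]
        # drop the leading [ and the trailing ] around the timestamp
        print "time=" substr(head[4], 2) " " substr(head[5], 1, length(head[5]) - 1)
        print "method=" req[1]
        print "url=" req[2]
        print "protocol=" req[3]
        print "status=" resp[1]
        print "bytes=" resp[2]
        print "referer=" $4
        print "user_agent=" $6
    }'
}
```

For example, `parse_access_line "$(head -n 1 access.log)"` prints the fields of the first line of a log file (the file name here is a placeholder).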

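3. Collection implementation

Setting up the Flume collection system is relatively simple:

1. Deploy an agent node on each web server and edit its configuration file

2. Start the agent node so that the collected data is aggregated into the specified HDFS directory

As shown in the figure below:

• Version: apache-flume-1.6.0

• Collection rule design:

1. Source: the nginx server log directory

2. Destination: the HDFS directory /home/hadoop/weblogs/

• Collection rule configuration details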

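# Test data generator: append one line to the log every 0.5 seconds
while true
do
    echo 111111 >> /home/hadoop/log/test.log
    sleep 0.5
done

# In another terminal, watch the log grow
tail -F /home/hadoop/log/test.log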

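When collecting into HDFS, the directories named in the configuration do not need to be created by hand.

bin/flume-ng agent --conf conf --conf-file conf/tail-hdfs.conf --name a1 -Dflume.root.logger=INFO,console

Check the result in the HDFS web UI at master:50070, under the directory /flume/events/16-04-20/

Start command (the same command with short options and without console logging):
bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n a1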
################################################################

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

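# The exec source runs a command and turns each line of its output into an event
# Describe/configure the source
a1.sources.r1.type = exec
# tail -F follows the file by name (it reopens the file after rotation); tail -f follows it by inode
a1.sources.r1.command = tail -F /home/hadoop/log/test.log
a1.sources.r1.channels = c1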

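# Describe the sink
# The sink is the destination the events are written to
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# Target directory; Flume substitutes the %-escape sequences with the event time
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
# Prefix of the generated file names
a1.sinks.k1.hdfs.filePrefix = events-

# Round the timestamp down to a multiple of 10 minutes, so the directory changes every 10 minutes
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

# Time to wait before rolling the current file (seconds)
a1.sinks.k1.hdfs.rollInterval = 3

# File size that triggers a roll (bytes)
a1.sinks.k1.hdfs.rollSize = 500

# Number of events written to a file before it is rolled
a1.sinks.k1.hdfs.rollCount = 20

# Number of events written before the file is flushed to HDFS
a1.sinks.k1.hdfs.batchSize = 5

# Use the local time, not a timestamp from the event header, when substituting the directory escapes
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# File type of the output; the default is SequenceFile, and DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream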

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
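With `hdfs.path = /flume/events/%y-%m-%d/%H%M/` and the rounding settings above, each event lands in a directory named after its timestamp rounded down to a multiple of 10 minutes. The helper below is a small sketch for working out which directory to look in. The function name `flume_bucket_dir` is hypothetical, and the path and the 10-minute unit are hard-coded from this configuration.

```shell
#!/bin/sh
# Print the HDFS directory for an event with the given local time, assuming
# hdfs.path = /flume/events/%y-%m-%d/%H%M/, round = true, roundValue = 10,
# roundUnit = minute.
# Arguments: date as yy-mm-dd, hour (00-23), minute (00-59).
flume_bucket_dir() {
    day=$1
    hour=${2#0}      # strip one leading zero so "08" is not read as octal
    minute=${3#0}
    rounded=$(( ${minute:-0} / 10 * 10 ))   # round down to a multiple of 10
    printf '/flume/events/%s/%02d%02d/\n' "$day" "${hour:-0}" "$rounded"
}
```

For instance, `flume_bucket_dir 16-04-20 13 47` prints `/flume/events/16-04-20/1340/`. On a machine with the Hadoop client installed, `hdfs dfs -ls "$(flume_bucket_dir "$(date +%y-%m-%d)" "$(date +%H)" "$(date +%M)")"` lists the files Flume is currently writing.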
