Big Data Hands-On No.4: Flume Cluster Setup

Chapter 1 Cluster Planning

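This Flume cluster runs in load-balancing and failover mode. I prepared three machines with Flume installed. webapp200 is the application server; the Flume agent there collects the application's logs and forwards them through two avro sinks to flume130 and flume131, which in turn write the data to HDFS. (Note: for high throughput, the channels can be replaced with Kafka channels.)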

            webapp200   flume130   flume131
sources     TAILDIR     avro       avro
channels    file        file       file
sinks       avro        hdfs       hdfs

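Chapter 2 Flow Diagram

webapp200 (TAILDIR source -> file channel -> avro sinks k1, k2) -> flume130 / flume131 (avro source -> file channel -> hdfs sink) -> HDFS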

Chapter 3 Download and Installation

3.1 Download Location

Official site: http://flume.apache.org/

3.2 Extract

Extract the archive to the /opt/module/ directory

$ tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /opt/module/

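3.3 Configure Environment Variables

Configure JAVA_HOME

Rename the configuration file template (run from the Flume installation directory)

$ mv conf/flume-env.sh.template conf/flume-env.sh

Edit flume-env.sh

$ vi conf/flume-env.sh

Set JAVA_HOME to the path of your own JDK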

export JAVA_HOME=/opt/module/jdk1.8.0_221
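A wrong JAVA_HOME only shows up when the agent is started, so it can be worth checking the path first. The function below is a minimal sketch; the name java_home_ok is made up for this example and is not part of Flume.

```shell
#!/usr/bin/env bash
# java_home_ok DIR
# Succeeds if DIR/bin/java exists and is executable.
# (java_home_ok is a hypothetical helper, not a Flume command.)
java_home_ok() {
  if [ -x "$1/bin/java" ]; then
    echo "ok: $1/bin/java"
  else
    echo "not found or not executable: $1/bin/java" >&2
    return 1
  fi
}
```

With the path from this article it would be called as: java_home_ok /opt/module/jdk1.8.0_221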

Once the configuration is done, distribute the Flume directory to the other machines
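One way to do the distribution is a small loop over the target hosts with scp. This is only a sketch: it assumes passwordless SSH to flume130 and flume131 and that /opt/module/ exists on both. The function name and the DRY_RUN switch are made up for this example.

```shell
#!/usr/bin/env bash
# Paths and host names are the ones used in this article; adjust as needed.
FLUME_HOME=/opt/module/apache-flume-1.9.0-bin
FLUME_HOSTS="flume130 flume131"

# distribute_flume
# Copies the Flume directory to /opt/module/ on every host in FLUME_HOSTS.
# With DRY_RUN=1 the scp commands are only printed, not executed.
distribute_flume() {
  local host
  for host in $FLUME_HOSTS; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "scp -r $FLUME_HOME $host:/opt/module/"
    else
      scp -r "$FLUME_HOME" "$host:/opt/module/"
    fi
  done
}
```

Running DRY_RUN=1 distribute_flume first shows the commands that would be executed.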

Chapter 4 Configure the Agents

4.1 Agent on webapp200

Create job/taildir-file-avro.conf under the Flume installation directory and add the following content:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /opt/module/apache-flume-1.9.0-bin/position/taildir_position.json
a1.sources.r1.filegroups = f1
# The file name part is a regular expression, not a shell glob
a1.sources.r1.filegroups.f1 = /opt/logs/info.*.log.*

# Describe the sinkgroups
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.selector.maxTimeOut=10000

#Define the sink k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = flume130
a1.sinks.k1.port = 4545

#Define the sink k2
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c1
a1.sinks.k2.hostname = flume131
a1.sinks.k2.port = 4545

# Use a channel which buffers events on disk (file channel)
a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/module/apache-flume-1.9.0-bin/data/checkpoint/balance
a1.channels.c1.dataDirs=/opt/module/apache-flume-1.9.0-bin/data/balance
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
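Two details of this configuration are easy to get wrong: a sink group may only list sinks that the agent declares in a1.sinks, and the TAILDIR source treats the file name part of a filegroup as a regular expression rather than a shell glob. The helpers below are a rough sketch for checking both before starting the agent. Their names are made up for this example, and the whole-name match done with grep -x is my assumption about how the source applies the pattern.

```shell
#!/usr/bin/env bash
# undeclared_group_sinks CONF AGENT GROUP
# Prints every sink listed in the sink group that is missing from "<agent>.sinks".
# Returns 1 if at least one such sink is found, 0 otherwise.
undeclared_group_sinks() {
  local conf=$1 agent=$2 group=$3 declared grouped sink status=0
  # Value of the "<agent>.sinks = ..." line
  declared=$(sed -n "s/^[[:space:]]*$agent\.sinks[[:space:]]*=[[:space:]]*//p" "$conf")
  # Value of the "<agent>.sinkgroups.<group>.sinks = ..." line
  grouped=$(sed -n "s/^[[:space:]]*$agent\.sinkgroups\.$group\.sinks[[:space:]]*=[[:space:]]*//p" "$conf")
  for sink in $grouped; do
    case " $declared " in
      *" $sink "*) ;;                      # declared, nothing to report
      *) echo "$sink"; status=1 ;;         # listed in the group but not declared
    esac
  done
  return "$status"
}

# filename_matches REGEX NAME
# Succeeds if NAME matches REGEX as a whole (assumed whole-name matching).
filename_matches() {
  printf '%s\n' "$2" | grep -Eqx -- "$1"
}
```

For example, filename_matches 'info.*.log.*' 'info.log.1' succeeds, while the glob-style pattern 'info*.log*' does not match that name, because as a regular expression it reads "inf", any number of "o", one arbitrary character, "lo", any number of "g".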

4.2 Agents on flume130 and flume131

Both machines use the same agent configuration:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4545

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://hadoop100:9000/flume/events/%y-%m-%d/%H
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.batchSize=100
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.rollInterval=0
a1.sinks.k1.hdfs.rollSize=134217700
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = hour

# Use a channel which buffers events on disk (file channel)
a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/module/apache-flume-1.9.0-bin/data/checkpoint/balance
a1.channels.c1.dataDirs=/opt/module/apache-flume-1.9.0-bin/data/balance
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

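Note: the HDFS sink needs the matching Hadoop jar files and XML configuration files to be placed under the Flume directory. Download and usage link (tested by the author; mind the versions): "Jars required by the Flume HDFS sink (Flume 1.9.0, Hadoop 3.1.2)".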
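The escape sequences in hdfs.path (%y two-digit year, %m month, %d day, %H hour 00-23) have the same meaning in date(1), so the directory the sink writes to at the current local time can be previewed from the shell. This is a small sketch; the function name is made up for this example.

```shell
#!/usr/bin/env bash
# current_bucket_dir
# Prints the directory that /flume/events/%y-%m-%d/%H resolves to
# for the current local time (hdfs.useLocalTimeStamp = true).
current_bucket_dir() {
  date "+/flume/events/%y-%m-%d/%H"
}
```

Assuming the hdfs client is on the PATH, the files written during the current hour could then be listed with: hdfs dfs -ls "hdfs://hadoop100:9000$(current_bucket_dir)"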

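Chapter 5 Start Flume

Start command (run from the Flume installation directory). The command below starts the agent on webapp200; on flume130 and flume131, pass the configuration file from section 4.2 to -f instead, and start those two agents first so that the avro sinks have a listening source to connect to.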

$ bin/flume-ng agent -n a1 -c conf -f job/taildir-file-avro.conf

To run the agent in the background, append & to the command

$ nohup bin/flume-ng agent -n a1 -c conf -f job/taildir-file-avro.conf &

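Prefixing the command with nohup additionally sends the log output that would otherwise appear on the console to nohup.out in the current working directory.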
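If you prefer a named log file over nohup.out and want the process ID at hand for later, the start command can be wrapped as follows. This is a sketch under assumptions: FLUME_BIN, flume.log and flume.pid are names chosen for this example, not Flume conventions.

```shell
#!/usr/bin/env bash
# start_flume CONF_FILE
# Starts agent a1 in the background, redirects its console output to flume.log
# and writes the PID of the launched process to flume.pid.
# FLUME_BIN can override the launcher path (defaults to bin/flume-ng).
start_flume() {
  local conf_file=$1
  nohup "${FLUME_BIN:-bin/flume-ng}" agent -n a1 -c conf -f "$conf_file" \
    > flume.log 2>&1 &
  echo $! > flume.pid
}
```

Run from the Flume installation directory it would be called as: start_flume job/taildir-file-avro.conf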

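Chapter 6 Stop Flume

Flume has no dedicated stop command once an agent is running; the process has to be killed.

Find the ID of the process using port 4545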

$ netstat -nap | grep 4545

Or find the Flume process directly with jps, then kill it

$ kill [pid]
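A plain kill sends SIGTERM, which gives the JVM the chance to run its shutdown hooks, whereas kill -9 does not. The sketch below sends SIGTERM first and only falls back to kill -9 if the process is still alive after a timeout. The function name and the 10-second default are my own choices for this example.

```shell
#!/usr/bin/env bash
# stop_gracefully PID [TIMEOUT_SECONDS]
# Sends SIGTERM, waits up to TIMEOUT_SECONDS (default 10) for the process
# to exit, then sends SIGKILL as a last resort.
# Returns 1 if the process could not be signalled at all.
stop_gracefully() {
  local pid=$1 timeout=${2:-10} waited=0
  kill "$pid" 2>/dev/null || return 1        # default signal is SIGTERM
  while kill -0 "$pid" 2>/dev/null; do       # process still exists?
    if [ "$waited" -ge "$timeout" ]; then
      kill -9 "$pid" 2>/dev/null             # last resort
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

With the PID found via netstat or jps it would be called as: stop_gracefully [pid] 30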
