Advanced Flume Applications


1. Flume Core Components

1.1 Source Types

The source is responsible for receiving events (or producing them via special mechanisms) and putting them, in batches, into one or more channels.

1.1.1 netcat

The data comes from a network port on a host: whenever data arrives on the specified host and port, it is collected as the source.

a1.sources.r1.type = netcat
a1.sources.r1.bind = host to listen on
a1.sources.r1.port = port to listen on
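
For reference, a minimal end-to-end netcat sketch is shown below; the host name hadoop01, the port 44444, and the logger sink are only assumptions for illustration (the later examples follow the same structure).

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# netcat source listening on an assumed host/port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44444

# memory channel
a1.channels.c1.type = memory

# logger sink prints events to the console
a1.sinks.k1.type = logger

# bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Once the agent is running, lines typed into nc hadoop01 44444 (or telnet hadoop01 44444) arrive as events and are printed by the logger sink.
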
1.1.2 exec

The data source is the output of a Unix command: Flume monitors the result of running a command against a file. Commonly used commands are cat, tail, and head.

a1.sources.r1.type = exec
a1.sources.r1.command = the command to run, e.g. tail -F /path/to/file

Example (create the file 1.txt yourself):

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: exec, tails the file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/hadoop/datas/1.txt

# channel
a1.channels.c1.type = memory

# sink type: logger, prints events to the console
a1.sinks.k1.type = logger

# bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex01_exec.conf --name a1 -Dflume.root.logger=INFO,console
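
To generate test data while this agent is running, append lines to the monitored file from another terminal; they should then appear in the logger sink's console output (file path as assumed in the example above):

echo "hello flume" >> /home/hadoop/datas/1.txt
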
1.1.3 Spooling Directory

The data source is all files under a directory, for example /datas containing 1.txt, 2.txt, 3.txt.

a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = directory to collect files from

Example (create the datas directory yourself):

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: spooldir, watches a directory
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/hadoop/datas
a1.sources.r1.fileSuffix = .finished

# channel
a1.channels.c1.type = memory

# sink type: logger, prints events to the console
a1.sinks.k1.type = logger

# bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex02_spool.conf --name a1 -Dflume.root.logger=INFO,console
1.1.4 avro

The data comes in through an Avro port. This is generally used to chain multiple agents, i.e. one agent sends data to another:
agent1 -- avro port -- agent2

a1.sources.r1.type = avro
a1.sources.r1.bind = host to bind; must match the hostname configured in the upstream avro sink
a1.sources.r1.port = port to bind; must match the port configured in the upstream avro sink

Example (create the conf files yourself):

First, plan the agents:

  • agent1 (hadoop01): source -- netcat, channel -- memory, sink -- avro
  • agent2 (hadoop02): source -- avro, channel -- memory, sink -- logger

agent1:

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: netcat, listens on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44455

# channel
a1.channels.c1.type = memory

# sink type: avro, forwards events to agent2
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop02
a1.sinks.k1.port = 44466

# bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

agent2:

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: avro, receives events from agent1
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop02
a1.sources.r1.port = 44466

# channel
a1.channels.c1.type = memory

# sink type: logger, prints events to the console
a1.sinks.k1.type = logger

# bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

First start agent2 on hadoop02:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex03_agent02_avrosource.conf --name a1 -Dflume.root.logger=INFO,console

Then start agent1 on hadoop01:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex03_agent01_avrosink.conf --name a1 -Dflume.root.logger=INFO,console

1.2 Channel Types

A channel is a passive buffer that holds events between the source and the sink; the implementations covered here are the memory, file, and JDBC channels.

1.2.1 memory channel
# events are stored in memory
a1.channels.c1.type = memory
# maximum number of events held in the channel
a1.channels.c1.capacity = 10000
# maximum number of events per transaction (must not exceed capacity)
a1.channels.c1.transactionCapacity = 10000
1.2.2 file channel

Backed by disk: events survive an agent restart, at the cost of lower throughput than the memory channel.
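
A minimal sketch, assuming local directories that you create yourself (checkpointDir and dataDirs are the standard file channel properties):

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /home/hadoop/flume/checkpoint
a1.channels.c1.dataDirs = /home/hadoop/flume/data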

1.2.3 jdbc

Events are stored in a database via JDBC (by default the embedded Derby database).
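
A minimal sketch; with no extra properties the JDBC channel falls back to its embedded Derby database:

a1.channels.c1.type = jdbc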

1.3 Sink Types

The sink is responsible for delivering events to the next hop or the final destination, and removes them from the channel once delivery succeeds.

1.3.1 avro

Sends events through an Avro port to another agent (paired with an avro source on the receiving side).

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = target host, usually the host of the next agent
a1.sinks.k1.port = target port
1.3.2 logger

Prints events to the console.

1.3.3 hdfs

Writes the collected events to HDFS.

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = HDFS path to write to

Example (create the file 2.txt yourself):

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: exec, tails the file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/hadoop/datas/2.txt

# channel
a1.channels.c1.type = memory

# sink type: hdfs, writes events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/data

# bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex04_hdfs_sink.conf --name a1 -Dflume.root.logger=INFO,console
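
To verify, append a few lines to 2.txt and list the output directory; Flume creates files under the configured path (the file names are generated by the sink):

echo "hdfs sink test" >> /home/hadoop/datas/2.txt
hdfs dfs -ls /flume/data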

2. Interceptors

Interceptors intercept events at the source and do some initial processing, for example adding a marker to each event; the sink side can then handle events with different markers differently: event{header:{k=v}}

2.1 Timestamp Interceptor

Timestamp Interceptor: intercepts events at the source and adds a timestamp to the header, e.g. header{timestamp=142526273}.

# declare the interceptor alias
a1.sources.r1.interceptors = i1
# specify the interceptor type
a1.sources.r1.interceptors.i1.type = timestamp

Example (write the conf file yourself):

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: netcat, listens on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44477

# interceptor for the source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# channel
a1.channels.c1.type = memory

# sink type: logger, prints events to the console
a1.sinks.k1.type = logger

# bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex05_intc01_time.conf --name a1 -Dflume.root.logger=INFO,console
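
To test, connect to the netcat source from another terminal and type a line, e.g. with nc hadoop01 44477 (or telnet). The logger sink should then print something like the following, with the actual timestamp value differing:

Event: { headers:{timestamp=<epoch millis>} body: 68 65 6C 6C 6F    hello }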

2.2 Host Interceptor

Host Interceptor: adds the agent machine's IP address (or its host name, when useIP = false) to the event header.

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i1.useIP = false

Example (write the conf file yourself):

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: netcat, listens on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44477

# interceptor for the source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
# a1.sources.r1.interceptors.i1.useIP = false

# channel
a1.channels.c1.type = memory

# sink type: logger, prints events to the console
a1.sinks.k1.type = logger

# bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex06_host.conf --name a1 -Dflume.root.logger=INFO,console

2.3 Static Interceptor

Static Interceptor: lets you manually specify a key=value pair to add to the header (the most commonly used interceptor).

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = NEW_YORK

Example (write the conf file yourself):

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: netcat, listens on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44477

# interceptor for the source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = netcat-hadoop01-44477

# channel
a1.channels.c1.type = memory

# sink type: logger, prints events to the console
a1.sinks.k1.type = logger

# bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex07_static.conf --name a1 -Dflume.root.logger=INFO,console

2.4 Multiple Interceptors

Example (write the conf file yourself):

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: netcat, listens on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44477

# interceptors for the source
a1.sources.r1.interceptors = i1 i2 i3
# first interceptor
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = netcat-hadoop01-44477

# second interceptor
a1.sources.r1.interceptors.i2.type = host

# third interceptor
a1.sources.r1.interceptors.i3.type = timestamp

# channel
a1.channels.c1.type = memory

# sink type: logger, prints events to the console
a1.sinks.k1.type = logger

# bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex08.conf --name a1 -Dflume.root.logger=INFO,console

Comprehensive example: two log servers, A and B, produce logs in real time, mainly access.log, nginx.log, and web.log.
Requirement: collect access.log, nginx.log, and web.log from machines A and B, aggregate them on machine C, and then write everything to HDFS.

agent1 and agent2 (the same configuration is used on both A and B):

# the name of this agent
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1

# sources
# r1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/hadoop/datas/log/access.log
# interceptor for r1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = logname
a1.sources.r1.interceptors.i1.value = access

# r2
a1.sources.r2.type = exec
a1.sources.r2.command = tail -f /home/hadoop/datas/log/nginx.log

a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = logname
a1.sources.r2.interceptors.i2.value = nginx

# r3
a1.sources.r3.type = exec
a1.sources.r3.command = tail -f /home/hadoop/datas/log/web.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = logname
a1.sources.r3.interceptors.i3.value = web

# channel
a1.channels.c1.type = memory

# sink type: avro, forwards events to agent3
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop03
a1.sinks.k1.port = 55566

# bind the sources and sink to the channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1

agent3 (machine C, hadoop03):

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: avro, receives events from agent1 and agent2
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop03
a1.sources.r1.port = 55566

# channel
a1.channels.c1.type = memory

# sink type: hdfs; the path uses the logname header and the local date
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /source/log/%{logname}/%Y%m%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# roll conditions: roll at 10240 bytes; 0 disables time- and count-based rolling; close files idle for 30 s
a1.sinks.k1.hdfs.rollSize = 10240
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.idleTimeout = 30

# output file format
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text

# bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start on hadoop03:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent03_zh.conf --name a1 -Dflume.root.logger=INFO,console

Start on hadoop02:

../bin/flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent02_zh.conf --name a1 -Dflume.root.logger=INFO,console

Start on hadoop01:

../bin/flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent01_zh.conf --name a1 -Dflume.root.logger=INFO,console
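
To verify, append to one of the monitored logs on hadoop01 or hadoop02 and then inspect HDFS on machine C; the second-level directory comes from the logname header and the last level from the local date:

echo "test access" >> /home/hadoop/datas/log/access.log
hdfs dfs -ls /source/log/access
hdfs dfs -cat /source/log/access/*/*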

3. High Availability Configuration

Architecture

Agent placement:
  • hadoop01: agent1
  • hadoop02: agent2, agent4
  • hadoop03: agent3, agent5

Description: agents 1, 2, and 3 collect data; agent 4 aggregates it; agent 5 is the backup collector.

agent1, agent2, agent3:

# agent name: agent1
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2

# source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/datas/log/web.log
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = Type
a1.sources.r1.interceptors.i1.value = LOGIN
a1.sources.r1.interceptors.i2.type = timestamp

# channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# sinks
# put both sinks into one sink group
a1.sinkgroups = g1
# set k1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop02
a1.sinks.k1.port = 52020

# set k2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop03
a1.sinks.k2.port = 52020

# configure the sink group for high availability
a1.sinkgroups.g1.sinks = k1 k2
# failover sink processor
a1.sinkgroups.g1.processor.type = failover
# priorities: the sink with the higher number is preferred
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 1
# maximum backoff period (ms) for a failed sink
a1.sinkgroups.g1.processor.maxpenalty = 10000
# bindings
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

agent4

# agent name
a2.sources = r1
a2.channels = c1
a2.sinks = k1

# source
a2.sources.r1.type = avro
## use the host name of the machine this agent runs on
a2.sources.r1.bind = hadoop02
a2.sources.r1.port = 52020
a2.sources.r1.interceptors = i1
a2.sources.r1.interceptors.i1.type = static
a2.sources.r1.interceptors.i1.key = Collector
# use the host name of the machine this agent runs on
a2.sources.r1.interceptors.i1.value = hadoop02

# channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# sink: logger
a2.sinks.k1.type = logger

a2.sinks.k1.channel = c1
a2.sources.r1.channels = c1

agent5

# agent name
a2.sources = r1
a2.channels = c1
a2.sinks = k1

# source
a2.sources.r1.type = avro
## use the host name of the machine this agent runs on
a2.sources.r1.bind = hadoop03
a2.sources.r1.port = 52020
a2.sources.r1.interceptors = i1
a2.sources.r1.interceptors.i1.type = static
a2.sources.r1.interceptors.i1.key = Collector
# use the host name of the machine this agent runs on
a2.sources.r1.interceptors.i1.value = hadoop03

# channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# sink: logger
a2.sinks.k1.type = logger

a2.sinks.k1.channel = c1
a2.sources.r1.channels = c1

First start agent4 on hadoop02:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent04.conf --name a2 -Dflume.root.logger=INFO,console 

Then start agent5 on hadoop03:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent05.conf --name a2 -Dflume.root.logger=INFO,console 

Finally, start agent1, agent2, and agent3:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent01.conf --name a1 -Dflume.root.logger=INFO,console
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent02.conf --name a1 -Dflume.root.logger=INFO,console
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent03.conf --name a1 -Dflume.root.logger=INFO,console
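
To verify the failover, append to web.log on hadoop01 and confirm the events show up in the console of agent4 on hadoop02 (the higher-priority sink); then stop the hadoop02 agent with Ctrl+C, append again, and the same events should now be printed by agent5 on hadoop03:

echo "failover test" >> /home/hadoop/datas/log/web.log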
