FLUME01

一、Review

1、array, map, struct

2、meta

3、join

4、compression

二、Flume

RDBMS ==> Sqoop ==> Hadoop

Logs: scattered across many servers. How do they get ==> Hadoop?

1) A crontab job writes the logs to files, which are then uploaded to Hadoop

2) Flume

1、The Flume website

Introduction:

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.

collecting: source

aggregating: channel (somewhere to stage the collected data temporarily)

moving: sink

Flume: write a configuration file that wires a source, a channel, and a sink together

Agent: simply a source, a channel, and a sink combined

Writing a Flume configuration file is really the process of configuring an agent
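Every agent definition in such a file follows the same schematic shape; the angle-bracket names below are placeholders for illustration, not real configuration:

```
# Schematic only; <agent>, <src>, <ch>, <snk> are placeholders
<agent>.sources = <src>
<agent>.channels = <ch>
<agent>.sinks = <snk>

# ...per-component properties such as .type go here...

# Wiring: a source can feed several channels; a sink drains exactly one
<agent>.sources.<src>.channels = <ch>
<agent>.sinks.<snk>.channel = <ch>
```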

Summary:

Flume is a framework that collects and aggregates log data: it moves logs from place A to place B

2、Download and install apache-flume-1.6.0-cdh5.7.0-bin

http://archive.cloudera.com/cdh5/cdh/5/

Then upload it with rz, extract it with tar, and change the owner with chown

3、Configuration

As usual, rename flume-env.sh.template to flume-env.sh

and set JAVA_HOME in it to the correct path.

Also configure the system environment variables.
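For instance, the two files might end up like this; both paths are assumptions for illustration, so use your own JDK and install locations:

```shell
# conf/flume-env.sh -- the JDK path below is an assumption
export JAVA_HOME=/usr/java/jdk1.8.0_45

# ~/.bash_profile -- put flume-ng on the PATH (install dir is an assumption)
export FLUME_HOME=/opt/app/apache-flume-1.6.0-cdh5.7.0-bin
export PATH=$FLUME_HOME/bin:$PATH
```

After editing ~/.bash_profile, run `source ~/.bash_profile` for the change to take effect in the current shell.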

4、The CDH documentation

The Apache build of Flume has more pitfalls, so the CDH documentation is less trouble to follow

URL: http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.7.0/

flume-og (old generation)

flume-ng (new generation)

5、Commands

三、Using Flume

1、Configuring Flume; from the website:

Flume agent configuration is stored in a local configuration file. This is a text file that follows the Java properties file format. Configurations for one or more agents can be specified in the same configuration file. The configuration file includes properties of each source, sink and channel in an agent and how they are wired together to form data flows.

2、Writing a configuration

$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template
Check the command-line help for details on the options.

agent_name: the name of the configured agent

-n: --name,-n <name> the name of this agent (required)

-c: --conf,-c <conf> use configs in <conf> directory

-f: --conf-file,-f <file> specify a config file (required if -z missing); points at the custom agent config file


bin/flume-ng agent --name a1 \
--conf $FLUME_HOME/conf \
--conf-file /opt/script/flume/simple-flume.conf \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=http \
-Dflume.monitoring.port=34343

3、Example conf from the CDH documentation

Collect data from a given network port and print it to the console:
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1: the agent's name
r1: the source's name
k1: the sink's name
c1: the channel's name

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory 
a1.channels.c1.capacity = 1000 
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
Point the channels setting of the source and the channel setting of the sink at the channel in use:
a1.sources.r1.channels = c1 
a1.sinks.k1.channel = c1
Write all of the above into /opt/script/flume/simple-flume.conf:

a1.sources = r1
a1.sinks = k1
a1.channels = c1 
a1.sources.r1.type = netcat 
a1.sources.r1.bind =0.0.0.0 
a1.sources.r1.port = 44444

a1.sinks.k1.type = logger

a1.channels.c1.type = memory 
a1.channels.c1.capacity = 1000 
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1 
a1.sinks.k1.channel = c1


Run:
bin/flume-ng agent --name a1 --conf $FLUME_HOME/conf --conf-file /opt/script/flume/simple-flume.conf \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=http \
-Dflume.monitoring.port=34343

4、Send characters to the local IP and port

Flume collects them and prints them to the console (for example, connect to port 44444 with telnet or nc and type something).

5、Flume can reload the configuration file of a running job

After updating the conf, there is no need to restart Flume.

6、Event

Event: one record of data

7、Monitoring the channels

In a browser, open http://192.168.137.131:34343/ (the local IP, plus the monitoring port of the current Flume job)

8、Event fields

Event: { headers:{} body: 68 61 68 61 0D haha. }

(1) headers: a map of optional key/value attributes (empty here)

(2) body: a byte array
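The body above is just the raw bytes of the typed text; decoding the hex by hand confirms it (a quick shell check):

```shell
# 68 61 68 61 are the ASCII bytes of "haha"; 0D is the carriage return
# appended by the telnet line ending
printf '\x68\x61\x68\x61'
# prints: haha
```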

四、Sources, channels, and sinks supported by Flume

1、Sources

Avro Source
Exec Source: tail -F xxx.log
JMS Source
Spooling Directory Source: watches a directory (subdirectories are not allowed)
Taildir Source:
NetCat Source:

2、Sinks

HDFS
logger
avro: pairs with an avro source
kafka

3、Channels

memory
file

Agent: any combination of source, channel, and sink

Collect content newly appended to a file into HDFS:
exec - memory - hdfs

A whole directory into HDFS:
spooling - memory - hdfs

File data into Kafka:
exec - memory - kafka
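For the Kafka leg, the sink section might look roughly like this, assuming Flume 1.6's KafkaSink property names; the broker address and topic are made-up placeholders:

```
# Hypothetical Kafka sink config (broker and topic are placeholders)
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.brokerList = hadoop002:9092
a1.sinks.k1.topic = flume-logs
a1.sinks.k1.channel = c1
```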

4、If you want to clean the data after collecting it

exec - memory - hdfs ==> Spark/Hive/MR ETL (cleansing) ==> HDFS <== analysis

五、Hands-on

1、Requirement: collect the contents of a given file into HDFS

Technology choice: exec - memory - hdfs
./flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file /opt/script/flume/exec-memory-hdfs.conf \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=http \
-Dflume.monitoring.port=34343

exec-memory-hdfs.conf:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
 
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/data/data.log
 
a1.sinks.k1.type = hdfs
# the NameNode address below can be found in core-site.xml
a1.sinks.k1.hdfs.path = hdfs://hadoop002:9000/data/flume/tail
a1.sinks.k1.hdfs.batchSize = 10
a1.sinks.k1.hdfs.fileType = DataStream 
a1.sinks.k1.hdfs.writeFormat = Text
 
a1.channels.c1.type = memory
 
a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1
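The "found in core-site.xml" note refers to fs.defaultFS: the host and port used in hdfs.path must match the NameNode address configured there. Roughly:

```xml
<!-- core-site.xml: hdfs.path above must use this same host:port -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hadoop002:9000</value>
</property>
```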

Success:

Check the HDFS web UI

Question:
Why do in-flight files end in .tmp while finished files start with the FlumeData prefix?
Answer:
See the official docs
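Those names are the HDFS sink's documented naming defaults, and both can be overridden in the conf; written out explicitly they would be:

```
# HDFS sink naming defaults, shown explicitly (both overridable)
# prefix of a finished file:
a1.sinks.k1.hdfs.filePrefix = FlumeData
# suffix while the file is still being written:
a1.sinks.k1.hdfs.inUseSuffix = .tmp
```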

Problem:

The files are tiny while the allocated blocksize is large; too many of these small files will inevitably eat into the NameNode's memory.
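One common mitigation is to roll files less aggressively via the HDFS sink's roll settings; the values below are illustrative, not recommendations:

```
# Roll a file every 10 minutes or at ~128 MB, whichever comes first;
# 0 disables event-count-based rolling
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
```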

2、Requirement

Requirement: collect the contents of a given directory to the console

Choice: spooling - memory - logger

spooling-memory-logger.conf:

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = spooldir
a1.sources.r1.channels = c1
a1.sources.r1.spoolDir = /opt/tmp/flume
a1.sources.r1.fileHeader = true

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

a1.channels.c1.type = memory

Command:

./flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file /opt/script/flume/spooling-memory-logger.conf \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=http \
-Dflume.monitoring.port=34343

问题:

1、If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing.

2、If a file name is reused at a later time, Flume will print an error to its log file and stop processing.
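A common way around both restrictions is to write each file outside the spool directory and only mv it in, under a unique name, once it is complete; a throwaway shell sketch (both paths are made up for illustration):

```shell
# Stage outside the spool dir, then move the finished file in.
# Both directories are throwaway examples.
SPOOL=/tmp/flume-spool-demo
STAGE=/tmp/flume-stage-demo
mkdir -p "$SPOOL" "$STAGE"

# Unique name (nanosecond timestamp) so a name is never reused
name="app-$(date +%s%N).log"
echo "some log line" > "$STAGE/$name"

# mv within one filesystem is atomic: Flume never sees a half-written file
mv "$STAGE/$name" "$SPOOL/$name"
ls "$SPOOL"
```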

3、Requirement

Collect the contents of specified files and directories to the logger

Choice: taildir - memory - logger


taildir-memory-logger.conf:

a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /opt/tmp/position
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /opt/tmp/flume/test1/example.log
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /opt/tmp/flume/test2/.*log.*
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

a1.channels.c1.type = memory


Command:

./flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file /opt/script/flume/taildir-memory-logger.conf \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=http \
-Dflume.monitoring.port=34343

Success:

1、Typed aaa into example.log

2、Watching the directory for files matching .log

Copied a .log file into test2

Monitoring worked


Reposted from blog.csdn.net/Binbinhb/article/details/88373042