Azkaban
A workflow scheduling framework.
It addresses requirements like those of Alibaba Cloud's scheduler: jobs depend on other jobs.
Hue can drive Oozie directly with drag-and-drop configuration, but that stack is heavyweight.
Azkaban, for its part, takes some effort to use well.
Large companies usually build their own scheduling frameworks.
When downloading, skip the pre-* (pre-release) builds.
An open question: do the web server and executor server need HA in production?
[hadoop@hadoop004 software]$ wget https://github.com/azkaban/azkaban/archive/3.57.0.tar.gz
[hadoop@hadoop004 software]$ tar zxf 3.57.0.tar.gz -C ~/src
Building Azkaban 3.57.0 from source requires JDK 1.8, Gradle, and Git.
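A quick check that the JDK and Git prerequisites are in place:
[hadoop@hadoop004 software]$ java -version
[hadoop@hadoop004 software]$ git --version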
Gradle releases page: https://gradle.org/releases/
[hadoop@hadoop004 src]$ wget https://services.gradle.org/distributions/gradle-4.1-all.zip
(The wrapper in this Azkaban release defaults to Gradle 4.6; we will point it at this local 4.1 zip instead.)
[hadoop@hadoop004 src]$ ls
azkaban-3.57.0 hadoop-2.6.0-cdh5.7.0 hive-1.1.0-cdh5.7.0 spark-2.3.3.tgz spark-2.4.2.tgz
gradle-4.1-all.zip hello spark-2.3.3 spark-2.4.2
Now wire gradle-4.1-all.zip into Azkaban's Gradle wrapper so the build uses the local distribution:
[hadoop@hadoop004 src]$ cp gradle-4.1-all.zip ~/src/azkaban-3.57.0/gradle/wrapper/
[hadoop@hadoop004 src]$ cd azkaban-3.57.0/gradle/wrapper/
[hadoop@hadoop004 wrapper]$ ls
gradle-4.1-all.zip gradle-wrapper.jar gradle-wrapper.properties
[hadoop@hadoop004 wrapper]$ vim gradle-wrapper.properties
distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists
#distributionUrl=https\://services.gradle.org/distributions/gradle-4.6-all.zip
distributionUrl=gradle-4.1-all.zip
Because distributionUrl is resolved relative to gradle-wrapper.properties, the wrapper now picks up the local gradle-4.1-all.zip instead of downloading Gradle 4.6.
[hadoop@hadoop004 wrapper]$ cd ..
[hadoop@hadoop004 gradle]$ cd ..
[hadoop@hadoop004 azkaban-3.57.0]$ ls
az-core az-intellij-style.xml azkaban-web-server gradlew.bat
az-crypto az-jobsummary az-reportal LICENSE
az-examples azkaban-common build.gradle NOTICE
az-exec-util azkaban-db CONTRIBUTING.md README.md
az-flow-trigger-dependency-plugin azkaban-exec-server docs requirements.txt
az-flow-trigger-dependency-type azkaban-hadoop-security-plugin gradle settings.gradle
az-hadoop-jobtype-plugin azkaban-solo-server gradle.properties test
az-hdfs-viewer azkaban-spi gradlew tools
[hadoop@hadoop004 azkaban-3.57.0]$ ./gradlew build installDist
If the test phase is slow or flaky in your environment, skip it:
[hadoop@hadoop004 azkaban-3.57.0]$ ./gradlew build installDist -x test
Parallel execution with configuration on demand is an incubating feature.
> Task :azkaban-solo-server:compileJava
Note: /home/hadoop/src/azkaban-3.57.0/azkaban-solo-server/src/main/java/azkaban/soloserver/AzkabanSingleServer.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
> Task :azkaban-web-server:npm_install
added 39 packages in 1.02s
> Task :az-reportal:compileJava
Note: /home/hadoop/src/azkaban-3.57.0/az-reportal/src/main/java/azkaban/reportal/util/StreamProviderHDFS.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
> Task :az-hdfs-viewer:compileJava
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
> Task :az-hadoop-jobtype-plugin:compileJava
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: /home/hadoop/src/azkaban-3.57.0/az-hadoop-jobtype-plugin/src/main/java/azkaban/jobtype/HadoopSecureSparkWrapper.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
BUILD SUCCESSFUL in 20s
83 actionable tasks: 33 executed, 50 up-to-date
The build succeeded!
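Incidentally, if only the solo server is needed, the build can be scoped to that subproject (standard Gradle multi-project task syntax; an optional time-saver, not required here):
[hadoop@hadoop004 azkaban-3.57.0]$ ./gradlew :azkaban-solo-server:installDist -x test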
[hadoop@hadoop004 app]$ cd ~/src/azkaban-3.57.0/azkaban-solo-server/build/distributions/
[hadoop@hadoop004 distributions]$ cp azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz ~/software/
[hadoop@hadoop004 distributions]$ cd ~/software/
[hadoop@hadoop004 software]$ tar -zxf azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz -C ~/app/
[hadoop@hadoop004 software]$ cd ~/app/azkaban-solo-server-0.1.0-SNAPSHOT/
[hadoop@hadoop004 azkaban-solo-server-0.1.0-SNAPSHOT]$ bin/start-solo.sh
[hadoop@hadoop004 azkaban-solo-server-0.1.0-SNAPSHOT]$ jps
2323 Jps
2295 AzkabanSingleServer
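A quick sanity check of the web UI (assuming the solo server's default port, 8081):
[hadoop@hadoop004 azkaban-solo-server-0.1.0-SNAPSHOT]$ curl -I http://localhost:8081
The default account shipped in conf/azkaban-users.xml is azkaban/azkaban.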
Creating a project
The Azkaban documentation on creating flows:
https://azkaban.readthedocs.io/en/latest/createFlows.html
flow20.project, which marks the project as using the Flow 2.0 format:
azkaban-flow-version: 2.0
basic.flow:
nodes:
  - name: jobA
    type: command
    config:
      command: echo "This is an echoed text."
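Zip the two files together and upload the archive when creating the project in the web UI (the archive name is arbitrary):
[hadoop@hadoop004 ~]$ zip basic.zip flow20.project basic.flow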
Next, following the example on the official site, a flow that schedules multiple jobs.
In it, type: noop means "no operation": a job that does nothing itself and exists only to join the jobs it depends on.
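The example from the docs looks like this (jobC runs only after both jobA and jobB succeed):
nodes:
  - name: jobC
    type: noop
    # jobC depends on jobA and jobB
    dependsOn:
      - jobA
      - jobB
  - name: jobA
    type: command
    config:
      command: echo "This is an echoed text."
  - name: jobB
    type: command
    config:
      command: pwd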
Next, use Azkaban to schedule the ETL script written earlier.
First, a quick review of that script:
#!/usr/bin/env bash
# ETL pipeline: upload raw access logs, clean them with MapReduce,
# move the output into a date partition, then register it in Hive.
process_date=20180711
echo "start...."
hdfs dfs -mkdir -p /g6/hadoop/accesslog/${process_date}
hdfs dfs -put /home/hadoop/data/login.log /g6/hadoop/accesslog/${process_date}/
hadoop jar /home/hadoop/lib/g6-hadoop-1.0.jar com.ruozedata.hadoop.mapreduce.driver.LogETLDriver /g6/hadoop/accesslog/${process_date} /g6/hadoop/access/output6
hadoop fs -mkdir -p /g6/hadoop/access/clear/day=${process_date}
hadoop fs -mv /g6/hadoop/access/output6/part* /g6/hadoop/access/clear/day=${process_date}/
hive -f /home/hadoop/script/hive_sql/create.sql
echo "Success"
And the create.sql it invokes:
create external table g6_access_log4 (
cdn string,
region string,
level string,
time string,
ip string,
domain string,
url string,
traffic bigint
)
partitioned by (day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/g6/hadoop/access/clear';
alter table g6_access_log4 add if not exists partition(day='20180711');
select * from g6_access_log4 limit 10;
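To schedule this with Azkaban, wrap the script in a command job. A minimal sketch, assuming the script is saved as /home/hadoop/script/etl.sh (the path is illustrative; adjust it to where the script actually lives):
etl.flow:
nodes:
  - name: etl
    type: command
    config:
      command: sh /home/hadoop/script/etl.sh
Then zip etl.flow together with flow20.project, upload it to a project, and execute the flow:
[hadoop@hadoop004 ~]$ zip etl.zip flow20.project etl.flow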
It ran successfully!