Compatibility with Hadoop and Hive
Spark 3.0 officially requires at least Hadoop 2.7 and Hive 1.2 by default. Our platform runs CDH 5.13, which ships hadoop-2.6.0 and hive-1.1.0, so we set out to build Spark 3.0 ourselves.
Build environment: Maven 3.6.3, Java 8, Scala 2.12
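As a quick sanity check, the toolchain can be verified before building (a minimal sketch; the expected output is only indicative):
# Confirm the build toolchain
mvn -version    # should report Maven 3.6.3 running on a Java 8 JDK
java -version   # should report a 1.8.x JDK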
Pre-building the Hive version
Hive 1.1.0 is so old that many of its dependencies are incompatible with Spark 3.0, so it has to be recompiled. In particular, the bundled commons-lang3 is too old and lacks support for Java 9 and above.
Building the hive-exec module: mvn clean install -DskipTests -pl ql -am -Phadoop-2
Code changes:
diff --git a/pom.xml b/pom.xml
index 5d14dc4..889b960 100644
--- a/pom.xml
+++ b/pom.xml
@@ -248,6 +248,14 @@
<enabled>false</enabled>
</snapshots>
</repository>
+ <repository>
+ <id>spring</id>
+ <name>Spring repo</name>
+ <url>https://repo.spring.io/plugins-release/</url>
+ <releases>
+ <enabled>true</enabled>
+ </releases>
+ </repository>
</repositories>
<!-- Hadoop dependency management is done at the bottom under profiles -->
@@ -982,6 +990,9 @@
<profiles>
<profile>
<id>thriftif</id>
+ <properties>
+ <thrift.home>/usr/local/opt/[email protected]</thrift.home>
+ </properties>
<build>
<plugins>
<plugin>
diff --git a/ql/pom.xml b/ql/pom.xml
index 0c5e91f..101ef11 100644
--- a/ql/pom.xml
+++ b/ql/pom.xml
@@ -736,10 +736,7 @@
<include>org.apache.hive:hive-exec</include>
<include>org.apache.hive:hive-serde</include>
<include>com.esotericsoftware.kryo:kryo</include>
- <include>com.twitter:parquet-hadoop-bundle</include>
- <include>org.apache.thrift:libthrift</include>
<include>commons-lang:commons-lang</include>
- <include>org.apache.commons:commons-lang3</include>
<include>org.jodd:jodd-core</include>
<include>org.json:json</include>
<include>org.apache.avro:avro</include>
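With the shade includes above removed, it is worth double-checking that the rebuilt hive-exec jar no longer bundles libthrift or commons-lang3 classes (a minimal check; the jar path and version suffix are assumptions and depend on the local repository and the CDH build):
# Inspect the rebuilt jar; path and version are illustrative
jar tf ~/.m2/repository/org/apache/hive/hive-exec/1.1.0/hive-exec-1.1.0.jar \
  | grep -E 'org/apache/thrift|org/apache/commons/lang3' \
  || echo 'no bundled libthrift or commons-lang3 classes'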
Building Spark
# Apache
git clone [email protected]:apache/spark.git
git checkout v3.0.0
# Leyan version: mainly carries the Spark/Hive compatibility changes
git clone [email protected]:HDP/spark.git
git checkout -b v3.0.0_cloudera origin/v3.0.0_cloudera
./dev/make-distribution.sh --name cloudera --tgz -DskipTests -Phive -Phive-thriftserver -Pyarn -Pcdhhive
# Update the local Maven repository
mvn clean install -DskipTests=true -Phive -Phive-thriftserver -Pyarn -Pcdhhive
# deploy
rm -rf /opt/spark-3.0.0-bin-cloudera
tar -zxvf spark-3.0.0-bin-cloudera.tgz
rm -rf /opt/spark-3.0.0-bin-cloudera/conf
ln -s /etc/spark3/conf /opt/spark-3.0.0-bin-cloudera/conf
cd /opt/spark/jars
zip spark-3.0.0-jars.zip ./*
HADOOP_USER_NAME=hdfs hdfs dfs -put -f spark-3.0.0-jars.zip hdfs:///deploy/config/spark-3.0.0-jars.zip
rm spark-3.0.0-jars.zip
# add config : spark.yarn.archive=hdfs:///deploy/config/spark-3.0.0-jars.zip
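For reference, the archive setting mentioned above usually lives in spark-defaults.conf under the linked conf directory (a sketch; only this one line is implied by the steps above):
# /etc/spark3/conf/spark-defaults.conf
spark.yarn.archive    hdfs:///deploy/config/spark-3.0.0-jars.zip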
Configuration
# Enable adaptive query execution
spark.sql.adaptive.enabled=true
# Write Parquet files in legacy format, otherwise Hive and other components cannot read them
spark.sql.parquet.writeLegacyFormat=true
# Stay compatible with the Spark 2 external shuffle service
spark.shuffle.useOldFetchProtocol=true
spark.sql.storeAssignmentPolicy=LEGACY
# Datasource v2, now the default, does not allow the source table and the target table to be the same; skip that check with the parameter below
#spark.sql.hive.convertInsertingPartitionedTable=false
spark.sql.sources.partitionOverwriteVerifyPath=false
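The same settings can also be supplied per job instead of in spark-defaults.conf; a minimal spark-submit sketch (the main class and application jar are placeholders):
# Per-job override; com.example.Main and app.jar are placeholders
spark-submit \
  --master yarn \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.parquet.writeLegacyFormat=true \
  --conf spark.shuffle.useOldFetchProtocol=true \
  --conf spark.sql.storeAssignmentPolicy=LEGACY \
  --class com.example.Main app.jar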
Tips
- When building with Maven, the parallel multi-CPU builds that older versions supported no longer work; they deadlock the build.
- Do not specify both package and install in the same Maven invocation, or the build phases will conflict.
- Template build command: mvn clean install -DskipTests -Phive -Phive-thriftserver -Pyarn -Pcdhhive; the modules and build targets can be customized (see the sketch after this list).
- Using Spark 3.0 still requires some hacking. The yarn module needs a small change: mvn clean install -DskipTests=true -pl resource-managers/yarn -am -Phive -Phive-thriftserver -Pyarn -Pcdhhive
- Once all the Spark 3.0 artifacts are installed locally, the above-board project can be built.
- Remove Spark 3.0's support for the newer Hive versions.
- When switching to the CDH Hive version, we found that the commons jar shaded into that Hive build is too old and had to be repackaged.
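As an example of customizing the modules, the same pattern used for the yarn module above works for any other module; sql/hive here is only an illustration:
# Build a single module plus its dependencies (module path is illustrative)
mvn clean install -DskipTests=true -pl sql/hive -am -Phive -Phive-thriftserver -Pyarn -Pcdhhive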
Troubleshooting
The dependencies of the locally built hive-exec jar need to be updated, and the shaded thrift classes left out when repackaging; otherwise reads fail with the error below:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, prd-zboffline-044.prd.leyantech.com, executor 1): java.lang.NoSuchMethodError: shaded.parquet.org.apache.thrift.EncodingUtils.setBit(BIZ)B
at org.apache.parquet.format.FileMetaData.setVersionIsSet(FileMetaData.java:349)
at org.apache.parquet.format.FileMetaData.setVersion(FileMetaData.java:335)
at org.apache.parquet.format.Util$DefaultFileMetaDataConsumer.setVersion(Util.java:122)
at org.apache.parquet.format.Util$5.consume(Util.java:161)
at org.apache.parquet.format.event.TypedConsumer$I32Consumer.read(TypedConsumer.java:78)
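One way to track this down is to scan the deployed jars for the relocated thrift class and confirm it no longer comes from the rebuilt hive-exec (a hedged diagnostic sketch; the jars directory matches the deployment above):
# Find which jar still carries the relocated thrift classes
cd /opt/spark-3.0.0-bin-cloudera/jars
for j in *.jar; do
  unzip -l "$j" | grep -q 'shaded/parquet/org/apache/thrift/EncodingUtils' && echo "$j"
done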