10: WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set (a worked fix)

1: Symptom:

When submitting a job with Spark on YARN, the following messages appear:

WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

INFO yarn.Client: Uploading resource file:/tmp/spark-27a2d9ca-106c-4f4a-baff-c96ef5081c51/__spark_libs__808575299793112451.zip -> hdfs://weizhonggui/user/hadoop/.sparkStaging/application_1543886353459_0001/__spark_libs__808575299793112451.zip
18/12/24 23:32:36 INFO yarn.Client: Uploading resource file:/tmp/spark-27a2d9ca-106c-4f4a-baff-c96ef5081c51/__spark_conf__4031622468796062240.zip -> hdfs://weizhonggui/user/hadoop/.sparkStaging/application_1543886353459_0001/spark_conf.zip


2: Cause analysis:

The Spark documentation at https://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties explains:

To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.

Looking further at the specific Spark properties:
spark.yarn.jars (default: none): List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.

spark.yarn.archive (default: none): An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars and the archive is used in all the application's containers. The archive should contain jar files in its root directory. Like with the previous option, the archive can also be hosted on HDFS to speed up file distribution.

In other words, by default Spark on YARN uses the Spark jars installed locally (in the Spark installation directory), but those jars can also sit in any world-readable location on HDFS. That way YARN can cache them on the nodes, and they no longer need to be uploaded each time an application runs.
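Either property could be set in conf/spark-defaults.conf; the following fragment is only an illustrative sketch (the hdfs:///some/path locations are placeholders, not paths from this article):

```properties
# Option A: point at individual Spark jars on HDFS (globs are allowed)
spark.yarn.jars      hdfs:///some/path/*.jar

# Option B: a single archive of the jars; if set, it takes
# precedence over spark.yarn.jars
spark.yarn.archive   hdfs:///some/path/spark-libs.jar
```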

3: Fix:

3.1. Create the archive: jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
3.2. Create the target directory on HDFS: hdfs dfs -mkdir -p /system/SparkJars/jar
then upload the archive: hdfs dfs -put spark-libs.jar /system/SparkJars/jar
3.3. In spark-defaults.conf, set spark.yarn.archive=hdfs:///system/SparkJars/jar/spark-libs.jar

4: Summary:

This is one tuning technique for Spark on YARN: it saves the time otherwise spent uploading the Spark jars to HDFS on every submission, since YARN can cache the archive on each node. You can confirm the effect in your own submission logs.


Reposted from blog.csdn.net/weizhonggui/article/details/85240804