1. Problem Description
Today an offline Spark task scheduled by Azkaban failed. The error log is as follows:
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO yarn.Client: Application report for application_1640678855326_133429 (state: ACCEPTED)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO yarn.Client: Application report for application_1640678855326_133429 (state: RUNNING)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO yarn.Client:
08-02-2022 07:09:32 CST DailyReport2Excel INFO - client token: N/A
08-02-2022 07:09:32 CST DailyReport2Excel INFO - diagnostics: N/A
08-02-2022 07:09:32 CST DailyReport2Excel INFO - ApplicationMaster host: 111.111.111.131
08-02-2022 07:09:32 CST DailyReport2Excel INFO - ApplicationMaster RPC port: 0
08-02-2022 07:09:32 CST DailyReport2Excel INFO - queue: root.users.hdfs
08-02-2022 07:09:32 CST DailyReport2Excel INFO - start time: 1644275223892
08-02-2022 07:09:32 CST DailyReport2Excel INFO - final status: FAILED
08-02-2022 07:09:32 CST DailyReport2Excel INFO - tracking URL: http://nn1.my-cdh.com:8088/proxy/application_1640678855326_133429/
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO yarn.Client: Application report for application_1640678855326_133429 (state: FINISHED)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO yarn.Client:
08-02-2022 07:09:32 CST DailyReport2Excel INFO - client token: N/A
08-02-2022 07:09:32 CST DailyReport2Excel INFO - diagnostics: User class threw exception: java.io.FileNotFoundException: /data/reports/xxx平台xx报表.xls (Permission denied)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at java.io.FileOutputStream.open0(Native Method)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at java.io.FileOutputStream.open(FileOutputStream.java:270)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at com.david.report.DailyReport2Excel$.do_business(DailyReport2Excel.scala:409)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at com.david.report.DailyReport2Excel$$anonfun$main$1.apply$mcVI$sp(DailyReport2Excel.scala:56)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at com.david.report.DailyReport2Excel$.main(DailyReport2Excel.scala:49)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at com.david.report.DailyReport2Excel.main(DailyReport2Excel.scala)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at java.lang.reflect.Method.invoke(Method.java:497)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:688)
08-02-2022 07:09:32 CST DailyReport2Excel INFO -
08-02-2022 07:09:32 CST DailyReport2Excel INFO - ApplicationMaster host: 111.111.111.131
08-02-2022 07:09:32 CST DailyReport2Excel INFO - ApplicationMaster RPC port: 0
08-02-2022 07:09:32 CST DailyReport2Excel INFO - queue: root.users.hdfs
08-02-2022 07:09:32 CST DailyReport2Excel INFO - start time: 1644275223892
08-02-2022 07:09:32 CST DailyReport2Excel INFO - final status: FAILED
08-02-2022 07:09:32 CST DailyReport2Excel INFO - tracking URL: http://nn1.my-cdh.com:8088/proxy/application_1640678855326_133429/
08-02-2022 07:09:32 CST DailyReport2Excel INFO - user: hdfs
08-02-2022 07:09:32 CST DailyReport2Excel INFO - Exception in thread "main" org.apache.spark.SparkException: Application application_1640678855326_133429 finished with failed status
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at org.apache.spark.deploy.yarn.Client.run(Client.scala:1153)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1568)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO util.ShutdownHookManager: Shutdown hook called
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-0f892c24-67c2-424a-a4fd-f24853d3eef3
08-02-2022 07:09:32 CST DailyReport2Excel INFO - 22/02/08 07:09:32 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-07eb9a13-8f22-4a6a-aac1-b5daa87980cd
08-02-2022 07:09:32 CST DailyReport2Excel INFO - Process completed unsuccessfully in 152 seconds.
08-02-2022 07:09:32 CST DailyReport2Excel ERROR - Job run failed!
java.lang.RuntimeException: azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1
at azkaban.jobExecutor.ProcessJob.run(ProcessJob.java:304)
at azkaban.execapp.JobRunner.runJob(JobRunner.java:786)
at azkaban.execapp.JobRunner.doRun(JobRunner.java:601)
at azkaban.execapp.JobRunner.run(JobRunner.java:562)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1
at azkaban.jobExecutor.utils.process.AzkabanProcess.run(AzkabanProcess.java:125)
at azkaban.jobExecutor.ProcessJob.run(ProcessJob.java:296)
... 8 more
08-02-2022 07:09:32 CST DailyReport2Excel ERROR - azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1 cause: azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1
08-02-2022 07:09:32 CST DailyReport2Excel INFO - Finishing job DailyReport2Excel at 1644275372628 with status FAILED
2. Investigation
A review of recent changes to the cluster turned up the following facts:
① The Spark task runs as user hdfs; the log above shows its queue as "queue: root.users.hdfs";
② Three new nodes were recently added to the CDH cluster, and the server 111.111.111.131 shown in the Azkaban log is one of them;
③ Historical Azkaban execution logs show that every Spark-on-YARN task that YARN schedules onto these three servers fails with the same error;
④ This Spark task first writes its result to the local Linux /data directory and then uploads it to HDFS.
3. Hypothesis
On the three newly added nodes, when YARN schedules the Spark task there and the job writes its result to the local Linux /data directory, the hdfs user lacks write (and execute) permission on that directory, so the write fails.
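If this hypothesis holds, the failure should be reproducible without Spark at all: on a new node, running a simple file-creation attempt as the hdfs user should also hit "Permission denied". The core check (directory writability) can be sketched as a standalone probe; the path below is a scratch directory standing in for the real /data, so the sketch runs anywhere without root:

```shell
# Minimal writability probe: try to create and delete a file in the
# target directory. /tmp/data_check stands in for the real /data here.
dir=/tmp/data_check
mkdir -p "$dir"
if touch "$dir/.write_test" 2>/dev/null; then
    rm -f "$dir/.write_test"
    echo "writable: $dir"
else
    echo "NOT writable: $dir"
fi
```

On the cluster itself, pointing `dir` at /data and running the probe as the hdfs user (e.g. via `sudo -u hdfs sh probe.sh`) would confirm or refute the hypothesis before any configuration is touched.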
4. Verification
4.1 On an existing cluster node
Take node cdh01 as an example:
[root@cdh01 ~]# id hdfs
uid=996(hdfs) gid=993(hdfs) groups=993(hdfs),0(root)
[root@cdh01 ~]# id hadoop
id: hadoop: no such user
4.2 On a newly added node
Take node cdh31 as an example:
[root@cdh31 ~]# id hdfs
uid=994(hdfs) gid=991(hdfs) groups=991(hdfs),993(hadoop)
[root@cdh31 ~]# id hadoop
id: hadoop: no such user
[root@cdh31 ~]# groups hdfs
hdfs : hdfs hadoop
As the output shows, on the new nodes the hdfs user is not a member of the root group, so it cannot write files into the /data directory.
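The membership comparison above can be turned into a one-shot check per node. A sketch (the user and group names are the ones from this cluster; adjust for other environments):

```shell
# Check whether a user's supplementary groups include a given group.
# User/group names follow this cluster's setup; adjust as needed.
user=hdfs
group=root
if id -nG "$user" 2>/dev/null | tr ' ' '\n' | grep -qx "$group"; then
    echo "$user is in $group"
else
    echo "$user is NOT in $group"
fi
```

Run across all nodes, this immediately separates the correctly configured machines (like cdh01) from the misconfigured new ones (like cdh31).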
5. Solution
5.1 Append the hdfs user to the root group
[root@cdh31 ~]# usermod -a -G root hdfs
[root@cdh31 ~]# id hdfs
uid=994(hdfs) gid=991(hdfs) groups=991(hdfs),0(root),993(hadoop)
5.2 Remove the hdfs user from the hadoop group
[root@cdh31 ~]# gpasswd -d hdfs hadoop
Removing user hdfs from group hadoop
[root@cdh31 ~]# id hdfs
uid=994(hdfs) gid=991(hdfs) groups=991(hdfs),0(root)
5.3 Confirm the updated user-to-group mapping for hdfs
[root@cdh31 ~]# id hdfs
uid=994(hdfs) gid=991(hdfs) groups=991(hdfs),0(root)
[root@cdh31 data]# groups hdfs
hdfs : hdfs root
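Since the same two commands from 5.1 and 5.2 have to run on all three new nodes, they can be wrapped in a loop. A sketch, with illustrative host names (only cdh31 is confirmed from this incident) and a dry-run default; set RUN=ssh to execute for real, which requires root on each host:

```shell
# Apply the group fix to every new node. Defaults to printing the
# commands instead of running them; RUN=ssh executes them for real.
RUN=${RUN:-echo}
for host in cdh31 cdh32 cdh33; do
    $RUN "root@$host" "usermod -a -G root hdfs && gpasswd -d hdfs hadoop && id hdfs"
done
```

The trailing `id hdfs` echoes each node's final membership, so a single pass both applies and verifies the change.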
6. Further Notes
6.1 Appending a user to a group
To add a user to a group, never run the bare form:
usermod -G groupA user
This replaces the user's entire supplementary group list, so the user is removed from every other group and left only in groupA.
Use the -a option instead:
usermod -a -G groupA user
(On old systems whose usermod lacks -a, list every group explicitly: usermod -G groupA,groupB,groupC user)
-a stands for append: the user is added to groupA without leaving any other group.
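Because a bare `usermod -G` silently discards the old membership list, it is worth snapshotting a user's groups before editing them; the snapshot yields the exact command needed to undo a mistake. A small sketch (it probes the current user by way of example):

```shell
# Record a user's current supplementary groups and print the usermod
# command that would restore them if a later `-G` (without -a) wipes them.
user=${USER:-root}
before=$(id -nG "$user" | tr ' ' ',')
echo "restore with: usermod -G $before $user"
```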
All options of the command and their meanings:
Options:
  -c, --comment COMMENT         new value of the GECOS field
  -d, --home HOME_DIR           new home directory for the user account
  -e, --expiredate EXPIRE_DATE  set account expiration date to EXPIRE_DATE
  -f, --inactive INACTIVE       set password inactive after expiration
                                to INACTIVE
  -g, --gid GROUP               force use GROUP as new primary group
  -G, --groups GROUPS           new list of supplementary GROUPS
  -a, --append                  append the user to the supplemental GROUPS
                                mentioned by the -G option without removing
                                him/her from other groups
  -h, --help                    display this help message and exit
  -l, --login NEW_LOGIN         new value of the login name
  -L, --lock                    lock the user account
  -m, --move-home               move contents of the home directory to the new
                                location (use only with -d)
  -o, --non-unique              allow using duplicate (non-unique) UID
  -p, --password PASSWORD       use encrypted password for the new password
  -s, --shell SHELL             new login shell for the user account
  -u, --uid UID                 new UID for the user account
  -U, --unlock                  unlock the user account
To list the groups a user belongs to:
$ groups user
or inspect the group database directly:
$ cat /etc/group
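Besides `groups` and reading /etc/group by hand, `getent` resolves a single group entry, including members coming from non-local sources such as LDAP or NIS:

```shell
# Two ways to inspect one group's entry (root is used as the example
# because it exists on every system).
getent group root                    # name:passwd:gid:member-list
awk -F: '$1 == "root"' /etc/group    # same record, from the local file only
```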
6.2 Removing a user from a group
gpasswd -d userName groupName
Note: on Debian-family systems, deluser can also do this, but watch the arguments: deluser USER with no group argument deletes the account itself, which is not what we want here. The two-argument form only removes the user from the group:
deluser USER GROUP
e.g. deluser mike students
Common options:
--quiet | -q      do not send process information to stdout
--help | -h       show help
--version | -v    show version and copyright
--conf | -c FILE  use FILE as the configuration file