Spark standalone mode error: ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM

Solution:

Add the following to Spark's configuration file spark-defaults.conf:

spark.dynamicAllocation.minExecutors 3

Since my environment has only three machines, I set the value to 3.
I searched for a long time. Online explanations varied (insufficient memory was a common suggestion), but changing the memory settings had no effect. I eventually found the solution above in this article: https://dev.sobeslavsky.net/apache-spark-sigterm-mystery-with-dynamic-allocation/
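Note that spark.dynamicAllocation.minExecutors only takes effect when dynamic allocation is actually enabled, and in standalone mode dynamic allocation also requires the external shuffle service on each worker. A minimal spark-defaults.conf sketch (property names are from the official Spark configuration reference; the value 3 matches the three-machine cluster described above):

```
# Dynamic allocation must be enabled for minExecutors to have any effect.
spark.dynamicAllocation.enabled        true

# In standalone mode, dynamic allocation needs the external shuffle service
# running on every worker so shuffle files survive executor removal.
spark.shuffle.service.enabled          true

# Never let the application scale down to zero executors.
spark.dynamicAllocation.minExecutors   3
```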


The original article is reproduced below:

APACHE SPARK: SIGTERM MYSTERY WITH DYNAMIC ALLOCATION

Managing jobs in Apache Spark is a challenge, and it becomes an even bigger challenge when your executors get stopped without any meaningful error message.

I was running a cluster with eight nodes and a computation job which was supposed to run for around 10 hours. However, after just six hours the application got stuck, with eight active workers but no tasks running. It seemed like the application had just become lazy and didn't want to do any work anymore. Here are the steps I followed to debug the issue:

Step 1. The cause

The only message related to the issue was the one I found in Spark worker log:
ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
After some digging I found that the problem was most likely caused by the executor allocating too much memory and getting terminated by something. Everything I could find on Google about SIGTERM issues in Spark concerned the YARN resource manager terminating executors. As I was running Spark in standalone mode, my guess is that it must have been either the Spark resource manager or the operating system doing this.

What was more surprising was that Spark was not trying to allocate new executors after the existing ones failed. From the Spark documentation it seems that dynamic resource allocation is supposed to do exactly that, but in this case it was not.

Step 2. The executor story

After further investigation it turned out that dynamic allocation was indeed trying to allocate a new executor, but each new executor was terminated shortly after being created. The following was happening:

An executor which had been executing tasks for several hours received SIGTERM and ended.
A new executor was created, but received SIGTERM.
Another new executor was created, but also received SIGTERM.

Step 3. The master mystery

Reading the Master logs, I noticed a kind of erratic behavior of the application with respect to executors:

17/04/18 22:11:44 INFO Master: Application app-20170418163116-0001 requested to set total executors to 7.
17/04/18 22:11:47 INFO Master: Application app-20170418163116-0001 requested to set total executors to 6.
17/04/18 22:11:49 INFO Master: Application app-20170418163116-0001 requested to set total executors to 5.
17/04/18 22:11:51 INFO Master: Application app-20170418163116-0001 requested to set total executors to 4.
17/04/18 22:11:53 INFO Master: Application app-20170418163116-0001 requested to set total executors to 3.
17/04/18 22:11:55 INFO Master: Application app-20170418163116-0001 requested to set total executors to 2.
17/04/18 22:11:57 INFO Master: Application app-20170418163116-0001 requested to set total executors to 1.
17/04/18 22:12:09 INFO Master: Application app-20170418163116-0001 requested to set total executors to 0.
17/04/18 22:12:16 INFO Master: Application app-20170418163116-0001 requested to set total executors to 9.
17/04/18 22:12:16 INFO Master: Application app-20170418163116-0001 requested to set total executors to 10.
17/04/18 22:12:16 INFO Master: Application app-20170418163116-0001 requested to set total executors to 12.
17/04/18 22:12:16 INFO Master: Application app-20170418163116-0001 requested to set total executors to 16.
17/04/18 22:12:16 INFO Master: Application app-20170418163116-0001 requested to set total executors to 24.
17/04/18 22:12:16 INFO Master: Application app-20170418163116-0001 requested to set total executors to 25.
17/04/18 22:12:30 INFO Master: Application app-20170418163116-0001 requested to set total executors to 24.
And then again from 24 down to 0. The second time, however, the application remained with 0 executors, not doing any work at all.

The mystery is: Why would an application with a full queue of tasks give up all the executors?
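The scale-down requests in the log above are dynamic allocation's normal behavior: executors that sit idle past a timeout are released, and the application re-requests capacity when tasks back up. The knobs controlling this cycle (real Spark properties; the values shown are Spark's documented defaults) look like this in spark-defaults.conf:

```
# Release an executor after it has been idle this long (default 60s).
spark.dynamicAllocation.executorIdleTimeout        60s

# Idle executors holding cached data are kept indefinitely by default.
spark.dynamicAllocation.cachedExecutorIdleTimeout  infinity

# Request new executors once tasks have been pending this long (default 1s).
spark.dynamicAllocation.schedulerBacklogTimeout    1s
```

The pathological case described here is when this cycle drives the total to zero while work remains queued, which is exactly what the minExecutors floor prevents.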

Step 4. The solution

The solution was to make the application never give up all its executors. Fortunately, this can be done easily by setting spark.dynamicAllocation.minExecutors to a reasonable value (16 in this case). It prevents the application from terminating all the executors and going to sleep, thinking there is no more work to be done.
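In spark-defaults.conf the fix amounts to pinning a floor (and, optionally, a ceiling) on the executor count; the property names are standard Spark settings, and 16 is the value the article chose for its eight-node cluster:

```
# Floor: never scale below 16 executors, even when all of them are idle.
spark.dynamicAllocation.minExecutors   16

# Optional ceiling, to keep dynamic allocation from over-requesting
# (the default is unlimited).
spark.dynamicAllocation.maxExecutors   25
```

The same settings can also be passed per job via spark-submit with --conf, e.g. --conf spark.dynamicAllocation.minExecutors=16.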


Reposted from blog.csdn.net/lzufeng/article/details/83418056