[YARN][SPARK-4929] Bug fix: fix the yarn-client code to support HA #3771
Conversation
Test build #24726 has finished for PR 3771 at commit
@andrewor14 can you look into this problem? Thanks.
Also, /cc @tgravescs, another one of our YARN maintainers.
@tgravescs can you have a look at this problem?
Can you please be a bit more specific and detail exactly what happens here? Are you referring to when the RM has to fail over, or during a rolling upgrade? Is the container brought down and then back up again? Please describe the scenario and what exactly is happening.
What @tgravescs says is close to the scenario, but it happens during RM recovery after the RM broke down.

```scala
if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) {
  unregister(finalStatus, finalMsg)
  cleanupStagingDir(fs)
}
```

In this code, the `isLastAttempt` check is never reached in client mode, because `finalStatus` already defaults to `SUCCEEDED` and the first clause is always true.
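To make the failure mode concrete, here is a minimal runnable sketch (not the actual Spark source; the values for `finalStatus` and `isLastAttempt` are assumptions illustrating the client-mode failover case) of why the shutdown hook unregisters a client-mode AM even when more attempts remain:

```scala
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus

// Sketch: in client mode the SparkContext lives in the driver process,
// so nothing running inside the AM ever overwrites the default status.
val finalStatus = FinalApplicationStatus.SUCCEEDED // the problematic default
val isLastAttempt = false                          // RM failover: retries remain

// The shutdown-hook check from the snippet above: the first clause is
// already true, so the AM unregisters and deletes its staging dir, and
// YARN treats the app as finished instead of starting another attempt.
if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) {
  println("unregister + cleanupStagingDir -> no further attempts")
}
```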
@SaintBacchus so I'm still a bit unclear on the exact scenario. I just want to make sure we are handling everything properly, so I want to be sure I understand fully. So this is when the RM goes down and is being brought back up, or fails over to a standby. At that point it restarts the applications to start a new attempt. The shutdown hook is run, and the code you mention above runs and unregisters. I understand client mode can't set it because the SparkContext is not in the same process. The thing that is unclear to me is how cluster mode sets the finalStatus to something other than SUCCEEDED. Is the SparkContext being signalled and then throwing an exception, so that startUserClass catches it and marks it as failed?
I assume we are hitting the logic on line 108 above, in `if (!finished) {`... I think that comment and code are based on the final status defaulting to success. At the very least we should update that comment to explain what is going to happen in client vs cluster mode. Since the DisassociatedEvent exits with success for client mode, I think making the default undefined for client mode is fine.
@tgravescs your comment is much clearer than what I said; I have used it instead of mine. Thanks. When a YARN HA event happens, the previous ApplicationMaster will throw an exception, so in yarn-cluster mode the AM catches the exception and changes the final status. But in yarn-client mode it goes directly into the shutdown hook, which causes the problem.
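A rough sketch of the two paths described above (the method names here are hypothetical; the real control flow lives in Spark's `ApplicationMaster`, which this does not reproduce):

```scala
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus

var finalStatus = FinalApplicationStatus.SUCCEEDED

// Cluster mode: the user class runs inside the AM process, so an HA
// failover surfaces as an exception here and the status gets corrected
// before the shutdown hook runs.
def runDriverInCluster(userClass: () => Unit): Unit = {
  try {
    userClass()
  } catch {
    case _: Throwable =>
      finalStatus = FinalApplicationStatus.FAILED // RM sees a failed attempt and retries
  }
}

// Client mode: the driver is a separate process; the AM only observes a
// disconnect and falls straight through to the shutdown hook, still
// carrying the default SUCCEEDED status.
def onDriverDisassociated(): Unit = {
  sys.exit(0) // shutdown hook runs with finalStatus == SUCCEEDED
}
```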
Test build #25021 has finished for PR 3771 at commit
Test build #25154 has finished for PR 3771 at commit
Thanks @SaintBacchus, the changes look good.
Currently, yarn-client mode exits directly when an HA failover happens, no matter how many times the AM should retry. The reason is that the default final status only considered the `sys.exit` path, so yarn-client HA can't benefit from AM retries. We should therefore distinguish the default final status between client and cluster mode: a SUCCEEDED default may break HA retries in client mode, while an UNDEFINED default may trigger the error reporter in cluster mode when `sys.exit` is used.

Author: huangzhaowei &lt;carlmartinmax@gmail.com&gt;

Closes #3771 from SaintBacchus/YarnHA and squashes the following commits:

c02bfcc [huangzhaowei] Improve the comment of the function 'getDefaultFinalStatus'
0e69924 [huangzhaowei] Bug fix: fix the yarn-client code to support HA

(cherry picked from commit 5fde661)
Signed-off-by: Thomas Graves &lt;tgraves@apache.org&gt;
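The squashed commits mention a `getDefaultFinalStatus` helper; a minimal sketch of what such a helper plausibly looks like, following the discussion above (the exact signature and surrounding code in the merged patch may differ):

```scala
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus

// Sketch of the fix: pick the default final status by deploy mode
// instead of hard-coding SUCCEEDED.
// Cluster mode: sys.exit(0) in the user class should still report
// success. Client mode: an unannounced exit (e.g. during an RM
// failover) must stay UNDEFINED so YARN is free to run another attempt.
def getDefaultFinalStatus(isClusterMode: Boolean): FinalApplicationStatus = {
  if (isClusterMode) {
    FinalApplicationStatus.SUCCEEDED
  } else {
    FinalApplicationStatus.UNDEFINED
  }
}
```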