
Conversation

@SaintBacchus
Contributor

Currently, yarn-client mode will exit directly when an RM HA failover happens, no matter how many more attempts the AM is allowed to make.
The reason is that the default final status only considers the sys.exit path, so yarn-client mode cannot benefit from HA.
We should therefore use different default final statuses for client and cluster mode: defaulting to SUCCEEDED breaks HA recovery in client mode, while defaulting to UNDEFINED would cause a spurious error report in cluster mode when the application exits via sys.exit.

@SparkQA

SparkQA commented Dec 23, 2014

Test build #24726 has finished for PR 3771 at commit 0e69924.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SaintBacchus SaintBacchus changed the title [SPARK-4929] Bug fix: fix the yarn-client code to support HA [YARN][SPARK-4929] Bug fix: fix the yarn-client code to support HA Dec 23, 2014
@SaintBacchus
Contributor Author

@andrewor14 can you take a look at this problem? Thanks.

@JoshRosen
Contributor

Also, /cc @tgravescs, another one of our YARN maintainers.

@SaintBacchus
Contributor Author

@tgravescs can you have a look at this problem?

@tgravescs
Contributor

Can you please be a bit more specific and detail exactly what happens here? Are you referring to the RM having to fail over, or to a rolling upgrade? Is the container brought down and then back up again? Please just describe the scenario and what exactly is happening.

@SaintBacchus
Contributor Author

What @tgravescs describes is close to the scenario, but it happens while the RM is recovering after it went down.

            // In the ApplicationMaster shutdown hook: unregister only when the
            // application succeeded or this is the last allowed attempt.
            if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) {
              unregister(finalStatus, finalMsg)
              cleanupStagingDir(fs)
            }

In this code, isLastAttempt is never checked if finalStatus is FinalApplicationStatus.SUCCEEDED.
When RM recovery happens, isLastAttempt is not checked because in yarn-client mode nothing has had a chance to change finalStatus from its default. The AM goes on to unregister, and the application cannot recover itself.
So yarn-client mode cannot support RM HA right now (yarn-cluster mode is OK).
Splitting the default finalStatus into two cases, one per deploy mode, is an easy way to avoid this problem while staying compatible with the previous design.
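
For illustration, here is a minimal sketch of that split. The name getDefaultFinalStatus is taken from the commit message below; the explicit isClusterMode parameter and the wrapping object are just for the sketch, not the actual code in this PR:

    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus

    object DefaultFinalStatusSketch {
      // Sketch only: choose the default final status from the deploy mode.
      // Cluster mode keeps the old default of SUCCEEDED because the user class
      // runs inside the AM and failures are caught there; client mode defaults
      // to UNDEFINED so the AM does not unregister during an RM restart and
      // the next attempt can recover.
      def getDefaultFinalStatus(isClusterMode: Boolean): FinalApplicationStatus =
        if (isClusterMode) FinalApplicationStatus.SUCCEEDED
        else FinalApplicationStatus.UNDEFINED
    }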

@tgravescs
Contributor

@SaintBacchus so I'm still a bit unclear on the exact scenario. I just want to make sure we are handling everything properly, so I want to be certain I understand fully.

So this is when the RM goes down and is brought back up, or fails over to a standby. At that point it restarts the applications to start a new attempt. The shutdown hook is run, the code you mention above runs, and it unregisters. I understand client mode can't set it because the SparkContext is not in the same process. The thing that is unclear to me is how cluster mode sets the finalStatus to something other than SUCCEEDED. Is the SparkContext being signalled and then throwing an exception so that startUserClass catches it and marks it as failed?

Contributor

I assume we are hitting the logic on line 108 above in if (!finished) {.... I think that comment and code are based on the final status defaulting to success. At the very least we should update that comment to explain what is going to happen in client vs. cluster mode. Since the DisassociatedEvent exits with success for client mode, I think making the default UNDEFINED for client mode is fine.

@SaintBacchus
Contributor Author

@tgravescs your comment is much clearer than what I said, so I have used it in place of mine. Thanks.

When a YARN HA event happens, the previous ApplicationMaster will throw a

java.io.IOException: Failed on local exception: java.io.EOFException

So in yarn-cluster mode the AM catches the exception and changes the final status.

But in yarn-client mode the AM goes directly into the shutdown hook, which causes the problem.
I don't think it has reached the DisassociatedEvent yet, because the driver is still alive.
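
In other words, the cluster-mode path can be pictured roughly like this. This is only a sketch with illustrative names (ClusterModeSketch, runUserClass, finish), not Spark's actual API; the point is that the exception from the RM failover is turned into a non-SUCCEEDED status before the shutdown hook runs:

    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus

    // Sketch of the cluster-mode behaviour described above. The user class runs
    // inside the AM, so the IOException thrown during the RM failover is caught
    // here and the status moves off SUCCEEDED before the shutdown hook runs. In
    // client mode there is no such wrapper in the AM process, so the shutdown
    // hook only ever sees the default status.
    object ClusterModeSketch {
      @volatile var finalStatus: FinalApplicationStatus = FinalApplicationStatus.SUCCEEDED

      def finish(status: FinalApplicationStatus, msg: String): Unit = {
        finalStatus = status
        println(s"final status -> $status: $msg")
      }

      def runUserClass(userMain: () => Unit): Unit = {
        try {
          userMain()
        } catch {
          case e: Throwable =>
            finish(FinalApplicationStatus.FAILED, s"User class threw exception: $e")
        }
      }
    }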

@SparkQA

SparkQA commented Jan 4, 2015

Test build #25021 has finished for PR 3771 at commit c02bfcc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 7, 2015

Test build #25154 has finished for PR 3771 at commit c02bfcc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

thanks @SaintBacchus the changes look good.

asfgit pushed a commit that referenced this pull request Jan 7, 2015
Currently, yarn-client mode will exit directly when an RM HA failover happens, no matter how many more attempts the AM is allowed to make.
The reason is that the default final status only considers the sys.exit path, so yarn-client mode cannot benefit from HA.
We should therefore use different default final statuses for client and cluster mode: defaulting to SUCCEEDED breaks HA recovery in client mode, while defaulting to UNDEFINED would cause a spurious error report in cluster mode when the application exits via sys.exit.

Author: huangzhaowei <carlmartinmax@gmail.com>

Closes #3771 from SaintBacchus/YarnHA and squashes the following commits:

c02bfcc [huangzhaowei] Improve the comment of the function 'getDefaultFinalStatus'
0e69924 [huangzhaowei] Bug fix: fix the yarn-client code to support HA

(cherry picked from commit 5fde661)
Signed-off-by: Thomas Graves <tgraves@apache.org>
@asfgit asfgit closed this in 5fde661 Jan 7, 2015
@SaintBacchus SaintBacchus deleted the YarnHA branch December 26, 2015 06:40