[YARN][SPARK-4929] Bug fix: fix the yarn-client code to support HA #3771
Conversation
Test build #24726 has finished for PR 3771 at commit
@andrewor14 can you look into this problem? Thanks.
Also, /cc @tgravescs, another one of our YARN maintainers.
@tgravescs can you have a look at this problem?
Can you please be a bit more specific and detail exactly what happens here? Are you referring to when the RM has to fail over, or during a rolling upgrade? Is the container brought down and then back up again? Please describe the scenario and what exactly is happening.
What @tgravescs says is close to the scenario, but it happens during RM recovery after the RM broke down.

```scala
if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) {
  unregister(finalStatus, finalMsg)
  cleanupStagingDir(fs)
}
```

In this code, the `isLastAttempt` check is never reached in client mode, because `finalStatus` already defaults to `SUCCEEDED` and the first clause is always true.
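To make the failure mode concrete, here is a minimal runnable sketch (not the actual Spark source; the values for `finalStatus` and `isLastAttempt` are assumptions illustrating the client-mode failover case) of why the shutdown hook unregisters a client-mode AM even when more attempts remain:

```scala
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus

// Sketch: in client mode the SparkContext lives in the driver process,
// so nothing running inside the AM ever overwrites the default status.
val finalStatus = FinalApplicationStatus.SUCCEEDED // the problematic default
val isLastAttempt = false                          // RM failover: retries remain

// The shutdown-hook check from the snippet above: the first clause is
// already true, so the AM unregisters and deletes its staging dir, and
// YARN treats the app as finished instead of starting another attempt.
if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) {
  println("unregister + cleanupStagingDir -> no further attempts")
}
```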
@SaintBacchus so I'm still a bit unclear on the exact scenario. I just want to make sure we are handling everything properly, so I want to be sure I understand fully. So this is when the RM goes down and is being brought back up, or fails over to a standby. At that point it restarts the applications to start a new attempt. The shutdown hook is run, and the code you mention above runs and unregisters. I understand client mode can't set it because the SparkContext is not in the same process. The thing that is unclear to me is how cluster mode sets the finalStatus to something other than SUCCEEDED. Is the SparkContext being signalled and then throwing an exception, so that startUserClass catches it and marks it as failed?
I assume we are hitting the logic on line 108 above, in `if (!finished) {`... I think that comment and code are based on the final status defaulting to success. At the very least we should update that comment to explain what is going to happen in client vs cluster mode. Since the DisassociatedEvent exits with success for client mode, I think making the default undefined for client mode is fine.
@tgravescs your comment is much clearer than what I said; I have used it instead of mine. Thanks. When a YARN HA event happens, the previous ApplicationMaster will throw an exception, so in yarn-cluster mode the AM catches the exception and changes the final status. But in yarn-client mode it goes directly into the shutdown hook, which causes the problem.
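A rough sketch of the two paths described above (the method names here are hypothetical; the real control flow lives in Spark's `ApplicationMaster`, which this does not reproduce):

```scala
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus

var finalStatus = FinalApplicationStatus.SUCCEEDED

// Cluster mode: the user class runs inside the AM process, so an HA
// failover surfaces as an exception here and the status gets corrected
// before the shutdown hook runs.
def runDriverInCluster(userClass: () => Unit): Unit = {
  try {
    userClass()
  } catch {
    case _: Throwable =>
      finalStatus = FinalApplicationStatus.FAILED // RM sees a failed attempt and retries
  }
}

// Client mode: the driver is a separate process; the AM only observes a
// disconnect and falls straight through to the shutdown hook, still
// carrying the default SUCCEEDED status.
def onDriverDisassociated(): Unit = {
  sys.exit(0) // shutdown hook runs with finalStatus == SUCCEEDED
}
```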
Test build #25021 has finished for PR 3771 at commit
Test build #25154 has finished for PR 3771 at commit
Thanks @SaintBacchus, the changes look good.
Currently, yarn-client mode exits directly when an HA failover happens, no matter how many times the AM should retry. The reason is that the default final status only considered the `sys.exit` path, so yarn-client HA can't benefit from AM retries. We should therefore distinguish the default final status between client and cluster mode: a SUCCEEDED default may break HA retries in client mode, while an UNDEFINED default may trigger the error reporter in cluster mode when `sys.exit` is used.

Author: huangzhaowei &lt;carlmartinmax@gmail.com&gt;

Closes #3771 from SaintBacchus/YarnHA and squashes the following commits:

c02bfcc [huangzhaowei] Improve the comment of the function 'getDefaultFinalStatus'
0e69924 [huangzhaowei] Bug fix: fix the yarn-client code to support HA

(cherry picked from commit 5fde661)
Signed-off-by: Thomas Graves &lt;tgraves@apache.org&gt;
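The squashed commits mention a `getDefaultFinalStatus` helper; a minimal sketch of what such a helper plausibly looks like, following the discussion above (the exact signature and surrounding code in the merged patch may differ):

```scala
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus

// Sketch of the fix: pick the default final status by deploy mode
// instead of hard-coding SUCCEEDED.
// Cluster mode: sys.exit(0) in the user class should still report
// success. Client mode: an unannounced exit (e.g. during an RM
// failover) must stay UNDEFINED so YARN is free to run another attempt.
def getDefaultFinalStatus(isClusterMode: Boolean): FinalApplicationStatus = {
  if (isClusterMode) {
    FinalApplicationStatus.SUCCEEDED
  } else {
    FinalApplicationStatus.UNDEFINED
  }
}
```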