[SPARK-4346][SPARK-3596][YARN] Commonize the monitor logic #5305
Sephiroth-Lin
commented
Apr 1, 2015
- YarnClientSchedulerBackend.asyncMonitorApplication now uses Client.monitorApplication, so the monitor logic is shared.
- Support changing the YARN client monitor interval; see [SPARK-3596][YARN]Support changing the yarn client monitor interval #5292.
- For more details, see the discussion on [SPARK-4282][YARN] Stopping flag in YarnClientSchedulerBackend should be volatile #3143.
Jenkins, add to whitelist
ok to test
Test build #29536 has finished for PR 5305 at commit
Jenkins, retest please
@srowen the unit tests failed while running a Python app in yarn-cluster mode. I don't think this PR caused the failure; please ask Jenkins to retest. Thank you.
Jenkins, retest this please.
Test build #29604 has finished for PR 5305 at commit
CC a few people who have touched this bit of the code: @kasjain @witgo @andrewor14
The process doesn't look the same. ApplicationNotFoundException is not handled.
Yes, we can just add it in Client#monitorApplication as well
Hi @Sephiroth-Lin this looks good for the most part. The only reason why I didn't merge these code paths initially is because the loop here checks on Once you address @witgo's comment I will merge this thanks.
Test build #29643 has finished for PR 5305 at commit
Test build #29644 has finished for PR 5305 at commit
You probably don't need to wrap this entire block in the try. You could, for instance, do the following instead:

    val report: ApplicationReport =
      try {
        getApplicationReport(appId)
      } catch {
        case e: ApplicationNotFoundException =>
          return (..., ...)
      }
    val state = report.getYarnApplicationState
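The suggestion above, narrowing the try to the one call that can throw, can be tried out without YARN on the classpath. The following is only a sketch: `AppNotFound`, `AppState`, `NotFound`, `getReport`, and `pollOnce` are hypothetical stand-ins, not the Hadoop or Spark types, and the value returned from the catch branch is an assumption made for illustration.

```scala
// Hypothetical stand-ins so the sketch runs without Hadoop/YARN on the classpath.
final class AppNotFound(msg: String) extends Exception(msg)

sealed trait AppState
case object Running extends AppState
case object Finished extends AppState
case object NotFound extends AppState // assumption: illustrative result for a missing app

object NarrowTrySketch {
  // Stand-in for the report lookup; `known` plays the role of the cluster's view.
  def getReport(appId: String, known: Map[String, AppState]): AppState =
    known.getOrElse(appId, throw new AppNotFound(s"Application $appId not found"))

  // Only the lookup sits inside the try; later logic works on a plain value.
  def pollOnce(appId: String, known: Map[String, AppState]): AppState = {
    val state: AppState =
      try {
        getReport(appId, known)
      } catch {
        case _: AppNotFound =>
          // Early return from the enclosing method, as in the suggestion above.
          return NotFound
      }
    // Anything placed here runs only when the lookup succeeded.
    state
  }
}
```

Keeping the try this small means an exception thrown by the later state handling is not mistaken for a missing application.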
Done. Thank you!!!
stop() can be called from other areas of the code (like SparkContext.stop()). Now that the loop isn't checking for it we wouldn't interrupt this thread if that happens and I think we need to handle that case. See the discussions on #3143
Good point. We might need to store this in a private var monitorThread (or something similar) and interrupt it in stop(). (@Sephiroth-Lin, I would make this method return the thread and set monitorThread = asyncMonitorApplication() in start().)
Right. We need to interrupt the thread in stop().
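A minimal sketch of the shape discussed in this thread: the monitor method returns its thread, start() stores it, and stop() interrupts it. `BackendSketch` and `awaitExit` are hypothetical names, and the sleep loop stands in for the real polling; this is not Spark's actual YarnClientSchedulerBackend.

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

// Hypothetical stand-in for a scheduler backend; not Spark's actual class.
class BackendSketch {
  @volatile private var monitorThread: Thread = null
  private val exited = new CountDownLatch(1)

  // Builds the monitor thread and returns it instead of starting it in place.
  private def asyncMonitorApplication(): Thread = {
    val t = new Thread("monitor-sketch") {
      override def run(): Unit = {
        try {
          while (true) Thread.sleep(50) // stand-in for polling the application state
        } catch {
          case _: InterruptedException => // interrupted by stop(); leave the loop
        } finally {
          exited.countDown()
        }
      }
    }
    t.setDaemon(true)
    t
  }

  def start(): Unit = {
    monitorThread = asyncMonitorApplication()
    monitorThread.start()
  }

  // May be called from elsewhere (for example, a context shutdown path).
  def stop(): Unit = {
    if (monitorThread != null) monitorThread.interrupt()
  }

  // Test helper (hypothetical): true if the monitor thread exited within the timeout.
  def awaitExit(timeoutMs: Long): Boolean =
    exited.await(timeoutMs, TimeUnit.MILLISECONDS)
}
```

Holding the thread in a field is what lets stop() reach it once the loop no longer checks a stopping flag itself.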
Conflicts: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
Test build #29827 has finished for PR 5305 at commit
Nice, LGTM. Merging into master, thanks @Sephiroth-Lin @tgravescs @witgo
…tate monitor thread been interrupted
In PR #5305 we interrupt the monitor thread but forgot to catch the InterruptedException, so its stack trace is printed in the log. We need to catch it.
Author: linweizhong <linweizhong@huawei.com>
Closes #5479 from Sephiroth-Lin/SPARK-6870 and squashes the following commits:
f775f93 [linweizhong] Update, don't need to call Thread.currentThread() on monitor thread
0e2ef1f [linweizhong] Update
0d8958a [linweizhong] Update
3513fdb [linweizhong] Catch InterruptedException
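The follow-up fix described above can be illustrated in isolation. In this sketch, `InterruptSketch` and `escapedOnInterrupt` are hypothetical names; an uncaught-exception handler records whether the InterruptedException escaped the thread body, which is the situation in which the JVM would otherwise print the stack trace.

```scala
import java.util.concurrent.atomic.AtomicBoolean

object InterruptSketch {
  // Starts a sleeping thread, interrupts it, and reports whether the
  // InterruptedException escaped to the thread's uncaught-exception handler.
  def escapedOnInterrupt(catchInterrupt: Boolean): Boolean = {
    val escaped = new AtomicBoolean(false)
    val t = new Thread(new Runnable {
      override def run(): Unit = {
        if (catchInterrupt) {
          try Thread.sleep(60000)
          catch { case _: InterruptedException => () } // expected on stop; exit quietly
        } else {
          Thread.sleep(60000) // the exception propagates out of run()
        }
      }
    })
    t.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      override def uncaughtException(th: Thread, e: Throwable): Unit = escaped.set(true)
    })
    t.setDaemon(true)
    t.start()
    t.interrupt()
    t.join(5000)
    escaped.get()
  }
}
```

Calling `InterruptSketch.escapedOnInterrupt(false)` reproduces the unhandled case, while passing `true` shows the interrupt being absorbed inside the thread.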