-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-54087][CORE] Spark Executor launch task failed should return task killed message #52792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
725c549
2fe5144
2e58abb
27a3a46
776f58a
bbcbf19
87383d4
232c7eb
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -381,7 +381,28 @@ private[spark] class Executor( | |
| tr.kill(killMark._1, killMark._2) | ||
| killMarks.remove(taskId) | ||
| } | ||
| threadPool.execute(tr) | ||
| try { | ||
| threadPool.execute(tr) | ||
| } catch { | ||
| case t: Throwable => | ||
| try { | ||
| logError(log"Executor launch task ${MDC(TASK_NAME, taskDescription.name)} failed," + | ||
| log" reason: ${MDC(REASON, t.getMessage)}") | ||
| context.statusUpdate( | ||
| taskDescription.taskId, | ||
| TaskState.FAILED, | ||
| env.closureSerializer.newInstance().serialize(new ExceptionFailure(t, Seq.empty))) | ||
| } catch { | ||
| case oom: OutOfMemoryError => | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm always skeptical to special cases, why OOM error is special here?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Need @Ngone51 answer your question, there's no need to handle this separately for me.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
It's very likely to hit the OOM error again given that we're already in the OOM situation. Therefore, in the case of OOM, we should not expect the following operation could be always succesful. The special catch for the
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok, can we add a code comment to explain it?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. are we expecting
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
When sending RPC, if client is not created, need to create new connect task, may need create a new thread, can throw OOM since can't create thread. but always this case won't happen since client with driver should always be already created
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with @cloud-fan - special handling, especially of OOM, is not very robust.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Need to make a followup pr? cc @cloud-fan
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes please! |
||
| logError(log"Executor update launching task ${MDC(TASK_NAME, taskDescription.name)} " + | ||
| log"failed status failed, reason: ${MDC(REASON, oom.getMessage)}") | ||
| System.exit(SparkExitCode.OOM) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. any other special exit code for certain exceptions?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We only meet the case is create thread failed, then throw OutOfMemoryError |
||
| case t: Throwable => | ||
| logError(log"Executor update launching task ${MDC(TASK_NAME, taskDescription.name)} " + | ||
| log"failed status failed, reason: ${MDC(REASON, t.getMessage)}") | ||
| System.exit(-1) | ||
| } | ||
| } | ||
| if (decommissioned) { | ||
| log.error(s"Launching a task while in decommissioned state.") | ||
| } | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so any place that invokes
context.statusUpdateneeds to add try-catch to crash the executor?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Ngone51 means that if the RPC requests also fail due to OutOfMemoryError (OOM), exiting the executor can trigger corresponding task rerun to avoid getting stuck.
In my PR's scenario, this might be necessary in a special case. However, if the task is already running, I think it's unnecessary.