[SPARK-33088][CORE] Enhance ExecutorPlugin API to include callbacks on task start and end events #29977
Conversation
@vanzin, @tgravescs, @LucaCanali tagging you all, as you seem to have worked on the latest refactor of this area of the code in #26170.
Thanks for adding this. I can definitely see use cases for this and have thought it would be nice to have something like this. I want to look at the API details more. You also have other situations where tasks are killed; luckily the TaskContext does have an isInterrupted field for that. The executor ends up making some pass/fail decisions that go to the scheduler.
@@ -332,7 +332,8 @@ private[spark] class Executor(
   class TaskRunner(
       execBackend: ExecutorBackend,
-      private val taskDescription: TaskDescription)
+      private val taskDescription: TaskDescription,
+      private val plugins: Option[PluginContainer])
Maybe we can make the `plugins` parameter optional or default to some `EmptyPluginContainer`? Suggested change:
-      private val plugins: Option[PluginContainer])
+      private val plugins: Option[PluginContainer] = None)
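A toy illustration of the trade-off being suggested (the class below is a hypothetical stand-in, not Spark's `TaskRunner`): with a default of `None`, existing call sites compile unchanged without mentioning the new parameter.

```scala
// Hypothetical stand-in showing how an Option parameter with a None default
// behaves at call sites; not the real TaskRunner signature.
class RunnerSketch(
    val taskName: String,
    val plugins: Option[String] = None) { // stand-in for Option[PluginContainer]
  def describe: String = s"$taskName plugins=${plugins.getOrElse("<none>")}"
}

val explicit = new RunnerSketch("t1", Some("tracing"))
val defaulted = new RunnerSketch("t2") // compiles without mentioning plugins
```

The convenience cuts both ways: as the follow-up comment argues, for an internal API a required parameter forces every call site to be updated deliberately.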
Same for `Task#run`.
@rshkv, what is the reason to make this default to None? This is an internal API and only called from here. It's an Option already, so people can check it easily. In some ways it's nice to force it, so you make sure all uses of it have been updated.
Are there cases where you know this is used outside Spark?
I see. For my use-case, I don't really care about the distinction of […]. Could expose instead an […].
@@ -123,8 +125,12 @@ private[spark] abstract class Task[T](
       Option(taskAttemptId),
       Option(attemptNumber)).setCurrentContext()
+    plugins.foreach(_.onTaskStart())
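The hunk above is where the new callbacks hook into the task lifecycle. As a rough, self-contained sketch of the intended flow — the trait and runner below are simplified stand-ins for illustration, not the real Spark `ExecutorPlugin`, `PluginContainer`, or `Task#run`:

```scala
// Simplified stand-ins for the PR's API; names mirror the real callbacks
// (onTaskStart / onTaskSucceeded / onTaskFailed) but everything here is
// illustrative only.
trait TaskCallbacks {
  def onTaskStart(): Unit = {}
  def onTaskSucceeded(): Unit = {}
  def onTaskFailed(reason: String): Unit = {}
}

class CountingPlugin extends TaskCallbacks {
  var starts, successes, failures = 0
  override def onTaskStart(): Unit = starts += 1
  override def onTaskSucceeded(): Unit = successes += 1
  override def onTaskFailed(reason: String): Unit = failures += 1
}

// Mirrors the shape of the change: fire onTaskStart on the task thread,
// then report the outcome; a None container is a no-op.
def runTask(body: () => Unit, plugins: Option[TaskCallbacks]): Unit = {
  plugins.foreach(_.onTaskStart())
  try {
    body()
    plugins.foreach(_.onTaskSucceeded())
  } catch {
    case e: Exception => plugins.foreach(_.onTaskFailed(e.getMessage))
  }
}

val p = new CountingPlugin
runTask(() => (), Some(p))                                   // succeeds
runTask(() => throw new RuntimeException("boom"), Some(p))   // fails
runTask(() => (), None)                                      // no plugins: no-op
```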
What is the expectation in case `onTaskStart` fails: do we want to invoke succeeded/failed?
That's what I documented on https://github.com/apache/spark/pull/29977/files#diff-6a99ec9983962323b4e0c1899134b5d6R76-R78. The argument that came to mind is that it's easy for a plugin dev to track some state in a thread-local and cleanly decide whether it wants to perform the succeeded/failed action or not.
Happy to change it if we prefer not to put this burden on the plugin owner, though.
Maybe I'm misunderstanding, but the documentation states "Exceptions thrown from this method do not propagate", and there is nothing here preventing that. I think perhaps you meant to say the user needs to make sure they don't propagate?
We catch Throwable on `ExecutorPluginContainer#onTaskStart` and siblings (see https://github.com/apache/spark/pull/29977/files#diff-5e4d939e9bb53b4be2c48d4eb53b885c162c729b9adc874f918f7701a352cdbbR157), so that's what I meant by "not propagate". I.e. if a plugin's `onTaskStart` throws, Spark will log, but won't fail the associated Spark task.
Perhaps reword it to say exceptions are ignored?
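For reference, the swallowing behaviour under discussion can be sketched like this; `SafePluginContainer` below is a hypothetical stand-in for what the container does (catch `Throwable` per plugin, log, continue), not the actual `ExecutorPluginContainer` implementation:

```scala
// Each callback is wrapped in its own try/catch, so one misbehaving plugin
// neither fails the task nor prevents the remaining plugins from running.
final class SafePluginContainer(callbacks: Seq[() => Unit]) {
  def fireOnTaskStart(): Unit = callbacks.foreach { cb =>
    try cb()
    catch {
      case t: Throwable =>
        // The real container logs the exception; the task itself proceeds.
        Console.err.println(s"Exception in plugin onTaskStart: ${t.getMessage}")
    }
  }
}

var secondPluginRan = false
val container = new SafePluginContainer(Seq(
  () => throw new RuntimeException("plugin bug"),
  () => secondPluginRan = true))
container.fireOnTaskStart() // does not throw; second plugin still runs
```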
The changes mostly look good, thanks for working on it @fsamuel-bs!
Currently we document a) Succeeded. Do we want to distinguish b.1 from b.2 explicitly via API?
That is exactly what I wanted to look at some more. I think we should definitely keep something that allows the user to tell if a task passed or failed. @mridulm, thoughts on leaving this here vs. moving the task end notification into the Executor, which would still be in the same thread? That feels like it would be more reliable, with the plugin seeing the same status of the task that the scheduler will see.
Happy to move `onTaskSucceeded`/`onTaskFailed` to Executor, shouldn't be that much work, just pending agreement from @mridulm.
Thanks @tgravescs, I did miss out on other cases where a task can fail due to Spark infra causing the task to fail (commit denied is a very good example). If it helps with catching more corner cases without SPI impls having to duplicate what Spark does, that is a great direction to take.
@mridulm @tgravescs: I've moved triggering the methods from […].
The test that failed was the barrier task context one. I re-kicked the tests; if it fails again we should look at it to ensure something here didn't break it.
ok to test
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129841 has finished for PR 29977 at commit […]
Thanks for the reviews @rshkv and @tgravescs! Thanks for the contribution @fsamuel-bs!
I added @fsamuel-bs as a contributor in JIRA and assigned it to him. Thanks.
assert(TestSparkPlugin.executorPlugin.numOnTaskStart == 2)
assert(TestSparkPlugin.executorPlugin.numOnTaskSucceeded == 0)
assert(TestSparkPlugin.executorPlugin.numOnTaskFailed == 2)
Hi, folks. It turns out that this is a flaky test. I filed a JIRA issue and made a PR.
Thanks @dongjoon-hyun
[SPARK-33088][CORE] Enhance ExecutorPlugin API to include callbacks on task start and end events

Proposing a new set of APIs for ExecutorPlugins, to provide callbacks invoked at the start and end of each task of a job. Not very opinionated on the shape of the API; tried to be as minimal as possible for now.

Changes described in detail on [SPARK-33088](https://issues.apache.org/jira/browse/SPARK-33088), but mostly this boils down to:
1. This feature was considered when the ExecutorPlugin API was initially introduced in apache#21923, but never implemented.
2. The use-case which **requires** this feature is to propagate tracing information from the driver to the executor, such that calls from the same job can all be traced.
   a. Tracing frameworks are usually set up in thread locals, therefore it's important for the setup to happen in the same thread which runs the tasks.
   b. Executors can be shared by multiple jobs, therefore it's not sufficient to set tracing information at executor startup time; it needs to happen every time a task starts or ends.

No user-facing change; this PR introduces new features for future developers to use. Unit tests on `PluginContainerSuite`.

Closes apache#29977 from fsamuel-bs/SPARK-33088.

Authored-by: Samuel Souza <ssouza@palantir.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
What changes were proposed in this pull request?
Proposing a new set of APIs for ExecutorPlugins, to provide callbacks invoked at the start and end of each task of a job. Not very opinionated on the shape of the API; tried to be as minimal as possible for now.
Why are the changes needed?
Changes described in detail on SPARK-33088, but mostly this boils down to:
1. This feature was considered when the ExecutorPlugin API was initially introduced in apache#21923, but never implemented.
2. The use-case which requires this feature is to propagate tracing information from the driver to the executor, such that calls from the same job can all be traced.
   a. Tracing frameworks are usually set up in thread locals, therefore it's important for the setup to happen in the same thread which runs the tasks.
   b. Executors can be shared by multiple jobs, therefore it's not sufficient to set tracing information at executor startup time; it needs to happen every time a task starts or ends.
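Point (a) can be made concrete with a small sketch; the `TracingTaskPlugin` below is hypothetical, but shows why the callback has to run on the task thread: the trace id lives in a `ThreadLocal`, so installing it from any other thread would be invisible to the task body.

```scala
// Hypothetical tracing plugin: the trace id is stored in a ThreadLocal, so
// onTaskStart/onTaskEnd must execute on the same thread as the task body.
object TraceContext {
  private val current = new ThreadLocal[Option[String]] {
    override def initialValue(): Option[String] = None
  }
  def set(id: String): Unit = current.set(Some(id))
  def get: Option[String] = current.get()
  def clear(): Unit = current.remove()
}

class TracingTaskPlugin {
  def onTaskStart(traceId: String): Unit = TraceContext.set(traceId) // install per task
  def onTaskEnd(): Unit = TraceContext.clear() // clear regardless of outcome
}

val plugin = new TracingTaskPlugin
plugin.onTaskStart("job-42-task-7")
val seenByTaskBody = TraceContext.get // what user code on this thread observes
plugin.onTaskEnd()
```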
Does this PR introduce any user-facing change?
No. This PR introduces new features for future developers to use.
How was this patch tested?
Unit tests on `PluginContainerSuite`.