# [SPARK-33173][CORE][TESTS][FOLLOWUP] Use `local[2]` and `AtomicInteger` #30823

## Conversation
cc @fsamuel-bs @dongjoon-hyun @HyukjinKwon @viirya @mridulm Please take a look, thanks!
Looks OK pending tests. Does this need to go into 3.1? 3.0?
Kubernetes integration test starting
Kubernetes integration test status success
Test build #132956 has finished for PR 30823 at commit
lgtm
Thank you, @Ngone51 !
## Commit message: Use `local[2]` and `AtomicInteger`

### What changes were proposed in this pull request?

Use `local[2]` to let the two tasks launch at the same time, and change the counters (`numOnTaskXXX`) to `AtomicInteger` to ensure thread safety.

### Why are the changes needed?

The test is still flaky after the fix #30072. See: https://github.com/apache/spark/pull/30728/checks?check_run_id=1557987642

It is easy to reproduce if you run the test many times (e.g. 100) locally.

The test sets up a stage with 2 tasks to run on an executor with 1 core, so the 2 tasks have to be launched one by one. Task-2 is launched after task-1 fails. However, since failed tasks are not retried in local mode (`MAX_LOCAL_TASK_FAILURES = 1`), the stage aborts right away after task-1 fails and cancels the still-running task-2 at the same time. There is a chance that task-2 gets canceled before calling `PluginContainer.onTaskStart`, which leads to the test failure.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested manually after the fix; the test is no longer flaky.

Closes #30823 from Ngone51/debug-flaky-spark-33088.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 15616f4)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
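The thread-safety problem the `AtomicInteger` change guards against can be sketched outside Spark with a minimal Java program (hypothetical class and field names, not code from this PR): two threads increment a plain `int` and an `AtomicInteger` the same number of times, and only the atomic counter is guaranteed to end at the expected total.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CounterDemo {
    static int plain = 0;                              // unsynchronized counter
    static final AtomicInteger atomic = new AtomicInteger();

    /** Two threads each increment both counters 100_000 times. */
    static int[] race() throws InterruptedException {
        plain = 0;
        atomic.set(0);
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                plain++;                    // read-modify-write: increments can be lost
                atomic.incrementAndGet();   // atomic CAS loop: never loses an increment
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        return new int[] { atomic.get(), plain };
    }

    public static void main(String[] args) throws InterruptedException {
        int[] r = race();
        System.out.println("atomic = " + r[0]);  // always 200000
        System.out.println("plain  = " + r[1]);  // frequently less (lost updates)
    }
}
```

This mirrors why incrementing the test's `numOnTaskXXX` counters from concurrently running tasks needs atomic types once the tasks can actually overlap.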
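Similarly, the effect of `local[2]` vs. `local[1]` — one core forces the two tasks to run one after another, while two cores let them be in flight simultaneously — can be mimicked with a plain thread pool. This is a sketch with hypothetical names, not Spark's scheduler:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class LaunchDemo {
    /** Runs two "tasks" on a pool of the given size and reports the
     *  maximum number that were ever in flight at the same time. */
    static int maxInFlight(int poolSize) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(2);
        for (int i = 0; i < 2; i++) {
            pool.submit(() -> {
                int now = inFlight.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);   // record concurrency peak
                try { Thread.sleep(200); } catch (InterruptedException ignored) { }
                inFlight.decrementAndGet();
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
        return peak.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("pool=1 peak=" + maxInFlight(1)); // tasks run one by one
        System.out.println("pool=2 peak=" + maxInFlight(2)); // tasks overlap
    }
}
```

With a pool of 1 the peak is always 1 (serialized launch, where the abort/cancel race in the original test lives); with a pool of 2 the two tasks overlap, which is what the switch to `local[2]` achieves for the test's tasks.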
Since this test was added at SPARK-33088, merged to master/3.1.
Late +1
Nice, thanks @Ngone51.
Thanks all!