[SPARK-33173][CORE][TESTS][FOLLOWUP] Use local[2] and AtomicInteger #30823

Closed
wants to merge 3 commits into from


Ngone51
Member

@Ngone51 Ngone51 commented Dec 17, 2020

What changes were proposed in this pull request?

Use local[2] so that both tasks launch at the same time, and change the counters (numOnTaskXXX) to AtomicInteger to ensure thread safety.
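
As a rough illustration of why the counters need to be atomic once two tasks run in parallel, here is a minimal sketch in plain Java (the suite itself is Scala). The class and method names are hypothetical and are not taken from the Spark test; only AtomicInteger is the real JDK class.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not the code from PluginContainerSuite.
class AtomicCounterSketch {

    // Increments one shared counter from several threads and returns the final value.
    static int countConcurrently(int threads, int incrementsPerThread) {
        AtomicInteger counter = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    counter.incrementAndGet(); // atomic read-modify-write
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(countConcurrently(2, 100_000)); // prints 200000
    }
}
```

With a plain int field, the two threads could interleave their read-modify-write steps and lose increments, so an assertion on an exact count could fail intermittently.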

Why are the changes needed?

This test was added in SPARK-33088 and is still flaky after the fix in #30072. See: https://github.com/apache/spark/pull/30728/checks?check_run_id=1557987642

It is also easy to reproduce by running the test repeatedly (e.g. 100 times) locally.

The test sets up a stage with 2 tasks to run on an executor with 1 core, so the 2 tasks have to be launched one by one.
Task-2 is launched only after task-1 fails. However, since failed tasks are not retried in local mode (MAX_LOCAL_TASK_FAILURES = 1), the stage aborts right after task-1 fails and cancels the running task-2 at the same time. There's a chance that task-2 gets canceled before it calls PluginContainer.onTaskStart, which leads to the test failure.
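
The ordering problem described above can be sketched with a toy model in plain Java. This is not Spark code: the class, the method, and the stageAborted flag are hypothetical stand-ins for the scheduler's abort-and-cancel path, and the latch is an assumption added so the model is deterministic (it forces the unlucky interleaving every time, whereas in the real test it only happens occasionally).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical toy model of the race; it does not use any Spark API.
class TaskCancelRaceSketch {

    // Runs two always-failing tasks on `slots` worker threads and returns
    // how many of them reached the stand-in for the onTaskStart hook.
    static int startedTasks(int slots) {
        AtomicInteger numOnTaskStart = new AtomicInteger();
        AtomicBoolean stageAborted = new AtomicBoolean(false);
        // Tasks that can run side by side wait for each other before failing.
        CountDownLatch running = new CountDownLatch(Math.min(slots, 2));

        Runnable task = () -> {
            if (stageAborted.get()) {
                return; // cancelled before the start hook ran
            }
            numOnTaskStart.incrementAndGet(); // stand-in for onTaskStart
            running.countDown();
            try {
                running.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            stageAborted.set(true); // the failure aborts the stage (no retries)
        };

        ExecutorService pool = Executors.newFixedThreadPool(slots);
        pool.submit(task);
        pool.submit(task);
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return numOnTaskStart.get();
    }

    public static void main(String[] args) {
        System.out.println("1 slot:  " + startedTasks(1)); // second task never starts
        System.out.println("2 slots: " + startedTasks(2)); // both tasks start
    }
}
```

With one slot, the second task can only begin after the first has already failed and aborted the stage, so its start hook may never run. With two slots, both tasks are already running when the failure happens, which is the effect local[2] is meant to have in the test.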

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested manually: after the fix, the test is no longer flaky.

@github-actions github-actions bot added the CORE label Dec 17, 2020
@Ngone51
Member Author

Ngone51 commented Dec 17, 2020

cc @fsamuel-bs @dongjoon-hyun @HyukjinKwon @viirya @mridulm Please take a look, thanks!

Member

@srowen srowen left a comment

Looks OK pending tests. Does this need to go into 3.1? 3.0?

@SparkQA

SparkQA commented Dec 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37559/

@SparkQA

SparkQA commented Dec 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37559/

@SparkQA

SparkQA commented Dec 17, 2020

Test build #132956 has finished for PR 30823 at commit c567315.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@viirya viirya left a comment

lgtm

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thank you, @Ngone51 !

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-33173][CORE][TESTS] Fix flaky "SPARK-33088: executor failed tasks trigger plugin calls" in PluginContainerSuite [SPARK-33173][CORE][TESTS][FOLLOWUP] Use local[2] and AtomicInteger Dec 17, 2020
dongjoon-hyun pushed a commit that referenced this pull request Dec 17, 2020
Closes #30823 from Ngone51/debug-flaky-spark-33088.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 15616f4)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun
Member

Since this test was added in SPARK-33088, this is merged to master/3.1.

@mridulm
Contributor

mridulm commented Dec 17, 2020

Late +1
Thanks @Ngone51 !

@HyukjinKwon
Member

Nice, thanks @Ngone51.

@Ngone51
Member Author

Ngone51 commented Dec 18, 2020

thanks all!

@Ngone51 Ngone51 deleted the debug-flaky-spark-33088 branch December 18, 2020 02:40