[SPARK-33173][CORE][TESTS][FOLLOWUP] Use `local[2]` and AtomicInteger #30823

Ngone51 · 2020-12-17T13:36:37Z

What changes were proposed in this pull request?

Use local[2] so that both tasks launch at the same time, and change the counters (numOnTaskXXX) to AtomicInteger so that they are thread-safe.
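To illustrate why the counters need to be atomic once two tasks run concurrently, here is a minimal, self-contained sketch. It is not the code in PluginContainerSuite: the class name `AtomicCounterSketch`, the method `runTasks`, and its parameters are hypothetical, and plain threads stand in for the two task slots that local[2] provides.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounterSketch {

    /**
     * Runs numThreads workers that each bump a shared counter callsPerThread
     * times, and returns the final count. Hypothetical helper for illustration.
     */
    public static int runTasks(int numThreads, int callsPerThread)
            throws InterruptedException {
        // Stand-in for a test counter such as numOnTaskStart (assumed name).
        AtomicInteger numOnTaskStart = new AtomicInteger(0);
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        // Gate that releases all workers together, so the increments overlap.
        CountDownLatch startGate = new CountDownLatch(1);

        for (int t = 0; t < numThreads; t++) {
            pool.submit(() -> {
                try {
                    startGate.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < callsPerThread; i++) {
                    // Atomic read-modify-write: no update can be lost.
                    // A plain `int` field with `count++` could lose updates here.
                    numOnTaskStart.incrementAndGet();
                }
            });
        }

        startGate.countDown();
        pool.shutdown();
        if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
            throw new IllegalStateException("workers did not finish in time");
        }
        return numOnTaskStart.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runTasks(2, 100_000)); // prints 200000
    }
}
```

With an `AtomicInteger` the result is always `numThreads * callsPerThread`. With an unsynchronized counter the total could come out lower, because concurrent increments may overwrite each other.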

Why are the changes needed?

This test was added in SPARK-33088 and is still flaky after the fix in #30072. See: https://github.com/apache/spark/pull/30728/checks?check_run_id=1557987642

It is also easy to reproduce locally by running the test many times (e.g. 100).

The test sets up a stage with 2 tasks to run on an executor with 1 core, so the 2 tasks have to be launched one after the other: task-2 is launched only after task-1 fails.
However, since failed tasks are not retried in local mode (MAX_LOCAL_TASK_FAILURES = 1), the stage aborts right after task-1 fails and cancels the running task-2 at the same time. There is a chance that task-2 gets canceled before PluginContainer.onTaskStart is called, which makes the test fail.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested manually: after the fix, the test is no longer flaky.

Ngone51 · 2020-12-17T13:38:16Z

cc @fsamuel-bs @dongjoon-hyun @HyukjinKwon @viirya @mridulm Please take a look, thanks!

srowen

Looks OK pending tests. Does this need to go into 3.1? 3.0?

SparkQA · 2020-12-17T14:57:07Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37559/

SparkQA · 2020-12-17T15:27:26Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37559/

SparkQA · 2020-12-17T16:49:41Z

Test build #132956 has finished for PR 30823 at commit c567315.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya

lgtm

dongjoon-hyun

Thank you, @Ngone51 !

### What changes were proposed in this pull request?

Use `local[2]` to let tasks launch at the same time. And change counters (`numOnTaskXXX`) to `AtomicInteger` type to ensure thread safe.

### Why are the changes needed?

The test is still flaky after the fix #30072. See: https://github.com/apache/spark/pull/30728/checks?check_run_id=1557987642

And it's easy to reproduce if you test it multiple times (e.g. 100) locally.

The test sets up a stage with 2 tasks to run on an executor with 1 core. So these 2 tasks have to be launched one by one. The task-2 will be launched after task-1 fails. However, since we don't retry failed task in local mode (MAX_LOCAL_TASK_FAILURES = 1), the stage will abort right away after task-1 fail and cancels the running task-2 at the same time. There's a chance that task-2 gets canceled before calling `PluginContainer.onTaskStart`, which leads to the test failure.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested manually after the fix and the test is no longer flaky.

Closes #30823 from Ngone51/debug-flaky-spark-33088.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 15616f4)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

dongjoon-hyun · 2020-12-17T17:30:02Z

Since this test was added in SPARK-33088, I merged this to master/3.1.

mridulm · 2020-12-17T19:11:19Z

Late +1
Thanks @Ngone51 !

HyukjinKwon · 2020-12-18T01:12:01Z

Nice, thanks @Ngone51.

Ngone51 · 2020-12-18T02:40:41Z

thanks all!

Ngone51 added 3 commits December 17, 2020 21:15

fix

2398680

.

5310448

.

c567315

github-actions bot added the CORE label Dec 17, 2020

Ngone51 mentioned this pull request Dec 17, 2020

[SPARK-33756][SQL] Make BytesToBytesMap's MapIterator idempotent #30728

Closed

srowen approved these changes Dec 17, 2020

viirya approved these changes Dec 17, 2020

dongjoon-hyun approved these changes Dec 17, 2020

dongjoon-hyun changed the title ~~[SPARK-33173][CORE][TESTS] Fix flaky "SPARK-33088: executor failed tasks trigger plugin calls" in PluginContainerSuite~~ [SPARK-33173][CORE][TESTS][FOLLOWUP] Use local[2] and AtomicInteger Dec 17, 2020

dongjoon-hyun closed this in 15616f4 Dec 17, 2020

Ngone51 deleted the debug-flaky-spark-33088 branch December 18, 2020 02:40
