[SPARK-33173][CORE][TESTS] Use `eventually` to check `numOnTaskFailed` in PluginContainerSuite #30072

dongjoon-hyun · 2020-10-17T00:38:17Z

What changes were proposed in this pull request?

This PR aims to use `eventually` to fix the flakiness of the test case `SPARK-33088: executor failed tasks trigger plugin calls`.

Why are the changes needed?

The test case makes the following assertions.

assert(TestSparkPlugin.executorPlugin.numOnTaskStart == 2)
assert(TestSparkPlugin.executorPlugin.numOnTaskSucceeded == 0)
assert(TestSparkPlugin.executorPlugin.numOnTaskFailed == 2)

Although the first and second assertions pass, the third can fail.

sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 1 did not equal 2
	at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
	at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
	at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
	at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
	at org.apache.spark.internal.plugin.PluginContainerSuite.$anonfun$new$8(PluginContainerSuite.scala:161)

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This change only improves the robustness of the test.
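For illustration, the idea behind retrying the third assertion can be sketched in plain Scala. This is a minimal sketch, not the code of this patch: `pollUntil` is a hypothetical helper that mimics what ScalaTest's `eventually` does (re-evaluate a check until it holds or a timeout expires), and the thread below only simulates the assumption that the second failure callback may arrive late.

```scala
import java.util.concurrent.atomic.AtomicInteger

object EventuallySketch {
  // Hypothetical stand-in for a plugin counter such as numOnTaskFailed.
  val numOnTaskFailed = new AtomicInteger(0)

  // Hypothetical helper: re-evaluates `check` every `intervalMs` milliseconds
  // until it holds or `timeoutMs` milliseconds have elapsed.
  // Returns true if the check held before the deadline.
  def pollUntil(timeoutMs: Long, intervalMs: Long)(check: => Boolean): Boolean = {
    val deadline = System.nanoTime() + timeoutMs * 1000000L
    var ok = check
    while (!ok && System.nanoTime() < deadline) {
      Thread.sleep(intervalMs)
      ok = check
    }
    ok
  }

  def run(): Boolean = {
    numOnTaskFailed.set(0)
    // Assumption for the sketch: two failure callbacks are delivered on
    // another thread, and the second one arrives noticeably later.
    val callbacks = new Thread(new Runnable {
      def run(): Unit = {
        numOnTaskFailed.incrementAndGet()
        Thread.sleep(200)
        numOnTaskFailed.incrementAndGet()
      }
    })
    callbacks.start()
    // A one-shot `numOnTaskFailed.get() == 2` here could observe 1;
    // polling tolerates the late second callback.
    val result = pollUntil(timeoutMs = 5000, intervalMs = 50) {
      numOnTaskFailed.get() == 2
    }
    callbacks.join()
    result
  }
}
```

In a real suite the same effect would come from wrapping the assertion in ScalaTest's `eventually` with a timeout and a polling interval, rather than from a hand-written loop.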

dongjoon-hyun · 2020-10-17T00:48:25Z

cc @mridulm and @tgravescs

SparkQA · 2020-10-17T01:21:59Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34528/

dongjoon-hyun · 2020-10-17T01:50:14Z

Could you review this, @viirya ?

SparkQA · 2020-10-17T01:51:33Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34528/

dongjoon-hyun · 2020-10-17T01:52:13Z

This test is very flaky. I have seen it fail in at least three places (Jenkins, another PR of mine, and @sunchao's PR).

viirya

lgtm

HyukjinKwon

Looks good. Let's make it pass first.

dongjoon-hyun · 2020-10-17T02:19:34Z

Thanks, @viirya and @HyukjinKwon .

SparkQA · 2020-10-17T03:23:37Z

Test build #129923 has finished for PR 30072 at commit c5f9c1e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-10-17T04:23:09Z

Thank you, @viirya , @HyukjinKwon , @mridulm .
Merged to master.

### What changes were proposed in this pull request?

Use `local[2]` so that the tasks launch at the same time, and change the counters (`numOnTaskXXX`) to the `AtomicInteger` type to ensure thread safety.

### Why are the changes needed?

The test is still flaky after the fix in #30072. See: https://github.com/apache/spark/pull/30728/checks?check_run_id=1557987642

It is also easy to reproduce by running the test multiple times (e.g. 100) locally.

The test sets up a stage with 2 tasks to run on an executor with 1 core, so the 2 tasks have to be launched one by one: task-2 is launched after task-1 fails. However, since failed tasks are not retried in local mode (MAX_LOCAL_TASK_FAILURES = 1), the stage aborts right away after task-1 fails and cancels the running task-2 at the same time. There is a chance that task-2 gets canceled before calling `PluginContainer.onTaskStart`, which leads to the test failure.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested manually after the fix; the test is no longer flaky.

Closes #30823 from Ngone51/debug-flaky-spark-33088.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
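The thread-safety part of the follow-up can be illustrated with a small, self-contained sketch. It is not the test plugin from the suite: `CountingPlugin`, `run`, and the two-thread pool are hypothetical names and assumptions, chosen only to show callbacks being invoked concurrently (as two task slots would allow) while `AtomicInteger` counters still add up.

```scala
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

object AtomicCounterSketch {
  // Hypothetical plugin-like class whose callbacks may run on several threads.
  class CountingPlugin {
    val numOnTaskStart = new AtomicInteger(0)
    val numOnTaskFailed = new AtomicInteger(0)

    // incrementAndGet is atomic, so concurrent callbacks never lose an update.
    def onTaskStart(): Unit = { numOnTaskStart.incrementAndGet(); () }
    def onTaskFailed(): Unit = { numOnTaskFailed.incrementAndGet(); () }
  }

  // Runs `tasks` simulated failing tasks on a pool of two threads,
  // loosely analogous to two task slots running at the same time.
  def run(tasks: Int): CountingPlugin = {
    val plugin = new CountingPlugin
    val pool = Executors.newFixedThreadPool(2)
    val done = new CountDownLatch(tasks)
    (1 to tasks).foreach { _ =>
      pool.execute(new Runnable {
        def run(): Unit = {
          plugin.onTaskStart()
          plugin.onTaskFailed()
          done.countDown()
        }
      })
    }
    // Wait for every simulated task before reading the counters.
    done.await(10, TimeUnit.SECONDS)
    pool.shutdown()
    plugin
  }
}
```

With plain `var` counters incremented by `+= 1`, the same concurrent run could lose updates, which is the kind of problem the move to `AtomicInteger` guards against.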

The follow-up change from #30823 was also cherry-picked (cherry picked from commit 15616f4). Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

…` in PluginContainerSuite

### What changes were proposed in this pull request?

This PR aims to use `eventually` to fix the flakiness of the test case `SPARK-33088: executor failed tasks trigger plugin calls`.

### Why are the changes needed?

The test case checks like the following.

```scala
assert(TestSparkPlugin.executorPlugin.numOnTaskStart == 2)
assert(TestSparkPlugin.executorPlugin.numOnTaskSucceeded == 0)
assert(TestSparkPlugin.executorPlugin.numOnTaskFailed == 2)
```

Although first and second passed, the third can fail.

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-hive-2.3-jdk-11/lastCompletedBuild/testReport/org.apache.spark.internal.plugin/PluginContainerSuite/SPARK_33088__executor_failed_tasks_trigger_plugin_calls/
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129919/testReport/

```
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 1 did not equal 2
	at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
	at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
	at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
	at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
	at org.apache.spark.internal.plugin.PluginContainerSuite.$anonfun$new$8(PluginContainerSuite.scala:161)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This only improves the robustness.

Closes apache#30072 from dongjoon-hyun/SPARK-33173.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

[SPARK-33173][CORE][TESTS] Use eventually to check numOnTaskFailed

c5f9c1e

dongjoon-hyun mentioned this pull request Oct 17, 2020

[SPARK-33088][CORE] Enhance ExecutorPlugin API to include callbacks on task start and end events #29977

Closed

dongjoon-hyun changed the title ~~[SPARK-33173][CORE][TESTS] Use eventually to check numOnTaskFailed~~ [SPARK-33173][CORE][TESTS] Use eventually to check numOnTaskFailed in PluginContainerSuite Oct 17, 2020

viirya approved these changes Oct 17, 2020

HyukjinKwon approved these changes Oct 17, 2020

mridulm approved these changes Oct 17, 2020

dongjoon-hyun closed this in 911dcd3 Oct 17, 2020

dongjoon-hyun deleted the SPARK-33173 branch October 17, 2020 04:23

Ngone51 mentioned this pull request Dec 17, 2020

[SPARK-33173][CORE][TESTS][FOLLOWUP] Use local[2] and AtomicInteger #30823

Closed
