[SPARK-33173][CORE][TESTS] Use `eventually` to check `numOnTaskFailed` in PluginContainerSuite
#30072
cc @mridulm and @tgravescs
Kubernetes integration test starting
Could you review this, @viirya?
Kubernetes integration test status success
This is very flaky. I saw this in at least 3 places (Jenkins, another PR of mine, and @sunchao's PR).
lgtm
Looks good. Let's make it pass first.
Thanks, @viirya and @HyukjinKwon.
Test build #129923 has finished for PR 30072 at commit
Thank you, @viirya, @HyukjinKwon, @mridulm.
### What changes were proposed in this pull request?

Use `local[2]` to let tasks launch at the same time, and change the counters (`numOnTaskXXX`) to `AtomicInteger` to ensure thread safety.

### Why are the changes needed?

The test is still flaky after the fix #30072. See: https://github.com/apache/spark/pull/30728/checks?check_run_id=1557987642

It is also easy to reproduce if you run the test multiple times (e.g. 100) locally.

The test sets up a stage with 2 tasks to run on an executor with 1 core, so the 2 tasks have to be launched one by one: task-2 is launched after task-1 fails. However, since failed tasks are not retried in local mode (MAX_LOCAL_TASK_FAILURES = 1), the stage aborts right after task-1 fails and cancels the running task-2 at the same time. There is a chance that task-2 gets canceled before calling `PluginContainer.onTaskStart`, which leads to the test failure.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested manually after the fix and the test is no longer flaky.

Closes #30823 from Ngone51/debug-flaky-spark-33088.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 15616f4)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
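The follow-up commit above switches the `numOnTaskXXX` counters to `AtomicInteger` because, with `local[2]`, two tasks can update them at the same time. A minimal sketch of that idea follows. `PluginCounters`, `AtomicCounterSketch`, and `runFailingTasks` are hypothetical names used only for illustration, and plain threads stand in for Spark tasks; this is not the actual test plugin code.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for the test plugin's counters (not the Spark test code).
class PluginCounters {
  val numOnTaskStart = new AtomicInteger(0)
  val numOnTaskFailed = new AtomicInteger(0)
}

object AtomicCounterSketch {
  // Simulates `numTasks` tasks that start and then fail concurrently,
  // one thread per task, and returns the counters they updated.
  def runFailingTasks(numTasks: Int): PluginCounters = {
    val counters = new PluginCounters
    val threads = (1 to numTasks).map { _ =>
      new Thread(new Runnable {
        override def run(): Unit = {
          // incrementAndGet is atomic, so concurrent updates are never lost.
          counters.numOnTaskStart.incrementAndGet()
          counters.numOnTaskFailed.incrementAndGet()
        }
      })
    }
    threads.foreach(_.start())
    // join() makes every increment visible to the caller before we return.
    threads.foreach(_.join())
    counters
  }

  def main(args: Array[String]): Unit = {
    val counters = runFailingTasks(2)
    println(s"started=${counters.numOnTaskStart.get()} failed=${counters.numOnTaskFailed.get()}")
  }
}
```

With a plain `var` counter, two threads could read the same old value and both write back the same new value, so an increment could be lost; the atomic counter avoids that without explicit locking.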
…` in PluginContainerSuite

### What changes were proposed in this pull request?

This PR aims to use `eventually` to fix the flakiness of the test case `SPARK-33088: executor failed tasks trigger plugin calls`.

### Why are the changes needed?

The test case checks like the following.

```scala
assert(TestSparkPlugin.executorPlugin.numOnTaskStart == 2)
assert(TestSparkPlugin.executorPlugin.numOnTaskSucceeded == 0)
assert(TestSparkPlugin.executorPlugin.numOnTaskFailed == 2)
```

Although first and second passed, the third can fail.

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-hive-2.3-jdk-11/lastCompletedBuild/testReport/org.apache.spark.internal.plugin/PluginContainerSuite/SPARK_33088__executor_failed_tasks_trigger_plugin_calls/
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129919/testReport/

```
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 1 did not equal 2
  at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
  at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
  at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
  at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
  at org.apache.spark.internal.plugin.PluginContainerSuite.$anonfun$new$8(PluginContainerSuite.scala:161)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This only improves the robustness.

Closes apache#30072 from dongjoon-hyun/SPARK-33173.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

This PR aims to use `eventually` to fix the flakiness of the test case `SPARK-33088: executor failed tasks trigger plugin calls`.

### Why are the changes needed?

The test case asserts `numOnTaskStart == 2`, `numOnTaskSucceeded == 0`, and `numOnTaskFailed == 2` on `TestSparkPlugin.executorPlugin`. Although the first and second assertions pass, the third can fail, with the failure reported as `1 did not equal 2`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This only improves the robustness.
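To illustrate why polling helps with the `numOnTaskFailed` check, here is a minimal, self-contained sketch. It does not use ScalaTest: `retryUntilPasses` is a hand-rolled, hypothetical stand-in for `eventually` (which ScalaTest provides in `org.scalatest.concurrent.Eventually`), and `simulateDelayedFailures` is an assumed helper that mimics a failure callback arriving after the job has already returned. The timeout and interval values are arbitrary.

```scala
import java.util.concurrent.atomic.AtomicInteger

object EventuallySketch {
  // Hand-rolled stand-in for ScalaTest's `eventually` (assumption: simplified
  // semantics). Re-runs `check` until it passes or `timeoutMs` has elapsed,
  // in which case the last assertion failure is rethrown.
  def retryUntilPasses(timeoutMs: Long, intervalMs: Long)(check: => Unit): Unit = {
    val deadline = System.nanoTime() + timeoutMs * 1000000L
    var done = false
    while (!done) {
      try {
        check
        done = true
      } catch {
        // Predef.assert throws java.lang.AssertionError on failure.
        case e: AssertionError =>
          if (System.nanoTime() >= deadline) throw e
          Thread.sleep(intervalMs)
      }
    }
  }

  // Hypothetical helper: bumps `counter` n times from a background thread,
  // sleeping `delayMs` before each increment, like a late failure callback.
  def simulateDelayedFailures(counter: AtomicInteger, n: Int, delayMs: Long): Thread = {
    val t = new Thread(new Runnable {
      override def run(): Unit = {
        (1 to n).foreach { _ =>
          Thread.sleep(delayMs)
          counter.incrementAndGet()
        }
      }
    })
    t.start()
    t
  }

  def main(args: Array[String]): Unit = {
    val numOnTaskFailed = new AtomicInteger(0)
    simulateDelayedFailures(numOnTaskFailed, 2, 50L)
    // A one-shot assert here could observe 0 or 1; polling waits for 2.
    retryUntilPasses(5000L, 10L) { assert(numOnTaskFailed.get() == 2) }
    println("numOnTaskFailed reached 2")
  }
}
```

The polling check only tolerates a counter that arrives late. It cannot help if a callback never fires at all, which is the case the later `local[2]` change addresses.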