[SPARK-24266][K8S][3.0] Restart the watcher when we receive a version changed from k8s #29533
Conversation
ok to test

cc @holdenk

Test build #127914 has finished for PR 29533 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

We also see this problem with Spark 2.4.x, so I would very much like to see this landed for both versions (I was redirected here from the other PR).

OK, sounds like something to keep in mind.

This comment exists merely to link this PR with PR #29496, intended for the Spark 2.4 branch.

Any review update, @holdenk?

Jenkins retest this please.

I haven't had a chance to look, but given it's a backport, don't feel like you need to wait for me.

Test build #128023 has finished for PR 29533 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

Is there some way to get the expanded test information for this single failed R integration test, as described in the message? @stijndehaes Do you have any insight into the test suite changes and what made it succeed on the master branch, as discussed in #28423?

Test build #128053 has finished for PR 29533 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

Force-pushed from 745ee6b to 8ad475b

Test build #128054 has finished for PR 29533 at commit

Force-pushed from 8ad475b to 6449efa

Test build #128055 has finished for PR 29533 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

Kubernetes integration test starting

Kubernetes integration test status failure

Does the integration test flakiness described by @holdenk in SPARK-32354 [1] apply to this build?

Huh, did someone re-enable the R tests?

Kubernetes integration test status success

@holdenk OK, integration tests pass. It was just a two-word deletion...

@dongjoon-hyun Is this OK to merge?

  with BeforeAndAfterAll with BeforeAndAfter with BasicTestsSuite with SecretsTestsSuite
  with PythonTestsSuite with ClientModeTestsSuite with PodTemplateSuite with PVTestsSuite
- with DepsTestsSuite with RTestsSuite with Logging with Eventually with Matchers {
+ with DepsTestsSuite with Logging with Eventually with Matchers {

Please revert this; branch-3.0 doesn't have the R test issue.
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- Run SparkR on simple dataframe.R example
Run completed in 8 minutes, 12 seconds.
Total number of tests run: 19
Suites: completed 2, aborted 0
Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
@dongjoon-hyun We have a review loop.
@holdenk asked me to turn off the SparkR tests - see above.
And @holdenk created a story for master to make the integration tests work properly for SparkR here [1].
[1] https://issues.apache.org/jira/projects/SPARK/issues/SPARK-32354
dongjoon-hyun left a comment
Hi, @jkleckner.
I took a brief look. The change to KubernetesSuite.scala should be reverted from this PR.

@jkleckner I tried this patch in production but it does not seem to work.
Distribution:
Image creation:
When I execute my long-running spark application I get
I'm not sure if I made a mistake or there's a problem in the patch. Thanks

Your rebase of the patch looks correct when I tried it out. Unfortunately, there must be something in the patches to the k8s code after 2.4 and before the fix #28423 was merged. My interest in landing this patch has been to unblock #29496, due to the backporting policies for Spark 2.4, which is what we use, and I don't have a setup to test this for the 3.0 branch. @redsk, if you would like to look into that, it would be helpful. The candidate patch sets in the k8s code can be viewed with something like this: @stijndehaes, can you give guidance as to which of these might affect this backport?

@redsk @jkleckner The error line you are seeing comes from the class ExecutorPodsWatchSnapshotSource. This also looks like driver logs? The fix here is in the spark-submit application; you should watch the logs of that instead of the driver. Can you tell me whether these were driver logs or spark-submit logs?

Hi, @jkleckner and all.

@dongjoon-hyun Since I don't have a test environment to try out or observe the problem mentioned by @redsk, it will have to be taken up by someone like @redsk. I am fully on a critical path for the next many weeks, and the backport to 2.4 is working fine for us, so I can't spend any more time on this now. And as mentioned by @stijndehaes, this could very well be a separate issue from the fix that SPARK-24266 addresses. I did spend some time last week and thought the patch sets that upgrade fabric8 had fixes that might plausibly explain the problem seen by @redsk. I could revert the one-line patch that you want reverted but that was requested by @holdenk, but I feel caught in the middle and would prefer that you both agree.

…ged from k8s

Restart the watcher when it fails with an HTTP_GONE code from the Kubernetes API, which means a resource version has changed. For more relevant information see here: fabric8io/kubernetes-client#1075

No

Running spark-submit to a k8s cluster. Not sure how to make an automated test for this. If someone can help me out, that would be great.

Closes apache#28423 from stijndehaes/bugfix/k8s-submit-resource-version-change.

Address review comment to fully qualify the import of scala.util.control.
Rebase on branch-3.0 to fix the SparkR integration test.
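
For context, the HTTP_GONE close that this commit message describes is surfaced by fabric8 in the watcher's onClose callback. Here is a minimal sketch of the detection, assuming fabric8's 4.x Watcher API; DriverPodWatcher and its fields are illustrative names, not the actual Spark code:

```scala
import java.net.HttpURLConnection.HTTP_GONE
import java.util.concurrent.CountDownLatch

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}

// Illustrative watcher: records whether the watch ended because the resource
// version expired (HTTP 410 GONE) or because the driver pod finished.
class DriverPodWatcher extends Watcher[Pod] {
  @volatile var resourceVersionGone = false
  val watchEnded = new CountDownLatch(1)

  override def eventReceived(action: Watcher.Action, pod: Pod): Unit = {
    // The real watcher inspects the full pod status; here we simply treat the
    // terminal phases as "done watching". Status can be null early on.
    val phase = Option(pod.getStatus).map(_.getPhase).orNull
    if (phase == "Succeeded" || phase == "Failed") watchEnded.countDown()
  }

  override def onClose(cause: KubernetesClientException): Unit = {
    // A null cause means the watch was closed deliberately; HTTP 410 means the
    // resource version we were watching from is gone and we must re-watch.
    resourceVersionGone = cause != null && cause.getCode == HTTP_GONE
    watchEnded.countDown()
  }
}
```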

I reverted the SparkR change per your instruction and also rebased onto branch-3.0.

Force-pushed from 0605745 to 9015c82

Thanks, @jkleckner. For the SparkR part, we can ignore it in this PR if it fails again. And,

Test build #130558 has finished for PR 29533 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

dongjoon-hyun left a comment
+1, LGTM. Thank you, all!
Merged to branch-3.0 for Apache Spark 3.0.2.
I also hit the same ExecutorPodsWatchSnapshotSource issue reported by @redsk without this patch on the latest branch-3.0. I agree that it is another watcher issue.

… changed from k8s

What changes were proposed in this pull request?
This is a straight application of #28423 onto branch-3.0.
Restart the watcher when it fails with an HTTP_GONE code from the Kubernetes API, which means a resource version has changed.
For more relevant information see here: fabric8io/kubernetes-client#1075

Does this PR introduce any user-facing change?
No

How was this patch tested?
This was tested in #28423 by running spark-submit to a k8s cluster.

Closes #29533 from jkleckner/backport-SPARK-24266-to-branch-3.0.

Authored-by: Stijn De Haes <stijndehaes@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

@redsk Could you file a new JIRA for your observation, please?

@dongjoon-hyun Thank you for the merge and for hanging in there. Hopefully @shockdm will revive the 2.4 branch fix kicked off in #29496.

@dongjoon-hyun I've created SPARK-33349 as requested. Thanks.

Thank you so much, @redsk!

… changed from k8s

This is a backport of apache#29533 from master. It includes shockdm/pull/1, which has been squashed, with the import review comment included. It has also been rebased onto branch-2.4.

… changed from k8s

This is a backport of apache#29533 from master. It includes shockdm/pull/1, which has been squashed, with the import review comment included. It has also been rebased onto branch-2.4. Address review comments.
What changes were proposed in this pull request?
This is a straight application of #28423 onto branch-3.0.
Restart the watcher when it fails with an HTTP_GONE code from the Kubernetes API, which means a resource version has changed.
For more relevant information see here: fabric8io/kubernetes-client#1075
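
In code, the restart amounts to a loop that re-creates the watch whenever it ends with HTTP 410 GONE. Below is a minimal sketch under the same fabric8 4.x assumptions as the watcher sketch earlier in the thread; newWatcher and endedWithGone are illustrative stand-ins for Spark's actual watcher plumbing, not its real API:

```scala
import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClient, Watcher}

// Illustrative only: keep watching the driver pod, re-creating the watch each
// time the API server expires our resource version with HTTP 410 GONE.
// `newWatcher` builds something like the DriverPodWatcher sketched earlier;
// `endedWithGone` blocks until the watch ends and reports whether it ended
// with 410 (restart) rather than with driver completion (stop).
def awaitDriverCompletion(client: KubernetesClient,
                          driverPodName: String,
                          newWatcher: () => Watcher[Pod],
                          endedWithGone: Watcher[Pod] => Boolean): Unit = {
  var restart = true
  while (restart) {
    val watcher = newWatcher()
    val watch = client.pods().withName(driverPodName).watch(watcher)
    restart = endedWithGone(watcher) // blocks until the watch ends
    watch.close()
  }
}
```

Without such a loop, a single 410 close silently ends monitoring of a long-running driver, which matches the long-running-application symptom discussed above.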
Does this PR introduce any user-facing change?
No
How was this patch tested?
This was tested in #28423 by running spark-submit to a k8s cluster.