@jkleckner

What changes were proposed in this pull request?

This is a straight application of #28423 onto branch-3.0

Restart the watcher when it fails with an HTTP_GONE code from the Kubernetes API, which means a resource version has changed.

For more relevant information see here: fabric8io/kubernetes-client#1075
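
A minimal, self-contained sketch of the restart-on-HTTP_GONE idea described above, written in Python for brevity. It is not the code from #28423 and it does not use the fabric8 client: `WatchClosed`, `FakeApiServer`, and `watch_until_done` are hypothetical names, and the fake server simply expires a configurable number of watches with status 410.

```python
HTTP_GONE = 410  # HTTP status returned when a watch's resource version has expired


class WatchClosed(Exception):
    """Hypothetical stand-in for a watch that the server closed with an error."""

    def __init__(self, code):
        super().__init__(f"watch closed with HTTP {code}")
        self.code = code


class FakeApiServer:
    """Hypothetical server: the first `gone_count` watches expire with 410."""

    def __init__(self, gone_count, phases):
        self.gone_count = gone_count
        self.phases = list(phases)
        self.watches_opened = 0

    def watch(self):
        """Yield pod phases; expire mid-stream while gone_count > 0."""
        self.watches_opened += 1
        if self.gone_count > 0:
            self.gone_count -= 1
            yield "Pending"
            raise WatchClosed(HTTP_GONE)
        yield from self.phases


def watch_until_done(server, max_restarts=5):
    """Follow pod phases, reopening the watch whenever it closes with 410."""
    seen = []
    for _ in range(max_restarts + 1):
        try:
            for phase in server.watch():
                seen.append(phase)
                if phase in ("Succeeded", "Failed"):
                    return seen
        except WatchClosed as closed:
            if closed.code != HTTP_GONE:
                raise  # other errors are not retried in this sketch
            continue  # 410: the old watch is unusable, so open a fresh one
        break  # the watch ended cleanly without a terminal phase
    return seen
```

With `gone_count=2`, the loop opens three watches in total and still reaches the terminal phase. Without the restart, the first 410 would end the watch and no further pod status would be observed.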

Does this PR introduce any user-facing change?

No

How was this patch tested?

This was tested in #28423 by running spark-submit to a k8s cluster.

@dongjoon-hyun
Member

ok to test

@dongjoon-hyun
Member

cc @holdenk

@SparkQA

SparkQA commented Aug 26, 2020

Test build #127914 has finished for PR 29533 at commit 8ad475b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32540/

@SparkQA

SparkQA commented Aug 26, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32540/

@SQUIDwarrior

SQUIDwarrior commented Aug 27, 2020

We also see this problem with Spark 2.4.X so I would very much like to see this landed for both versions (I was redirected here from the other PR).

@holdenk
Contributor

holdenk commented Aug 27, 2020

Ok sounds like something to keep in mind.
Jenkins retest this please.

@jkleckner
Author

This comment exists merely to link this PR with PR #29496 intended for Spark 2.4 branch

@dongjoon-hyun
Member

Any review update, @holdenk ?

@holdenk
Contributor

holdenk commented Aug 29, 2020

Jenkins retest this please.

@holdenk
Contributor

holdenk commented Aug 29, 2020

I haven’t had a chance to look, but given it’s a backport, don’t feel like you need to wait for me.

@SparkQA

SparkQA commented Aug 29, 2020

Test build #128023 has finished for PR 29533 at commit 8ad475b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32650/

@SparkQA

SparkQA commented Aug 30, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32650/

@jkleckner
Author

jkleckner commented Aug 30, 2020

Is there some way to get the expanded test information for this single failed R integration test, as described in the message?

@stijndehaes Do you have some insight into the test suite changes and what made it succeed on the master branch, as discussed in #28423?

- Run SparkR on simple dataframe.R example *** FAILED ***
  The code passed to eventually never returned normally. Attempted 70 times over 2.0003406448333334 minutes. Last failure message: false was not true. (KubernetesSuite.scala:315)
Run completed in 12 minutes, 36 seconds.
Total number of tests run: 19
Suites: completed 2, aborted 0
Tests: succeeded 18, failed 1, canceled 0, ignored 0, pending 0
*** 1 TEST FAILED ***
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 3.0.2-SNAPSHOT:
[INFO] 
[INFO] Spark Project Parent POM ........................... SUCCESS [  3.625 s]
[INFO] Spark Project Tags ................................. SUCCESS [  8.191 s]
[INFO] Spark Project Local DB ............................. SUCCESS [  3.628 s]
[INFO] Spark Project Networking ........................... SUCCESS [  5.584 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  2.920 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 10.042 s]
[INFO] Spark Project Launcher ............................. SUCCESS [  3.774 s]
[INFO] Spark Project Core ................................. SUCCESS [02:20 min]
[INFO] Spark Project Kubernetes Integration Tests ......... FAILURE [15:49 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  18:48 min
[INFO] Finished at: 2020-08-29T17:24:33-07:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.scalatest:scalatest-maven-plugin:2.0.0:test (integration-test) on project spark-kubernetes-integration-tests_2.12: There are test failures -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <args> -rf :spark-kubernetes-integration-tests_2.12
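
For readers unfamiliar with the failure message above: ScalaTest's `eventually` retries a block until it passes or a timeout expires, then reports the attempt count and elapsed time. The Python sketch below illustrates that retry-until-timeout pattern only. It is not ScalaTest's API, and the `eventually` helper, its parameters, and the message format are assumptions made for illustration.

```python
import time


def eventually(check, timeout_s, interval_s, clock=time.monotonic, sleep=time.sleep):
    """Retry `check` until it passes or `timeout_s` elapses (hypothetical helper).

    `clock` and `sleep` are injectable so the loop can be exercised without waiting.
    """
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        try:
            return check()  # success: hand back whatever the check returned
        except AssertionError as last_failure:
            elapsed = clock() - start
            if elapsed >= timeout_s:
                # Give up and report how hard we tried, like the log line above
                raise TimeoutError(
                    f"never returned normally: attempted {attempts} times "
                    f"over {elapsed:.1f} s; last failure: {last_failure}"
                ) from last_failure
            sleep(interval_s)  # wait before the next attempt
```

A failure such as "Attempted 70 times over 2.0003406448333334 minutes" therefore says only that the condition stayed false for the whole window. The underlying cause has to come from the pod or driver logs.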

@SparkQA

SparkQA commented Aug 30, 2020

Test build #128053 has finished for PR 29533 at commit 745ee6b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32679/

@SparkQA

SparkQA commented Aug 30, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32679/

@jkleckner jkleckner force-pushed the backport-SPARK-24266-to-branch-3.0 branch from 745ee6b to 8ad475b Compare August 30, 2020 22:27
@SparkQA

SparkQA commented Aug 30, 2020

Test build #128054 has finished for PR 29533 at commit 8ad475b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkleckner jkleckner force-pushed the backport-SPARK-24266-to-branch-3.0 branch from 8ad475b to 6449efa Compare August 30, 2020 22:48
@SparkQA

SparkQA commented Aug 30, 2020

Test build #128055 has finished for PR 29533 at commit 6449efa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32680/

@SparkQA

SparkQA commented Aug 30, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32680/

@SparkQA

SparkQA commented Aug 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32681/

@SparkQA

SparkQA commented Aug 31, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/32681/

@jkleckner
Author

Does the integration test flakiness described by @holdenk in SPARK-32354 [1] apply to this build?

[1] https://issues.apache.org/jira/browse/SPARK-32354

@holdenk
Contributor

holdenk commented Aug 31, 2020

Huh did someone re-enable the R tests?
Anyways Jenkins retest this please.

@SparkQA

SparkQA commented Oct 6, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34034/

@jkleckner
Author

@holdenk Ok, integration tests pass.

It was just a two-word deletion...

@jkleckner
Author

@dongjoon-hyun Is this ok to merge?

  with BeforeAndAfterAll with BeforeAndAfter with BasicTestsSuite with SecretsTestsSuite
  with PythonTestsSuite with ClientModeTestsSuite with PodTemplateSuite with PVTestsSuite
- with DepsTestsSuite with RTestsSuite with Logging with Eventually with Matchers {
+ with DepsTestsSuite with Logging with Eventually with Matchers {
Member

Please revert this. branch-3.0 doesn't have R test issue.

KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- Run SparkR on simple dataframe.R example
Run completed in 8 minutes, 12 seconds.
Total number of tests run: 19
Suites: completed 2, aborted 0
Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

Author

@dongjoon-hyun We have a review loop.
@holdenk asked me to turn off the SparkR tests - see above.
And @holdenk created a story for master to make the integration tests work properly for SparkR here [1].

[1] https://issues.apache.org/jira/projects/SPARK/issues/SPARK-32354

Member

@dongjoon-hyun dongjoon-hyun left a comment

Hi, @jkleckner.

I briefly took a look. The change to KubernetesSuite.scala should be reverted from this PR.

@redsk
Contributor

redsk commented Oct 22, 2020

@jkleckner I tried this patch in production but it does not seem to work.

Distribution:

git clone https://github.com/apache/spark.git
git checkout branch-3.0
git fetch origin pull/29533/head:backport-SPARK-24266-to-branch-3.0
git checkout backport-SPARK-24266-to-branch-3.0
git checkout -b rebased-backport-SPARK-24266-to-branch-3.0
git rebase branch-3.0
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"
./dev/make-distribution.sh --name spark-3.0-24266 --tgz -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes

Image creation:

./bin/docker-image-tool.sh -r my-registry -t my-tag -n -u 0 -b java_image_tag=11-jre-slim build
./bin/docker-image-tool.sh -r my-registry -t my-tag push

When I execute my long-running spark application I get

...
20/10/22 17:30:20 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 1543007015 (1543067888)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)
	at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
	at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)

I'm not sure if I made a mistake or there's a problem in the patch. Thanks
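
When triaging logs like the one above, it can help to pull the two numbers out of the `too old resource version` message. The sketch below is a hypothetical helper, not part of Spark or fabric8. Reading the first number as the version the watch requested and the second as the oldest version the server can still serve is an assumption about the message format, not something stated in this thread.

```python
import re

# Matches e.g. "too old resource version: 1543007015 (1543067888)"
TOO_OLD = re.compile(r"too old resource version: (\d+) \((\d+)\)")


def parse_too_old(message):
    """Return (requested, oldest_available) as ints, or None if no match.

    The meaning attached to the two fields is an assumption (see lead-in).
    """
    match = TOO_OLD.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = (
    "io.fabric8.kubernetes.client.KubernetesClientException: "
    "too old resource version: 1543007015 (1543067888)"
)
requested, oldest = parse_too_old(line)
print(f"requested={requested} oldest={oldest} gap={oldest - requested}")
```

Under that reading, a large gap suggests the watch was resuming from a version the server had long since discarded, which is the situation a watcher restart is meant to recover from.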

@jkleckner
Author

> I tried this patch in production but it does not seem to work.
>
> ...
> I'm not sure if I made a mistake or there's a problem in the patch. Thanks

Your rebase of the patch looked correct when I tried it out.

Unfortunately, the fix must depend on something in the patches made to the k8s code after 2.4 and before #28423 was merged.

My interest in landing this patch has been to unblock #29496, given the backporting policies for Spark 2.4, which is what we use. I don't have a setup to test this on the 3.0 branch.

@redsk if you would like to look into that, it would be helpful.

The candidate patch sets in the k8s code can be viewed with something like this:
git log --oneline branch-3.0..0432379f9923768a767566e9ac5a4021cfe8d052 -- resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s

0432379f99 [SPARK-24266][K8S] Restart the watcher when we receive a version changed from k8s
c8f3bd861d [SPARK-31696][K8S] Support driver service annotation in K8S
85dad37f69 [SPARK-31601][K8S] Fix spark.kubernetes.executor.podNamePrefix to work
b8ccd75524 [SPARK-29905][K8S] Improve pod lifecycle manager behavior with dynamic allocation
7699f765f5 [SPARK-31394][K8S] Adds support for Kubernetes NFS volume mounts
ed06d98044 [SPARK-25355][K8S] Add proxy user to driver if present on spark-submit
1254c88034 [SPARK-31118][K8S][DOC] Add version information to the configuration of K8S
d273a2bb0f [SPARK-20628][CORE][K8S] Start to improve Spark decommissioning & preemption support
f9f06eee98 [SPARK-30122][K8S] Support spark.kubernetes.authenticate.executor.serviceAccountName
86fdb818bf [SPARK-30715][K8S] Bump fabric8 to 4.7.1

@stijndehaes can you advise which of these might affect this backport?

@stijndehaes
Contributor

stijndehaes commented Oct 27, 2020

@redsk @jkleckner The error line you are seeing comes from the class ExecutorPodsWatchSnapshotSource, which is somewhere else in the code. I thought there was another mechanism for the driver-to-executor watches.

This also looks like driver logs? The fix here is in the spark-submit application, so you should watch its logs instead of the driver's. Can you tell me whether these were driver logs or spark-submit logs?

@dongjoon-hyun
Member

Hi, @jkleckner and all.
Are there any updates on this PR?

@jkleckner
Author

@dongjoon-hyun Since I don't have a test environment to try out or observe the problem mentioned by @redsk, it will have to be taken up by someone like @redsk.

I am fully on a critical path for the next many weeks, and the backport to 2.4 is working fine for us, so I can't spend any more time on this now.

And as mentioned by @stijndehaes, this could very well be a separate issue from the one that the fix for 24266 addresses.

I did spend some time on this last week and thought that the patch sets that upgrade fabric8, which included fixes, might plausibly explain the problem seen by @redsk.

I could revert the one-line patch requested by @holdenk, as you asked, but I feel caught in the middle and would prefer that you both agree.

…ged from k8s

Restart the watcher when it fails with an HTTP_GONE code from the Kubernetes API, which means a resource version has changed.

For more relevant information see here: fabric8io/kubernetes-client#1075

No

Running spark-submit to a k8s cluster.

Not sure how to make an automated test for this. If someone can help me out that would be great.

Closes apache#28423 from stijndehaes/bugfix/k8s-submit-resource-version-change.

Address review comment to fully qualify import scala.util.control

Rebase on branch-3.0 to fix SparkR integration test.
@jkleckner
Author

I reverted the SparkR change per your instruction and also rebased to branch-3.0.

@jkleckner jkleckner force-pushed the backport-SPARK-24266-to-branch-3.0 branch from 0605745 to 9015c82 Compare November 3, 2020 05:40
@dongjoon-hyun
Member

Thanks, @jkleckner. For the SparkR part, we can ignore it in this PR if it fails again. Disabling it should be handled by a separate PR if we do it in branch-3.0.

> I reverted the SparkR per your instruction and also rebased to branch-3.0.

@SparkQA

SparkQA commented Nov 3, 2020

Test build #130558 has finished for PR 29533 at commit 9015c82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 3, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35158/

@SparkQA

SparkQA commented Nov 3, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35158/

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, all!

Merged to branch-3.0 for Apache Spark 3.0.2.

I also hit the same ExecutorPodsWatchSnapshotSource issue reported by @redsk, without this patch, on the latest branch-3.0. I agree that it is a separate watcher issue.

dongjoon-hyun pushed a commit that referenced this pull request Nov 3, 2020
… changed from k8s

### What changes were proposed in this pull request?

This is a straight application of #28423 onto branch-3.0

Restart the watcher when it fails with an HTTP_GONE code from the Kubernetes API, which means a resource version has changed.

For more relevant information see here: fabric8io/kubernetes-client#1075

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This was tested in #28423 by running spark-submit to a k8s cluster.

Closes #29533 from jkleckner/backport-SPARK-24266-to-branch-3.0.

Authored-by: Stijn De Haes <stijndehaes@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Member

@redsk Could you file a new JIRA for your observation, please?

@jkleckner
Author

@dongjoon-hyun Thank you for the merge and hanging in there. Hopefully @shockdm will revive the 2.4 branch fix kicked off in #29496

@redsk
Contributor

redsk commented Nov 4, 2020

@dongjoon-hyun I've created SPARK-33349 as requested. Thanks

@dongjoon-hyun
Member

Thank you so much, @redsk !

jkleckner pushed a commit to jkleckner/spark that referenced this pull request Nov 6, 2020
… changed from k8s

This is a backport of apache#29533 from master.

It includes the shockdm/pull/1 which has been squashed and the import review
comment include.

It has also been rebased to branch-2.4
jkleckner pushed a commit to jkleckner/spark that referenced this pull request Nov 10, 2020
… changed from k8s

This is a backport of apache#29533 from master.

It includes the shockdm/pull/1 which has been squashed and the import review
comment include.

It has also been rebased to branch-2.4

Address review comments.
jkleckner pushed a commit to jkleckner/spark that referenced this pull request Nov 10, 2020
… changed from k8s

This is a backport of apache#29533 from master.

It includes the shockdm/pull/1 which has been squashed and the import review
comment include.

It has also been rebased to branch-2.4

Address review comments.
zanitete added a commit to nagra-insight/spark that referenced this pull request May 9, 2023