Spark submit in operator fails #1277

Closed
LeonardAukea opened this issue Jun 4, 2021 · 21 comments
LeonardAukea commented Jun 4, 2021

Hi all, I seem to be having some issues getting a Spark application up and running, hitting errors like this:

21/06/04 07:42:53 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
21/06/04 07:42:53 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create]  for kind: [Pod]  with name: [null]  in namespace: [my-ns]  failed. 

I have Istio on the cluster, hence I also tried the following settings, to no avail:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: my-ns
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v3.1.1"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-v3.1.1.jar"
  sparkVersion: "3.1.1"
  batchScheduler: "volcano"   #Note: the batch scheduler name must be specified with `volcano`
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"     
    labels:
      version: 3.1.1
    annotations:
      sidecar.istio.io/inject: "false"            
    serviceAccount: default-editor
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"    
    labels:
      version: 3.1.1
    annotations:
      sidecar.istio.io/inject: "false"        
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

So somehow it seems like the application is not able to communicate with the Kubernetes API. The default-editor SA has the following rules:

- apiGroups:
  - sparkoperator.k8s.io
  resources:
  - sparkapplications
  - scheduledsparkapplications
  - sparkapplications/status
  - scheduledsparkapplications/status
  verbs:
  - '*'
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["*"]  

I also added an AuthorizationPolicy to allow traffic for the webhook & operator:

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
 name: spark-operator
 namespace: spark
spec:
 selector:
   matchLabels:
     app.kubernetes.io/name: spark-operator
 rules:
 - {}

If anyone has seen this before or has any valuable pointers, that would be much appreciated.

k8s: 1.19
version: "v1beta2-1.2.3-3.1.1"
chart: 1.1.3
istio: 1.19

This PROTOCOL_ERROR might also be a pointer towards the underlying issue:

 at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
  at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
  at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:349)
  at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:84)
  at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:139)
  at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
  at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
  at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2611)
  at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
  at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
  at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
  at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
  at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
  at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: okhttp3.internal.http2.StreamResetException: stream was reset: PROTOCOL_ERROR
@LeonardAukea (Author)

I've been trying to get DEBUG logs out of the driver in the hope of gaining more insight into the issue by setting:

spec:
  sparkConfigMap: log4j-props

and generating the ConfigMap using:

configMapGenerator:
  - files:
      - config/log4j.properties
    name: log4j-props
generatorOptions:
  disableNameSuffixHash: true

But I can't get that to work either:

**Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties**
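
For context, a config/log4j.properties along these lines is the kind of file that would raise the Kubernetes client logging to DEBUG (a minimal sketch, assuming the log4j 1.x format bundled with Spark 3.1):

# Minimal log4j.properties sketch (log4j 1.x, as shipped with Spark 3.1).
# Keeps the root logger at INFO and raises the fabric8 client and the
# Spark K8s submit code to DEBUG.
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.io.fabric8.kubernetes.client=DEBUG
log4j.logger.org.apache.spark.deploy.k8s=DEBUG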

LeonardAukea (Author) commented Jun 4, 2021

I found that for k8s 1.19.1, the kubernetes-client has to be version >= 4.13.1 (see the compatibility matrix). Looking at the deps in the Spark 3.1 branch, I see the following:

https://github.com/apache/spark/blob/252dfd961189923e52304413036e0051346ee8e1/dev/deps/spark-deps-hadoop-3.2-hive-2.3#L170

So kubernetes-client 4.12.0 is used. To confirm, it seems that Spark does not yet support k8s 1.19. It would be great if someone could verify this.

@LeonardAukea (Author)

The issue remains even after testing with a Spark build from master. I got debug logs set up as well for further details:

21/06/09 08:31:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/06/09 08:31:21 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
21/06/09 08:31:21 DEBUG Config: Trying to configure client from Kubernetes config...
21/06/09 08:31:21 DEBUG Config: Did not find Kubernetes config at: [/root/.kube/config]. Ignoring.
21/06/09 08:31:21 DEBUG Config: Trying to configure client from service account...
21/06/09 08:31:21 DEBUG Config: Found service account host and port: 100.64.0.1:443
21/06/09 08:31:21 DEBUG Config: Found service account ca cert at: [/var/run/secrets/kubernetes.io/serviceaccount/ca.crt].
21/06/09 08:31:21 DEBUG Config: Found service account token at: [/var/run/secrets/kubernetes.io/serviceaccount/token].
21/06/09 08:31:21 DEBUG Config: Trying to configure client namespace from Kubernetes service account namespace path...
21/06/09 08:31:21 DEBUG Config: Found service account namespace at: [/var/run/secrets/kubernetes.io/serviceaccount/namespace].
21/06/09 08:31:21 DEBUG Config: Trying to configure client from Kubernetes config...
21/06/09 08:31:21 DEBUG Config: Did not find Kubernetes config at: [/root/.kube/config]. Ignoring.
21/06/09 08:31:21 DEBUG Config: Trying to configure client from service account...
21/06/09 08:31:21 DEBUG Config: Found service account host and port: 100.64.0.1:443
21/06/09 08:31:21 DEBUG Config: Found service account ca cert at: [/var/run/secrets/kubernetes.io/serviceaccount/ca.crt].
21/06/09 08:31:21 DEBUG Config: Found service account token at: [/var/run/secrets/kubernetes.io/serviceaccount/token].
21/06/09 08:31:21 DEBUG Config: Trying to configure client namespace from Kubernetes service account namespace path...
21/06/09 08:31:21 DEBUG Config: Found service account namespace at: [/var/run/secrets/kubernetes.io/serviceaccount/namespace].
21/06/09 08:31:21 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
21/06/09 08:31:21 DEBUG UserGroupInformation: hadoop login
21/06/09 08:31:21 DEBUG UserGroupInformation: hadoop login commit
21/06/09 08:31:21 DEBUG UserGroupInformation: using local user:UnixPrincipal: root
21/06/09 08:31:21 DEBUG UserGroupInformation: Using user: "UnixPrincipal: root" with name root
21/06/09 08:31:21 DEBUG UserGroupInformation: User entry: "root"
21/06/09 08:31:21 DEBUG UserGroupInformation: UGI loginUser:root (auth:SIMPLE)
21/06/09 08:31:21 DEBUG HadoopDelegationTokenManager: Using the following builtin delegation token providers: hadoopfs, hbase.
21/06/09 08:31:21 INFO KubernetesClientUtils: Spark configuration files loaded from Some(/opt/spark/conf) : log4j.properties
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create]  for kind: [Pod]  with name: [null]  in namespace: [my-ns]  failed.

LeonardAukea changed the title from "Spark application can't reach the kubernetesAPI (driver)" to "Spark submit in operator fails in separate namespace" on Jun 9, 2021
LeonardAukea changed the title from "Spark submit in operator fails in separate namespace" to "Spark submit in operator fails" on Jun 9, 2021
@LeonardAukea (Author)

So the issue is related to fabric8io/kubernetes-client#2212 (comment)

In order to make it work, we had to add the following to the spark-operator, driver & executor:

        env:
        - name: HTTP2_DISABLE # https://github.com/fabric8io/kubernetes-client/issues/2212#issuecomment-628551315
          value: "true"   


sel-vcc commented Jun 9, 2021

fabric8io/kubernetes-client#3176 (comment) is a good write-up of the root cause.

In short, fabric8's kubernetes-client cannot communicate with a Kubernetes API server where the weak TLS cipher TLS_RSA_WITH_AES_256_GCM_SHA384 has been disabled. Disabling HTTP/2 is a workaround.

@slachiewicz

@LeonardAukea could you try running with the KUBERNETES_TLS_VERSIONS env variable set to TLSv1.2,TLSv1.3?
I expect the kubernetes-client is currently using TLS 1.2 only, while the server/Istio only accepts secure cipher suites and newer clients try to use TLS 1.3 ciphers.
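
Another way to get the variable onto the driver and executor pods is through sparkConf, using Spark's spark.kubernetes.driverEnv.* and spark.executorEnv.* settings; a sketch (note that the submission client inside the operator pod would still need the variable set on the operator Deployment itself):

spec:
  sparkConf:
    # Spark forwards these as environment variables to the driver and
    # executor pods, where the fabric8 client picks them up.
    "spark.kubernetes.driverEnv.KUBERNETES_TLS_VERSIONS": "TLSv1.2,TLSv1.3"
    "spark.executorEnv.KUBERNETES_TLS_VERSIONS": "TLSv1.2,TLSv1.3"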

@nnringit

Thanks @slachiewicz, setting KUBERNETES_TLS_VERSIONS=TLSv1.2,TLSv1.3 also worked.

@DoniyorTuremuratov

@slachiewicz @nnringit I am facing the same error when submitting a Spark app to Kubernetes. Could you please tell me where I should change or add KUBERNETES_TLS_VERSIONS=TLSv1.2,TLSv1.3?


satyamsah commented Mar 11, 2022

Hi, I tried both options:

1. Changing the TLS version in spark-operator, driver, and executor with the env variable:

        env:
        - name: KUBERNETES_TLS_VERSIONS
          value: "TLSv1.2"

2. Setting the env variable HTTP2_DISABLE="true" in spark-operator, driver, and executor:

        env:
        - name: HTTP2_DISABLE
          value: "true"

But neither option resolved the issue. Can someone suggest what I am missing?

JunaidChaudry (Contributor) commented Mar 27, 2023

@LeonardAukea @DoniyorTuremuratov @slachiewicz I am also facing the same issue with the latest spark-operator... I tried setting both the KUBERNETES_TLS_VERSIONS and HTTP2_DISABLE env variables in the operator, driver, and executor, but none of them seem to work. Is there any other recommended approach that I can try?

For what it's worth, it might be related to the fact that the spark-operator still ships with kubernetes-client version 4.12.0, which only provides full support up to Kubernetes 1.18, with minimal support up to 1.22 and no support for versions 1.23+ (see the compatibility matrix).

root@spark-operator-674c5dc89f-htl6p:/opt/spark/work-dir# ls ../jars | grep kubernetes
kubernetes-client-4.12.0.jar
kubernetes-model-admissionregistration-4.12.0.jar
kubernetes-model-apiextensions-4.12.0.jar
kubernetes-model-apps-4.12.0.jar
kubernetes-model-autoscaling-4.12.0.jar
kubernetes-model-batch-4.12.0.jar
kubernetes-model-certificates-4.12.0.jar
kubernetes-model-common-4.12.0.jar
kubernetes-model-coordination-4.12.0.jar
kubernetes-model-core-4.12.0.jar
kubernetes-model-discovery-4.12.0.jar
kubernetes-model-events-4.12.0.jar
kubernetes-model-extensions-4.12.0.jar
kubernetes-model-metrics-4.12.0.jar
kubernetes-model-networking-4.12.0.jar
kubernetes-model-policy-4.12.0.jar
kubernetes-model-rbac-4.12.0.jar
kubernetes-model-scheduling-4.12.0.jar
kubernetes-model-settings-4.12.0.jar
kubernetes-model-storageclass-4.12.0.jar
spark-kubernetes_2.12-3.1.1.jar

The latest Kubernetes version is 1.26, and Spark 3.3.0 even supports kubernetes-client 5.12.2. Is there a way to at least make sure that the spark-operator uses kubernetes-client 5.12.2, and try with that to see if it fixes the issue?

Below is my error for visibility:

Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create]  for kind: [Pod]  with name: [null]  in namespace: [spark-operator]  failed.
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:349)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:84)
        at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:139)
        at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
        at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
        at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2611)
        at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
        at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.SocketTimeoutException: timeout
        at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:672)
        at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:680)
        at okhttp3.internal.http2.Http2Stream.takeHeaders(Http2Stream.java:153)
        at okhttp3.internal.http2.Http2Codec.readResponseHeaders(Http2Codec.java:131)
        at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:135)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at io.fabric8.kubernetes.client.utils.OIDCTokenRefreshInterceptor.intercept(OIDCTokenRefreshInterceptor.java:41)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:151)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
        at okhttp3.RealCall.execute(RealCall.java:93)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:490)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:451)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:252)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:879)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:341)
        ... 14 more
23/03/27 16:15:03 INFO ShutdownHookManager: Shutdown hook called
23/03/27 16:15:03 INFO ShutdownHookManager: Deleting directory /tmp/spark-f18eea0e-6437-4444-b2fd-e429aedbf6b6

@harshal-zetaris

@JunaidChaudry I'm stuck exactly where you are. Did you get a solution to this problem?

@harshal-zetaris

@LeonardAukea Can you specify how one would add an env var to the Spark Operator? I've added the HTTP2_DISABLE var to the driver and executor config, but it has had no effect. How did you add it to the operator itself?

@JunaidChaudry (Contributor)

@harshal-zetaris did you enable webhooks? I had to enable webhooks and configure webhook.port to be 443 instead of the default 8000.

@JunaidChaudry (Contributor)

I had the webhooks enabled, but didn't have the port configured. This solved it: #1708 (comment)
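
For reference, the Helm values behind that fix look roughly like this (value names follow the spark-operator chart's webhook.* settings; double-check them against your chart version):

# values.yaml fragment for the spark-operator chart (a sketch).
# EKS security groups typically only allow the control plane to reach
# worker nodes on port 443 unless extra rules are added, which is likely
# why moving the webhook off its default port helps here.
webhook:
  enable: true
  port: 443

The same can be passed at install time, e.g. helm upgrade --install spark-operator ... --set webhook.enable=true --set webhook.port=443.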

@harshal-zetaris

Wow! That worked, @JunaidChaudry. However, I'm confused as to why.

I literally spun up a whole new EKS cluster just in March this year and have been using it as our official QA cluster. Deployments there are still going as smooth as butter.

I suddenly started running into precisely this problem after I spun up another cluster a couple of days back. The interesting thing is that deployments on the old cluster are still working fine.

I read through the conversation in your linked issue, and indeed a new version of the node AMI was released on May 1, after which this issue started manifesting.

Thank you so much for your help.

@JunaidChaudry (Contributor)

I am in the exact same boat as you. It has something to do with the AWS AMI update that was received in late March. I had multiple EKS clusters, with the webhook working out of the box on all of them... UNTIL I restarted my EKS nodes and they started running with the newer AWS AMI version.
I did confirm that it was unrelated to the actual Kubernetes version (all versions were behaving the same).

@gangahiremath

@JunaidChaudry @harshal-zetaris @satyamsah, any luck with a fix for the SocketTimeoutException/K8SClientException?

@dimensie

@JunaidChaudry hi, I hit the same error: "Operation: [create] for kind: [Pod] with name: [null] in namespace: [spark-operator] failed". I didn't use Helm to install the operator; instead I pulled the operator image and loaded it onto our container platform. I'm not sure whether I have the webhook enabled. Do you have any idea? Thanks.


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Oct 15, 2024