spark-operator v2.0.2 - listen tcp :443: bind: permission denied #2331

karanalang opened this issue Nov 21, 2024 · 2 comments
Labels
kind/bug Something isn't working

Comments

@karanalang

What happened?

  • ✋ I have searched the open/closed issues and my issue is not listed.
    I'm trying to install spark-operator on k8s (v1.28) and am running into issues:

Command -

helm upgrade --install spark-operator spark-operator/spark-operator \
  --namespace so350 \
  --set image.tag=2.0.2 \
  --create-namespace \
  --set webhook.enable=true \
  --set webhook.port=443 \
  --set webhook.namespaceSelector="spark-webhook-enabled=true" \
  --set webhook.containerSecurityContext.privileged=true \
  --set webhook.containerSecurityContext.capabilities.add[0]=NET_BIND_SERVICE \
  --set logLevel=debug \
  --set enableResourceQuotaEnforcement=true \
  --set webhook.failOnError=true \
  --set controller.resources.limits.cpu=100m \
  --set controller.resources.limits.memory=200Mi \
  --set controller.resources.requests.cpu=50m \
  --set controller.resources.requests.memory=100Mi \
  --set webhook.resources.limits.cpu=100m \
  --set webhook.resources.limits.memory=200Mi \
  --set webhook.resources.requests.cpu=50m \
  --set webhook.resources.requests.memory=100Mi \
  --set "sparkJobNamespaces={spark-apps}" \
  --set webhook.containerSecurityContext.runAsUser=0

The spark-operator controller pod starts, but the webhook pod is failing -

NAME                                             READY   STATUS    RESTARTS      AGE
pod/spark-operator-controller-688c7c9955-tkdpf   1/1     Running   0             3m15s
pod/spark-operator-webhook-567bd94f66-tg567      0/1     Error     5 (94s ago)   3m15s

NAME                                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/spark-operator-webhook-svc   ClusterIP   10.108.242.219   <none>        443/TCP   3m15s

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/spark-operator-controller   1/1     1            1           3m15s
deployment.apps/spark-operator-webhook      0/1     1            0           3m15s

NAME                                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/spark-operator-controller-688c7c9955   1         1         1       3m15s
replicaset.apps/spark-operator-webhook-567bd94f66      1         1         0       3m15s

Logs from webhook pod -

(base) Karans-MacBook-Pro:~ karanalang$ kc logs -f pod/spark-operator-webhook-567bd94f66-tg567  -n so350
++ id -u
+ uid=185
++ id -g
+ gid=185
+ set +e
++ getent passwd 185
+ uidentry=spark:x:185:185::/home/spark:/bin/sh
+ set -e
+ [[ -z spark:x:185:185::/home/spark:/bin/sh ]]
+ exec /usr/bin/tini -s -- /usr/bin/spark-operator webhook start --zap-log-level=info --namespaces=default --webhook-secret-name=spark-operator-webhook-certs --webhook-secret-namespace=so350 --webhook-svc-name=spark-operator-webhook-svc --webhook-svc-namespace=so350 --webhook-port=443 --mutating-webhook-name=spark-operator-webhook --validating-webhook-name=spark-operator-webhook --enable-metrics=true --metrics-bind-address=:8080 --metrics-endpoint=/metrics --metrics-prefix= --metrics-labels=app_type --leader-election=true --leader-election-lock-name=spark-operator-webhook-lock --leader-election-lock-namespace=so350
Spark Operator Version: 2.0.2+HEAD+unknown
Build Date: 2024-10-11T01:46:23+00:00
Git Commit ID: 
Git Tree State: clean
Go Version: go1.23.1
Compiler: gc
Platform: linux/amd64
2024-11-21T20:56:37.838Z	INFO	webhook/start.go:244	Syncing webhook secret	{"name": "spark-operator-webhook-certs", "namespace": "so350"}
2024-11-21T20:56:37.936Z	INFO	webhook/start.go:258	Writing certificates	{"path": "/etc/k8s-webhook-server/serving-certs", "certificate name": "tls.crt", "key name": "tls.key"}
2024-11-21T20:56:38.036Z	INFO	controller-runtime.builder	builder/webhook.go:158	Registering a mutating webhook	{"GVK": "sparkoperator.k8s.io/v1beta2, Kind=SparkApplication", "path": "/mutate-sparkoperator-k8s-io-v1beta2-sparkapplication"}
2024-11-21T20:56:38.036Z	INFO	controller-runtime.webhook	webhook/server.go:183	Registering webhook	{"path": "/mutate-sparkoperator-k8s-io-v1beta2-sparkapplication"}
2024-11-21T20:56:38.036Z	INFO	controller-runtime.builder	builder/webhook.go:189	Registering a validating webhook	{"GVK": "sparkoperator.k8s.io/v1beta2, Kind=SparkApplication", "path": "/validate-sparkoperator-k8s-io-v1beta2-sparkapplication"}
2024-11-21T20:56:38.036Z	INFO	controller-runtime.webhook	webhook/server.go:183	Registering webhook	{"path": "/validate-sparkoperator-k8s-io-v1beta2-sparkapplication"}
2024-11-21T20:56:38.036Z	INFO	controller-runtime.builder	builder/webhook.go:158	Registering a mutating webhook	{"GVK": "sparkoperator.k8s.io/v1beta2, Kind=ScheduledSparkApplication", "path": "/mutate-sparkoperator-k8s-io-v1beta2-scheduledsparkapplication"}
2024-11-21T20:56:38.037Z	INFO	controller-runtime.webhook	webhook/server.go:183	Registering webhook	{"path": "/mutate-sparkoperator-k8s-io-v1beta2-scheduledsparkapplication"}
2024-11-21T20:56:38.037Z	INFO	controller-runtime.builder	builder/webhook.go:189	Registering a validating webhook	{"GVK": "sparkoperator.k8s.io/v1beta2, Kind=ScheduledSparkApplication", "path": "/validate-sparkoperator-k8s-io-v1beta2-scheduledsparkapplication"}
2024-11-21T20:56:38.037Z	INFO	controller-runtime.webhook	webhook/server.go:183	Registering webhook	{"path": "/validate-sparkoperator-k8s-io-v1beta2-scheduledsparkapplication"}
2024-11-21T20:56:38.037Z	INFO	controller-runtime.builder	builder/webhook.go:158	Registering a mutating webhook	{"GVK": "/v1, Kind=Pod", "path": "/mutate--v1-pod"}
2024-11-21T20:56:38.037Z	INFO	controller-runtime.webhook	webhook/server.go:183	Registering webhook	{"path": "/mutate--v1-pod"}
2024-11-21T20:56:38.037Z	INFO	controller-runtime.builder	builder/webhook.go:204	skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called	{"GVK": "/v1, Kind=Pod"}
2024-11-21T20:56:38.037Z	INFO	webhook/start.go:320	Starting manager
2024-11-21T20:56:38.038Z	INFO	controller-runtime.metrics	server/server.go:205	Starting metrics server
2024-11-21T20:56:38.038Z	INFO	controller-runtime.metrics	server/server.go:244	Serving metrics server	{"bindAddress": ":8080", "secure": false}
2024-11-21T20:56:38.039Z	INFO	manager/server.go:50	starting server	{"kind": "health probe", "addr": "[::]:8081"}
2024-11-21T20:56:38.039Z	INFO	controller-runtime.webhook	webhook/server.go:191	Starting webhook server
2024-11-21T20:56:38.039Z	INFO	webhook/start.go:358	disabling http/2
2024-11-21T20:56:38.039Z	INFO	controller-runtime.certwatcher	certwatcher/certwatcher.go:161	Updated current TLS certificate
2024-11-21T20:56:38.040Z	INFO	controller-runtime.certwatcher	certwatcher/certwatcher.go:115	Starting certificate watcher
2024-11-21T20:56:38.040Z	INFO	manager/internal.go:534	Stopping and waiting for non leader election runnables
2024-11-21T20:56:38.040Z	INFO	manager/internal.go:538	Stopping and waiting for leader election runnables
2024-11-21T20:56:38.040Z	INFO	manager/internal.go:546	Stopping and waiting for caches
2024-11-21T20:56:38.040Z	INFO	manager/internal.go:550	Stopping and waiting for webhooks
2024-11-21T20:56:38.040Z	INFO	manager/internal.go:553	Stopping and waiting for HTTP servers
I1121 20:56:38.040581      10 leaderelection.go:250] attempting to acquire leader lease so350/spark-operator-webhook-lock...
2024-11-21T20:56:38.041Z	INFO	manager/server.go:43	shutting down server	{"kind": "health probe", "addr": "[::]:8081"}
2024-11-21T20:56:38.041Z	INFO	controller-runtime.metrics	server/server.go:251	Shutting down metrics server with timeout of 1 minute
2024-11-21T20:56:38.041Z	INFO	manager/internal.go:557	Wait completed, proceeding to shutdown the manager
E1121 20:56:38.041688      10 leaderelection.go:332] error retrieving resource lock so350/spark-operator-webhook-lock: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/so350/leases/spark-operator-webhook-lock": context canceled
2024-11-21T20:56:38.041Z	ERROR	webhook/start.go:322	Failed to start manager	{"error": "listen tcp :443: bind: permission denied"}
github.com/kubeflow/spark-operator/cmd/operator/webhook.start
	/workspace/cmd/operator/webhook/start.go:322
github.com/kubeflow/spark-operator/cmd/operator/webhook.NewStartCommand.func2
	/workspace/cmd/operator/webhook/start.go:128
github.com/spf13/cobra.(*Command).execute
	/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:989
github.com/spf13/cobra.(*Command).ExecuteC
	/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1117
github.com/spf13/cobra.(*Command).Execute
	/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1041
main.main
	/workspace/cmd/main.go:27
runtime.main
	/usr/local/go/src/runtime/proc.go:272

Please note: I had installed v2.0.0-rc.0 and it was working fine; however, I'm running into this issue with v2.0.2.

Please help with this.

thanks!

Reproduction Code

No response

Expected behavior

No response

Actual behavior

No response

Environment & Versions

  • Kubernetes Version: 1.28
  • Spark Operator Version: 2.0.2
  • Apache Spark Version: 3.5

Additional context

No response

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

karanalang added the kind/bug label Nov 21, 2024
@ChenYi015
Contributor

ChenYi015 commented Nov 22, 2024

@karanalang Please use a non-privileged webhook port (the default is 9443) if possible. Otherwise you will need to run as root or modify the security context, because we have removed all capabilities to enhance container security.
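For reference, a minimal sketch of what the first suggestion might look like: the install command from the report, trimmed down, with `webhook.port` set to 9443 and the security-context overrides dropped. The `HELM` variable and the `out` variable are just illustrative helpers, not part of the chart; by default the script only prints the command (dry run), and this has not been verified against a cluster.

```shell
# Dry run by default: HELM expands to "echo helm", so the command is only printed.
# Set HELM=helm in the environment to actually run it against a cluster.
HELM="${HELM:-echo helm}"

# Same release, namespace, and image tag as in the report, but with a
# non-privileged webhook port (9443) and no securityContext overrides.
out=$($HELM upgrade --install spark-operator spark-operator/spark-operator \
  --namespace so350 \
  --create-namespace \
  --set image.tag=2.0.2 \
  --set webhook.enable=true \
  --set webhook.port=9443 \
  --set "sparkJobNamespaces={spark-apps}")

echo "$out"
```

Ports below 1024 are privileged on Linux by default, which is why binding to 443 needs root or the NET_BIND_SERVICE capability while 9443 does not.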

@jacobsalway
Member

Worth noting I think you want webhook.securityContext rather than webhook.containerSecurityContext. I was able to successfully run on Kind with your Helm values once I changed that.

https://github.com/kubeflow/spark-operator/blob/master/charts/spark-operator-chart/templates/webhook/deployment.yaml#L113-L116
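A sketch of that rename applied to the relevant flags from the original command, assuming all three `containerSecurityContext` keys simply move under `webhook.securityContext` (the exact values used on Kind are not shown in this thread). The `HELM` and `out` variables are illustrative helpers only; the script prints the command unless `HELM=helm` is set.

```shell
# Dry run by default: HELM expands to "echo helm", so the command is only printed.
# Set HELM=helm in the environment to actually run it against a cluster.
HELM="${HELM:-echo helm}"

# Keeps webhook.port=443 from the report, but puts the security settings under
# webhook.securityContext (assumption: same sub-keys as in the original command).
# The add[0] key is quoted so the shell does not treat the brackets as a glob.
out=$($HELM upgrade --install spark-operator spark-operator/spark-operator \
  --namespace so350 \
  --create-namespace \
  --set image.tag=2.0.2 \
  --set webhook.enable=true \
  --set webhook.port=443 \
  --set webhook.securityContext.privileged=true \
  --set "webhook.securityContext.capabilities.add[0]=NET_BIND_SERVICE" \
  --set webhook.securityContext.runAsUser=0 \
  --set "sparkJobNamespaces={spark-apps}")

echo "$out"
```

The webhook log in the report shows `uid=185` even though `runAsUser=0` was passed, which is consistent with the `containerSecurityContext` values not reaching the container.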
