Error while running JobSet in different environments #357

Closed
dejanzele opened this issue Dec 20, 2023 · 9 comments

Comments

@dejanzele
Contributor

dejanzele commented Dec 20, 2023

I tried a couple of approaches to run JobSet and all of them failed with various errors.

Context

  • Machine: MacBook Pro M1, macOS Sonoma 14.2
  • Kubernetes distro: kind v0.20.0 go1.20.5 darwin/arm64
  • Kubernetes version: v1.27.3

Approaches

Local - make run

$ make run    
test -s /Users/zele/Projects/gresearch/jobset/bin/controller-gen && /Users/zele/Projects/gresearch/jobset/bin/controller-gen --version | grep -q v0.11.4 || \
        GOBIN=/Users/zele/Projects/gresearch/jobset/bin go install sigs.k8s.io/controller-tools/cmd/controller-gen@v0.11.4
/Users/zele/Projects/gresearch/jobset/bin/controller-gen \
                rbac:roleName=manager-role output:rbac:artifacts:config=config/components/rbac\
                crd:generateEmbeddedObjectMeta=true output:crd:artifacts:config=config/components/crd/bases\
                webhook output:webhook:artifacts:config=config/components/webhook\
                paths="./..."
go fmt ./...
go vet ./...
go run ./main.go
2023-12-20T15:36:10+01:00       INFO    setup   both healthz and readyz check are finished and configured
2023-12-20T15:36:10+01:00       INFO    setup   starting manager
2023-12-20T15:36:10+01:00       INFO    setup   waiting for the cert generation to complete
2023-12-20T15:36:10+01:00       INFO    starting server {"kind": "health probe", "addr": "[::]:8081"}
2023-12-20T15:36:10+01:00       INFO    controller-runtime.metrics      Starting metrics server
2023-12-20T15:36:10+01:00       INFO    controller-runtime.metrics      Serving metrics server  {"bindAddress": ":8080", "secure": false}
2023-12-20T15:36:11+01:00       INFO    cert-rotation   starting cert rotator controller
2023-12-20T15:36:11+01:00       INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *v1.Secret"}
2023-12-20T15:36:11+01:00       INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2023-12-20T15:36:11+01:00       INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2023-12-20T15:36:11+01:00       INFO    Starting Controller     {"controller": "cert-rotator"}
2023-12-20T15:36:11+01:00       ERROR   cert-rotation   could not refresh cert on startup       {"error": "acquiring secret to update certificates: Secret \"jobset-webhook-server-cert\" not found", "errorVerbose": "Secret \"jobset-webhook-server-cert\" not found\nacquiring secret to update certificates\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded.func1\n\t/Users/zele/.gvm/pkgsets/go1.21.5/global/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.0/pkg/rotator/rotator.go:304\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\t/Users/zele/.gvm/pkgsets/go1.21.5/global/pkg/mod/k8s.io/apimachinery@v0.28.4/pkg/util/wait/wait.go:145\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/Users/zele/.gvm/pkgsets/go1.21.5/global/pkg/mod/k8s.io/apimachinery@v0.28.4/pkg/util/wait/backoff.go:461\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded\n\t/Users/zele/.gvm/pkgsets/go1.21.5/global/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.0/pkg/rotator/rotator.go:337\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).Start\n\t/Users/zele/.gvm/pkgsets/go1.21.5/global/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.0/pkg/rotator/rotator.go:265\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/Users/zele/.gvm/pkgsets/go1.21.5/global/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/manager/runnable_group.go:223\nruntime.goexit\n\t/opt/homebrew/opt/go/libexec/src/runtime/asm_arm64.s:1197"}
github.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).Start
        /Users/zele/.gvm/pkgsets/go1.21.5/global/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.0/pkg/rotator/rotator.go:266
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
        /Users/zele/.gvm/pkgsets/go1.21.5/global/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/manager/runnable_group.go:223
2023-12-20T15:36:11+01:00       INFO    cert-rotation   stopping cert rotator controller
2023-12-20T15:36:11+01:00       INFO    Stopping and waiting for non leader election runnables
2023-12-20T15:36:11+01:00       INFO    Starting workers        {"controller": "cert-rotator", "worker count": 1}
2023-12-20T15:36:11+01:00       INFO    Shutdown signal received, waiting for all workers to finish     {"controller": "cert-rotator"}
2023-12-20T15:36:11+01:00       INFO    All workers finished    {"controller": "cert-rotator"}
2023-12-20T15:36:11+01:00       INFO    Stopping and waiting for leader election runnables
2023-12-20T15:36:11+01:00       INFO    Stopping and waiting for caches
E1220 15:36:11.431815   80872 reflector.go:147] pkg/mod/k8s.io/client-go@v0.28.4/tools/cache/reflector.go:229: Failed to watch *v1.Secret: Get "https://34.23.75.206/api/v1/namespaces/jobset-system/secrets?allowWatchBookmarks=true&resourceVersion=236144811&timeoutSeconds=565&watch=true": context canceled
2023-12-20T15:36:11+01:00       INFO    Stopping and waiting for webhooks
2023-12-20T15:36:11+01:00       INFO    Stopping and waiting for HTTP servers
2023-12-20T15:36:11+01:00       INFO    controller-runtime.metrics      Shutting down metrics server with timeout of 1 minute
2023-12-20T15:36:11+01:00       INFO    shutting down server    {"kind": "health probe", "addr": "[::]:8081"}
2023-12-20T15:36:11+01:00       INFO    Wait completed, proceeding to shutdown the manager
2023-12-20T15:36:11+01:00       ERROR   setup   problem running manager {"error": "acquiring secret to update certificates: Secret \"jobset-webhook-server-cert\" not found", "errorVerbose": "Secret \"jobset-webhook-server-cert\" not found\nacquiring secret to update certificates\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded.func1\n\t/Users/zele/.gvm/pkgsets/go1.21.5/global/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.0/pkg/rotator/rotator.go:304\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\t/Users/zele/.gvm/pkgsets/go1.21.5/global/pkg/mod/k8s.io/apimachinery@v0.28.4/pkg/util/wait/wait.go:145\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/Users/zele/.gvm/pkgsets/go1.21.5/global/pkg/mod/k8s.io/apimachinery@v0.28.4/pkg/util/wait/backoff.go:461\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded\n\t/Users/zele/.gvm/pkgsets/go1.21.5/global/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.0/pkg/rotator/rotator.go:337\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).Start\n\t/Users/zele/.gvm/pkgsets/go1.21.5/global/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.0/pkg/rotator/rotator.go:265\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/Users/zele/.gvm/pkgsets/go1.21.5/global/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/manager/runnable_group.go:223\nruntime.goexit\n\t/opt/homebrew/opt/go/libexec/src/runtime/asm_arm64.s:1197"}
main.main
        /Users/zele/Projects/gresearch/jobset/main.go:138
runtime.main
        /opt/homebrew/opt/go/libexec/src/runtime/proc.go:267
exit status 1
make: *** [run] Error 1

kind - config with internalcert

First I run make install, which successfully installs the JobSet CRDs.

After that I run kubectl apply --server-side -f config/default, which fails to create the JobSet Controller Pod:

 $ kgdep
NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
jobset-controller-manager   0/1     0            0           13s

$ kg rs
NAME                                  DESIRED   CURRENT   READY   AGE
jobset-controller-manager-87857fbc9   1         0         0       40s

$ kd rs jobset-controller-manager-87857fbc9
...
...
  Type     Reason        Age                 From                   Message
  ----     ------        ----                ----                   -------
  Warning  FailedCreate  17s (x14 over 58s)  replicaset-controller  Error creating: Internal error occurred: failed calling webhook "mpod.kb.io": failed to call webhook: Post "https://jobset-webhook-service.jobset-system.svc:443/mutate--v1-pod?timeout=10s": dial tcp 10.96.52.173:443: connect: connection refused

$ kgsec
NAME                         TYPE     DATA   AGE
jobset-webhook-server-cert   Opaque   0      88s

$ kvs jobset-webhook-server-cert
Error: secret is empty

It fails to start because the Mutating and Validating Webhook Configuration objects are not patched, and the certificate is never generated (the Secret exists but is empty).
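
Both symptoms can be checked directly against the API server. The following is a minimal sketch, not something from the JobSet repo: the helper names secret_state and cabundle_state and the KUBECTL override variable are made up for illustration, while the Secret, namespace, and webhook configuration names are the ones that appear in this issue.

```shell
# KUBECTL can be pointed at a different binary or a stub; defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# Prints "populated" if the Secret has any data keys, "empty" otherwise
# (an empty or missing Secret both count as "empty").
secret_state() {
  local name="$1" ns="$2" data
  data="$("$KUBECTL" get secret "$name" -n "$ns" -o jsonpath='{.data}' 2>/dev/null | tr -d '[:space:]')"
  if [ -n "$data" ] && [ "$data" != "{}" ]; then echo populated; else echo empty; fi
}

# Prints "patched" if at least one webhook in the configuration carries a
# non-empty caBundle, "unpatched" otherwise.
cabundle_state() {
  local kind="$1" name="$2" bundle
  bundle="$("$KUBECTL" get "$kind" "$name" -o jsonpath='{.webhooks[*].clientConfig.caBundle}' 2>/dev/null | tr -d '[:space:]')"
  if [ -n "$bundle" ]; then echo patched; else echo unpatched; fi
}
```

For the state shown above, `secret_state jobset-webhook-server-cert jobset-system` would be expected to print `empty`, and `cabundle_state validatingwebhookconfiguration jobset-validating-webhook-configuration` would print `unpatched` if no CA bundle has been written yet.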

kind - config with certmanager

If I switch cert management to cert-manager, I get slightly better results, but the controller still doesn't start.

It fails because the cert-controller is looking for a secret under a different name.
The secret name for cert-controller is hardcoded as jobset-webhook-server-cert here: https://github.com/kubernetes-sigs/jobset/blob/main/pkg/util/cert/cert.go#L26

 $ kgpo
NAME                                        READY   STATUS    RESTARTS     AGE
jobset-controller-manager-64cd9fb75-qnmbm   1/2     Running   1 (2s ago)   4s

 $ klo jobset-controller-manager-64cd9fb75-qnmbm                          
2023-12-20T15:03:27Z    INFO    setup   both healthz and readyz check are finished and configured
2023-12-20T15:03:27Z    INFO    setup   waiting for the cert generation to complete
2023-12-20T15:03:27Z    INFO    setup   starting manager
2023-12-20T15:03:27Z    INFO    controller-runtime.metrics      Starting metrics server
2023-12-20T15:03:27Z    INFO    controller-runtime.metrics      Serving metrics server  {"bindAddress": "127.0.0.1:8080", "secure": false}
2023-12-20T15:03:27Z    INFO    starting server {"kind": "health probe", "addr": "[::]:8081"}
I1220 15:03:27.678476       1 leaderelection.go:250] attempting to acquire leader lease jobset-system/6d4f6a47.x-k8s.io...
2023-12-20T15:03:27Z    INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *v1.Secret"}
2023-12-20T15:03:27Z    INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2023-12-20T15:03:27Z    INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2023-12-20T15:03:27Z    INFO    Starting Controller     {"controller": "cert-rotator"}
I1220 15:03:44.318188       1 leaderelection.go:260] successfully acquired lease jobset-system/6d4f6a47.x-k8s.io
2023-12-20T15:03:44Z    DEBUG   events  jobset-controller-manager-64cd9fb75-qnmbm_edf29816-72e4-4895-bb82-1f8080cdf1ee became leader    {"type": "Normal", "object": {"kind":"Lease","namespace":"jobset-system","name":"6d4f6a47.x-k8s.io","uid":"cbac991e-0abc-4f31-a5ef-0c3af862fce5","apiVersion":"coordination.k8s.io/v1","resourceVersion":"2294963"}, "reason": "LeaderElection"}
2023-12-20T15:03:44Z    INFO    cert-rotation   starting cert rotator controller
2023-12-20T15:03:44Z    INFO    Starting workers        {"controller": "cert-rotator", "worker count": 1}
2023-12-20T15:03:44Z    ERROR   cert-rotation   could not refresh cert on startup       {"error": "acquiring secret to update certificates: Secret \"jobset-webhook-server-cert\" not found", "errorVerbose": "Secret \"jobset-webhook-server-cert\" not found\nacquiring secret to update certificates\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded.func1\n\t/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.0/pkg/rotator/rotator.go:304\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\t/go/pkg/mod/k8s.io/apimachinery@v0.28.4/pkg/util/wait/wait.go:145\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/apimachinery@v0.28.4/pkg/util/wait/backoff.go:461\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded\n\t/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.0/pkg/rotator/rotator.go:337\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).Start\n\t/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.0/pkg/rotator/rotator.go:265\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/manager/runnable_group.go:223\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"}
github.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).Start
        /go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.0/pkg/rotator/rotator.go:266
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/manager/runnable_group.go:223
2023-12-20T15:03:44Z    INFO    cert-rotation   stopping cert rotator controller
2023-12-20T15:03:44Z    INFO    Stopping and waiting for non leader election runnables
2023-12-20T15:03:44Z    INFO    Shutdown signal received, waiting for all workers to finish     {"controller": "cert-rotator"}
2023-12-20T15:03:44Z    INFO    All workers finished    {"controller": "cert-rotator"}
2023-12-20T15:03:44Z    INFO    Stopping and waiting for leader election runnables
2023-12-20T15:03:44Z    INFO    Stopping and waiting for caches
2023-12-20T15:03:44Z    INFO    Stopping and waiting for webhooks
2023-12-20T15:03:44Z    INFO    Stopping and waiting for HTTP servers
2023-12-20T15:03:44Z    INFO    controller-runtime.metrics      Shutting down metrics server with timeout of 1 minute
2023-12-20T15:03:44Z    INFO    shutting down server    {"kind": "health probe", "addr": "[::]:8081"}
2023-12-20T15:03:44Z    INFO    Wait completed, proceeding to shutdown the manager
2023-12-20T15:03:44Z    DEBUG   events  jobset-controller-manager-64cd9fb75-qnmbm_edf29816-72e4-4895-bb82-1f8080cdf1ee stopped leading  {"type": "Normal", "object": {"kind":"Lease","namespace":"jobset-system","name":"6d4f6a47.x-k8s.io","uid":"cbac991e-0abc-4f31-a5ef-0c3af862fce5","apiVersion":"coordination.k8s.io/v1","resourceVersion":"2294965"}, "reason": "LeaderElection"}
2023-12-20T15:03:44Z    ERROR   error received after stop sequence was engaged  {"error": "leader election lost"}
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/manager/internal.go:490
2023-12-20T15:03:44Z    ERROR   setup   problem running manager {"error": "acquiring secret to update certificates: Secret \"jobset-webhook-server-cert\" not found", "errorVerbose": "Secret \"jobset-webhook-server-cert\" not found\nacquiring secret to update certificates\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded.func1\n\t/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.0/pkg/rotator/rotator.go:304\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\t/go/pkg/mod/k8s.io/apimachinery@v0.28.4/pkg/util/wait/wait.go:145\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/apimachinery@v0.28.4/pkg/util/wait/backoff.go:461\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded\n\t/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.0/pkg/rotator/rotator.go:337\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).Start\n\t/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.0/pkg/rotator/rotator.go:265\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/manager/runnable_group.go:223\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"}
main.main
        /workspace/main.go:134
runtime.main
        /usr/local/go/src/runtime/proc.go:250
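
Since the rotator looks for a fixed Secret name, one way to spot this kind of mismatch before deploying is to scan the rendered manifests for secretName fields. This is only a rough sketch: the helper name mismatched_secret_names is hypothetical, and it assumes each secretName sits on its own line with no trailing comment.

```shell
# Reads rendered manifests on stdin and prints every secretName value that
# differs from the expected name given as $1 (one per line).
# Prints nothing when all references match.
mismatched_secret_names() {
  sed -n 's/^[[:space:]-]*secretName:[[:space:]]*//p' \
    | tr -d '"' \
    | grep -vxF -- "$1" || true
}
```

A possible invocation, assuming the overlay renders with kubectl's built-in kustomize, is `kubectl kustomize config/default | mismatched_secret_names jobset-webhook-server-cert`. Any name it prints is a Secret reference that differs from the one cert-controller expects.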

kind - release v0.3.0

If I follow the guide at https://github.com/kubernetes-sigs/jobset/blob/main/docs/setup/install.md#install-a-released-version, the objects get created but the Deployment has 0/1 available replicas, and describing the ReplicaSet shows:

Events:
  Type     Reason        Age                From                   Message
  ----     ------        ----               ----                   -------
  Warning  FailedCreate  1s (x12 over 11s)  replicaset-controller  Error creating: Internal error occurred: failed calling webhook "mpod.kb.io": failed to call webhook: Post "https://jobset-webhook-service.jobset-system.svc:443/mutate--v1-pod?timeout=10s": dial tcp 10.96.56.31:443: connect: connection refused

It looks like the Pod Webhooks get created first and then block creation of the Operator Pod, because the Webhooks aren't properly configured yet, i.e. the caBundle isn't patched.
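
To see which webhook is blocking Pod creation without reading the whole describe output, the event text can be filtered. A small sketch follows; the helper name failing_webhooks is made up, and it relies only on the message format visible in the event above.

```shell
# Reads event text on stdin (for example the output of `kubectl describe rs`)
# and prints the name of each webhook that failed to be called, de-duplicated.
failing_webhooks() {
  sed -n 's/.*failed calling webhook "\([^"]*\)".*/\1/p' | sort -u
}
```

For instance, `kubectl describe rs -n jobset-system | failing_webhooks` would print mpod.kb.io for the events in this issue.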

If we examine manifests.yaml from the v0.3.0 release, we can see that the webhook configurations don't carry cert-manager's cert-manager.io/inject-ca-from annotation, and the Operator Pod cannot start and patch them itself because the Pod Webhooks get created first:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: jobset-validating-webhook-configuration
webhooks:
- admissionReviewVersions:
  - v1
  ...


@dejanzele
Contributor Author

cc @kannon92 @danielvegamyhre

@kannon92
Contributor

I don't think we support building jobset on M1/M2 at the moment.

See #237.

@dejanzele
Contributor Author

dejanzele commented Dec 20, 2023

But I also have issues when running it in kind from the v0.3.0 release; see my last section (kind - release v0.3.0).

@kannon92
Contributor

Yea, we don't support building it and we don't have a published arm image for jobset. It would be a good contribution if you are interested in looking into it.

I'll see if I can reproduce your issue.

@dejanzele
Contributor Author

dejanzele commented Dec 20, 2023

Still, it looks strange: I can successfully run v0.3.0 in kind and also build it locally, but I always get issues with certificates in the internal cert-controller.

@kannon92
Contributor

Yea I can see this also now.

Something seems wrong with kind as I don't actually get the jobset controller to start.

I also used kind 0.2.0 and I had issues installing Kueue also.

@kannon92
Contributor

I noticed that I was also seeing issues on amd64 in main. I thought it was related to the latest version of kind, so I opened a PR to update it.

I can see that some jobs fail with this error (the deployment is not present and the control plane logs show a failure to create the webhook).

However, the e2e tests sometimes work, while doing the install manually seems to have issues.

I am going to try installing it on a live cluster today and see if it’s an issue with kind.

@ahg-g
Contributor

ahg-g commented Jan 16, 2024

Do you still have this problem? I think #362 is the fix.

@dejanzele
Contributor Author

I can confirm now that it works on a MacBook Pro M1 (macOS Sonoma 14.2.1) with kind v0.20.0.
It runs fine both locally (local development) and in cluster.

I tried version v0.3.1.

Thanks to everybody involved in fixing this!
