Skip to content
This repository has been archived by the owner on Sep 2, 2024. It is now read-only.

kafka-controller-manager CrashLoopBackOff when unreconciled KafkaSources already exist during startup #494

Open
maschmid opened this issue Mar 31, 2021 · 11 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Issues which should be fixed (post-triage)

Comments

@maschmid
Copy link
Contributor

Describe the bug
With eventing-kafka 0.20

When re-installing eventing and eventing-kafka (in this case using Serverless operator, but the issue could possibly be triggered other ways), kafka-controller-manager sometimes CrashLoopBackOff s

showing

E0331 18:45:15.080496       1 reflector.go:178] knative.dev/pkg/controller/controller.go:671: Failed to list *v1beta1.KafkaSource: conversion webhook for sources.knative.dev/v1alpha1, Kind=KafkaSource failed: Post "https://kafka-source-webhook.knative-eventing.svc:443/resource-conversion?timeout=30s": no endpoints available for service "kafka-source-webhook"

errors and eventually

{"level":"fatal","ts":1617216316.2632017,"logger":"fallback","caller":"injection/injection.go:63","msg":"Failed to start informers","error":"failed to wait for cache at index 6 to sync","stacktrace":"knative.dev/pkg/injection.EnableInjectionOrDie.func1\n\t/go/src/knative.dev/eventing-kafka/vendor/knative.dev/pkg/injection/injection.go:63\nknative.dev/pkg/injection/sharedmain.MainWithConfig\n\t/go/src/knative.dev/eventing-kafka/vendor/knative.dev/pkg/injection/sharedmain/main.go:231\nknative.dev/pkg/injection/sharedmain.MainWithContext\n\t/go/src/knative.dev/eventing-kafka/vendor/knative.dev/pkg/injection/sharedmain/main.go:142\nmain.main\n\t/go/src/knative.dev/eventing-kafka/cmd/source/controller/main.go:192\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}

Expected behavior
kafka-controller-manager should not crash during startup when KafkaSources already exist.

To Reproduce
Install eventing-kafka
Uninstall eventing-kafka (leave the CRD)
Create a few KafkaSources
Install eventing-kafka

Knative release version
0.20.0

Additional context
The problem seems to be that the kafka-controller-manager uses v1beta1 KafkaSource APIs for sources stored as v1alpha1 during startup. Listing as v1beta1 requires conversion, which is supposed to be handled by the very same kafka-controller-manager that is not yet ready. This leads to a vicious cycle.

spec:
  conversion:
    strategy: Webhook
    webhook:
      clientConfig:
        caBundle: ...
        service:
          name: kafka-source-webhook
          namespace: knative-eventing
          path: /resource-conversion
          port: 443
      conversionReviewVersions:
      - v1
      - v1beta1

(kafka-source-webhook points to the kafka-controller-manager)

2021/03/31 18:44:54 Registering 4 clients
2021/03/31 18:44:54 Registering 4 informer factories
2021/03/31 18:44:54 Registering 8 informers
2021/03/31 18:44:54 Registering 7 controllers
{"level":"info","ts":"2021-03-31T18:44:54.832Z","caller":"logging/config.go:116","msg":"Successfully created the logger."}
{"level":"info","ts":"2021-03-31T18:44:54.832Z","caller":"logging/config.go:117","msg":"Logging level set to: info"}
{"level":"info","ts":"2021-03-31T18:44:54.832Z","caller":"logging/config.go:79","msg":"Fetch GitHub commit ID from kodata failed","error":"\"KO_DATA_PATH\" does not exist or is empty"}
{"level":"info","ts":"2021-03-31T18:44:54.832Z","logger":"kafka-controller","caller":"profiling/server.go:64","msg":"Profiling enabled: false"}
{"level":"info","ts":"2021-03-31T18:44:54.837Z","logger":"kafka-controller","caller":"leaderelection/context.go:46","msg":"Running with Standard leader election"}
{"level":"info","ts":"2021-03-31T18:44:54.838Z","logger":"kafka-controller","caller":"binding/controller.go:86","msg":"Setting up event handlers"}
{"level":"info","ts":"2021-03-31T18:44:54.838Z","logger":"kafka-controller","caller":"source/controller.go:67","msg":"Setting up kafka event handlers"}
{"level":"info","ts":"2021-03-31T18:44:54.847Z","logger":"kafka-controller","caller":"sharedmain/main.go:209","msg":"Starting configuration manager..."}
{"level":"info","ts":"2021-03-31T18:44:54.855Z","logger":"kafka-controller","caller":"metrics/metrics_worker.go:76","msg":"Flushing the existing exporter before setting up the new exporter."}
{"level":"info","ts":"2021-03-31T18:44:54.855Z","logger":"kafka-controller","caller":"metrics/prometheus_exporter.go:51","msg":"Created Prometheus exporter with config: &{knative.dev/sources kafka_controller prometheus 5000000000 <nil> <nil>  false 9090 0.0.0.0 false   {   false}}. Start the server for Prometheus exporter."}
{"level":"info","ts":"2021-03-31T18:44:54.855Z","logger":"kafka-controller","caller":"metrics/metrics_worker.go:91","msg":"Successfully updated the metrics exporter; old config: <nil>; new config &{knative.dev/sources kafka_controller prometheus 5000000000 <nil> <nil>  false 9090 0.0.0.0 false   {   false}}"}
{"level":"info","ts":1617216294.9481893,"logger":"fallback","caller":"injection/injection.go:61","msg":"Starting informers..."}
E0331 18:44:54.964481       1 reflector.go:178] knative.dev/pkg/controller/controller.go:671: Failed to list *v1beta1.KafkaSource: conversion webhook for sources.knative.dev/v1alpha1, Kind=KafkaSource failed: Post "https://kafka-source-webhook.knative-eventing.svc:443/resource-conversion?timeout=30s": no endpoints available for service "kafka-source-webhook"
{"level":"warn","ts":"2021-03-31T18:44:55.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:44:55 http: TLS handshake error from 10.130.2.1:39848: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:44:56.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:44:56 http: TLS handshake error from 10.130.2.1:40070: tls: no certificates configured
E0331 18:44:56.523223       1 reflector.go:178] knative.dev/pkg/controller/controller.go:671: Failed to list *v1beta1.KafkaSource: conversion webhook for sources.knative.dev/v1alpha1, Kind=KafkaSource failed: Post "https://kafka-source-webhook.knative-eventing.svc:443/resource-conversion?timeout=30s": no endpoints available for service "kafka-source-webhook"
{"level":"warn","ts":"2021-03-31T18:44:57.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:44:57 http: TLS handshake error from 10.130.2.1:40280: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:44:58.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:44:58 http: TLS handshake error from 10.130.2.1:40490: tls: no certificates configured
E0331 18:44:58.611783       1 reflector.go:178] knative.dev/pkg/controller/controller.go:671: Failed to list *v1beta1.KafkaSource: conversion webhook for sources.knative.dev/v1alpha1, Kind=KafkaSource failed: Post "https://kafka-source-webhook.knative-eventing.svc:443/resource-conversion?timeout=30s": no endpoints available for service "kafka-source-webhook"
{"level":"warn","ts":"2021-03-31T18:44:59.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:44:59 http: TLS handshake error from 10.130.2.1:40706: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:45:00.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:00 http: TLS handshake error from 10.130.2.1:40920: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:45:01.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:01 http: TLS handshake error from 10.130.2.1:41134: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:45:02.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:02 http: TLS handshake error from 10.130.2.1:41372: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:45:03.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:03 http: TLS handshake error from 10.130.2.1:41590: tls: no certificates configured
E0331 18:45:03.466393       1 reflector.go:178] knative.dev/pkg/controller/controller.go:671: Failed to list *v1beta1.KafkaSource: conversion webhook for sources.knative.dev/v1alpha1, Kind=KafkaSource failed: Post "https://kafka-source-webhook.knative-eventing.svc:443/resource-conversion?timeout=30s": no endpoints available for service "kafka-source-webhook"
{"level":"warn","ts":"2021-03-31T18:45:04.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:04 http: TLS handshake error from 10.130.2.1:41790: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:45:05.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:05 http: TLS handshake error from 10.130.2.1:41992: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:45:06.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:06 http: TLS handshake error from 10.130.2.1:42198: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:45:07.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:07 http: TLS handshake error from 10.130.2.1:42396: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:45:08.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:08 http: TLS handshake error from 10.130.2.1:42596: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:45:09.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:09 http: TLS handshake error from 10.130.2.1:42794: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:45:10.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:10 http: TLS handshake error from 10.130.2.1:42996: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:45:11.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:11 http: TLS handshake error from 10.130.2.1:43200: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:45:12.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:12 http: TLS handshake error from 10.130.2.1:43404: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:45:13.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:13 http: TLS handshake error from 10.130.2.1:43628: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:45:14.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:14 http: TLS handshake error from 10.130.2.1:43830: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:45:14.245Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:14 http: TLS handshake error from 10.130.2.1:43852: tls: no certificates configured
E0331 18:45:15.080496       1 reflector.go:178] knative.dev/pkg/controller/controller.go:671: Failed to list *v1beta1.KafkaSource: conversion webhook for sources.knative.dev/v1alpha1, Kind=KafkaSource failed: Post "https://kafka-source-webhook.knative-eventing.svc:443/resource-conversion?timeout=30s": no endpoints available for service "kafka-source-webhook"
{"level":"warn","ts":"2021-03-31T18:45:15.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:15 http: TLS handshake error from 10.130.2.1:44032: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:45:15.245Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:15 http: TLS handshake error from 10.130.2.1:44058: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:45:16.099Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:16 http: TLS handshake error from 10.130.2.1:44238: tls: no certificates configured
{"level":"warn","ts":"2021-03-31T18:45:16.245Z","logger":"kafka-controller","caller":"webhook/webhook.go:201","msg":"server key missing"}
2021/03/31 18:45:16 http: TLS handshake error from 10.130.2.1:44262: tls: no certificates configured
{"level":"fatal","ts":1617216316.2632017,"logger":"fallback","caller":"injection/injection.go:63","msg":"Failed to start informers","error":"failed to wait for cache at index 6 to sync","stacktrace":"knative.dev/pkg/injection.EnableInjectionOrDie.func1\n\t/go/src/knative.dev/eventing-kafka/vendor/knative.dev/pkg/injection/injection.go:63\nknative.dev/pkg/injection/sharedmain.MainWithConfig\n\t/go/src/knative.dev/eventing-kafka/vendor/knative.dev/pkg/injection/sharedmain/main.go:231\nknative.dev/pkg/injection/sharedmain.MainWithContext\n\t/go/src/knative.dev/eventing-kafka/vendor/knative.dev/pkg/injection/sharedmain/main.go:142\nmain.main\n\t/go/src/knative.dev/eventing-kafka/cmd/source/controller/main.go:192\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}
@maschmid maschmid added the kind/bug Categorizes issue or PR as related to a bug. label Mar 31, 2021
@aliok
Copy link
Member

aliok commented Apr 1, 2021

/assign

@aliok
Copy link
Member

aliok commented Apr 1, 2021

I was able to reproduce this one.

Problem is that kafka-controller-manager has both the reconciler and the webhook in it.

The process is something like this:

  • Webhook is started
  • Informers are synced - blocking
  • Controllers are started. 2 things happen here
    • Certificate controller takes care of the webhook certificate
    • Reconcilers for KafkaSource etc. is started

When there are no existing KafkaSource CRs, certificate controller handles the certificate and later when a CR is created, webhook is called fine. Also, informers call the reconciler fine.

When there is an existing KafkaSource CR and you just start the kafka-controller-manager pod, syncing the informers causes the API server to go and call the webhook for KafkaSource conversion. However, webhook's certificate is not yet setup.

I think this only happens when there's a clean uninstall and then install aka when there's no certificate setup in the secret. I don't think it will have an effect visible during any upgrades. The cert will be there in that case.

I can see 2 alternative solutions:

  • Separate the controller and webhook into 2 different components
  • Somehow fix the sharedmain in knative/dev - which is not that easy because of the race issues

@aliok
Copy link
Member

aliok commented Apr 1, 2021

@vaikas @slinkydeveloper mind having a look?

The same problem happens in knative/eventing as well: knative/eventing#4876

@matzew
Copy link
Contributor

matzew commented Apr 26, 2021

@vaikas any comments on the above?

@github-actions
Copy link

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 26, 2021
@cbl315
Copy link

cbl315 commented Nov 1, 2021

Any update to fix this issue?

@lionelvillard
Copy link
Contributor

/remove-lifecycle stale
/reopen

@knative-prow-robot
Copy link
Contributor

@lionelvillard: Reopened this issue.

In response to this:

/remove-lifecycle stale
/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot knative-prow-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 1, 2021
@lionelvillard
Copy link
Contributor

Seems like separating the webhook from the controller should fix this issue.

@github-actions
Copy link

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 31, 2022
@pierDipi
Copy link
Member

/remove-lifecycle stale
/triage accepted

@knative-prow-robot knative-prow-robot added triage/accepted Issues which should be fixed (post-triage) and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 31, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Issues which should be fixed (post-triage)
Projects
None yet
Development

No branches or pull requests

7 participants