Upgrading from 1.13 to 1.14 leaves triggers in broken state #4091
Comments
Hi @mdwhitley, thanks for reporting. Can you describe what is present in your cluster? How many Kafka brokers and triggers do you have?
Our 1.13 environment pods:
deployments:
No statefulsets. 3 brokers, 266 triggers (prod will have 2-3x as many triggers in large clusters). Post-install jobs for both knative and kafka extensions complete except for …
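For reference, counts like these can be gathered with standard kubectl queries across all namespaces, for example:

```sh
# Tally Brokers and Triggers cluster-wide
kubectl get brokers -A --no-headers | wc -l
kubectl get triggers -A --no-headers | wc -l
```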
I mean during/after the upgrade process: how are you upgrading the system? With something like this?

```sh
kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.14.10/eventing-kafka-controller.yaml
kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.14.10/eventing-kafka-broker.yaml
kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.14.10/eventing-kafka-post-install.yaml
```
Environment after the 1.14 upgrade. This runs all knative templates + post-install, followed by eventing-kafka templates + post-install. pods:
Note that the controlplane post-install job completed successfully prior to the 2 new dispatcher pods becoming available. Deployments:
statefulsets:
After completion, all 266 triggers have gone unavailable and the eventing cluster is down.
The post-install job is supposed to delete the old dispatcher deployment; a manual deletion should help.
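A minimal sketch of that manual deletion, assuming the default namespace and the pre-1.14 Deployment name kafka-broker-dispatcher:

```sh
# Remove the old (pre-1.14) dispatcher Deployment; the 1.14 dispatcher runs as a StatefulSet
kubectl -n knative-eventing delete deployment kafka-broker-dispatcher
```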
I've been manually deleting the deployments in previous attempts as well. After deleting the deployment:
The eventing cluster is still completely down. This is the 2nd attempt today, and now 263/266 triggers have all failed with …
I also observe on the Kafka side that a large chunk of consumers (presumably Knative) have gone inactive. A couple of triggers are erroring with: …
That is an error we get from the Kafka cluster.
The autoscaler should scale up the statefulset replicas. Are you setting a specific replicas number for the statefulset?
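A quick way to check the current values, assuming the default StatefulSet name kafka-broker-dispatcher:

```sh
# Show desired vs. ready replicas for the dispatcher StatefulSet
kubectl -n knative-eventing get statefulset kafka-broker-dispatcher \
  -o jsonpath='{.spec.replicas} desired / {.status.readyReplicas} ready{"\n"}'
```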
While pulling additional logs there was a burst of activity scaling up the statefulset. I had observed this happen previously as well:
The primary difference is that now I see our triggers becoming ready again. The new dispatcher pods actually came up completely this time, vs. the error reported in my original issue here. This is the result of:
That leaves questions on why the post-install job that should clean up the old dispatcher deployment is silently failing instead of waiting until the statefulset is completely available, and why, after 1.14 is installed and the old dispatcher deployment removed, we have a period of time where our eventing cluster is down.
Our default defines 2 replicas for the statefulset, but it appears it auto-scales itself up to 15. When this happened earlier it blew up our k8s cluster quota, so we had to increase resources. I am interested in why it decided to scale up, because increasing the number of deployments 7x is steep and we likely need a way to control that for production.
The number of replicas depends on the configured pod capacity (see eventing-kafka-broker/control-plane/config/eventing-kafka-broker/200-controller/500-controller.yaml, lines 151 to 153 at 4cf38f1).
Each trigger represents a single "virtual replica", so you get …
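As a rough sketch of that sizing (the capacity of 20 virtual replicas per dispatcher pod used here is an assumption, not a value confirmed from the linked config):

```sh
# replicas ≈ ceil(virtual replicas / pod capacity); each trigger is one virtual replica
TRIGGERS=266
POD_CAPACITY=20
echo $(( (TRIGGERS + POD_CAPACITY - 1) / POD_CAPACITY ))   # -> 14 dispatcher pods
```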
Overnight our 1.14 dev cluster went back to being unavailable. All triggers flipped back to not ready with:
Nothing has changed with our Kafka cluster, and we have several hundred other Kafka consumers running off this same cluster without issue. Is there anything I can look into further to see any error logs from a running pod?
Can you share / look at the …
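For reference, a hedged example of pulling those logs; the deployment name kafka-controller in the knative-eventing namespace is the usual default but may differ in a customized install:

```sh
# Tail the kafka-controller logs from the last hour
kubectl -n knative-eventing logs deployment/kafka-controller --all-containers --since=1h
```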
The controller logs are filled with …
I've tried setting the kafka-controller …
I am not sure what is trying to scale the dispatchers, but it is out of control.
You need to increase …; how many replicas you get for a given number of triggers and pod capacity configuration is also described in #4091 (comment).
Yep, that did it. I increased …
Do you still see the issue?
The cluster is not stable, no. Over the course of the past hour we've now had a handful of triggers become unready with …
I spent some more time trying to track down those errors. I've engaged our Kafka service provider, and their initial analysis of our cluster shows no issues. At this point the issue is directly linked to just Knative and began as part of the 1.14 upgrade attempt. So it is either a Knative 1.14 bug, an error with our configuration of this upgrade, or a combination of both. I've handled nearly all of our upgrades since 1.8.x, so this is very much an outlier in that we can't get anything to work or stay stable for more than a few hours.
Does it ever recover from the EOF errors? How long does it take to recover from them?
As far as I can tell, it does not automatically recover. I believe what I saw yesterday morning, with things initially working after deploying only to slowly have trigger readiness fail, is what happened over an extended period the previous night. Eventually no triggers are available and no events are being received or sent by brokers. I am in the process of setting up a new, separate dev cluster to install 1.13 on and attempt the 1.14 upgrade there for a baseline comparison.
I have reproduced the …
The post-install job that is supposed to clean up old dispatcher deployments still no-ops. Still manually running: …
I do still observe events being able to be sent to both … If it is helpful, I can bundle up our entire config (eventing-core + kafka templates) and provide it internally for review.
After 12 hours the triggers have not changed and are still marked as not ready due to the … I have also experimented with pointing this isolated cluster at our staging Kafka instance. The same error shows up immediately in the kafka-controller and eventing is down. The dispatcher statefulsets, after switching Kafka, will not restart (I gave up waiting after 30 minutes) and are just hung on …
Our staging Kafka instance has fewer total topics/partitions, but is much larger and drives throughput (around 20,000 non-knative events/minute) that is closer to our production clusters.
Backport to speed up mount volume (#4101); I think that will help with the latest issue.
With #4103, once backported to 1.14, we can try to get more debug logs about the Kafka client connections, or disable the new client pooling.
I tried out 1.14.11 on my test cluster and have promising results. I pulled resource updates to our k8s templates from the release, and things immediately began failing with EOF again, but I noticed that the YAML resources didn't have any of the new ENVs. I added them into the controller deployment: …
which has resulted in a stable cluster so far. Deployment of our test broker/trigger/services all came up, and eventing is flowing with none of the previous EOF errors. I do see a few of …
but thus far the triggers and eventing have remained available. I will continue to let this sit for a day in this test environment before I wipe it out, start back with our base 1.13, and try an upgrade path. Is there anything I could get/try for you in the meantime?
After around 1-2 hours the triggers became not ready. The same old EOF errors are spewing in the kafka-controller pods. I modified kafka-controller to set: …
When everything came back up, the triggers went back to a ready state. Another hour or so later and the triggers are again not ready. I've captured the latest logs and sent them along via email.
The EOF on the leaderelection is a little odd.
That is not yet merged and released; in the previous reply I was referring to #4091 (comment):
The same problem is happening in 1.15.
I'd like to point out that my problem stemmed from running Knative Kafka across two different clusters. Each Knative instance resulted in duplicate consumer groups for two different topics. This then resulted in consumers never committing offsets or polling messages, eventually leading to all triggers being taken down. To fix this I changed the template from knative-trigger-{{ .Namespace }}-{{ .Name }} to knative-trigger-{{ .Namespace }}-{{ .Name }}-{{ .Environment }}, which results in unique consumer groups (a sketch of this kind of change is below). I then changed the offset to earliest to make sure we didn't lose any messages, and re-ran our CI/CD pipelines to re-create the triggers after I deleted them. This has resulted in no longer seeing: …
What led me to this path was this log line: …
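For anyone making a similar change, a minimal sketch; the ConfigMap and key names used here (config-kafka-features, triggers.consumergroup.template) are assumptions about where this template lives, and the -dev1 suffix stands in for a per-cluster identifier:

```sh
# Append a cluster-specific suffix so each Knative install gets unique consumer groups
kubectl -n knative-eventing patch configmap config-kafka-features --type merge \
  -p '{"data":{"triggers.consumergroup.template":"knative-trigger-{{ .Namespace }}-{{ .Name }}-dev1"}}'
```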
Our dev configurations are similar to what @elebiodaslingshot described, but we do not have any JOIN_GROUP errors. The closest I have seen is …
which showed up once over the past week and does not appear to correlate with our trigger readiness. That said, the scenario makes a lot of sense. We have, starting very recently, begun running multiple dev Knative clusters off the same Kafka instance. Those were on 1.13 and seem to tolerate this issue, as there is no evidence of it impacting eventing. While some of our triggers are ancient and still use GUID consumer groups, the majority would use the … I've made similar adjustments to include environment details in …
I've been trying to reproduce for a while with varying levels of "scale"; the only thing I found was a rather slow time to recover in the scheduler / autoscaler (see knative/eventing#8200 for more details), but I could not reproduce the EOF errors.
@pierDipi is there an ETA on a new 1.14 release containing #4103? I've continued to test with 1.14 without much change. Initially I did find some different errors when running kafka-controller with debug logging; instead of …
I've attempted to jump to 1.15 to get access to the Sarama logging ENVs, but that fares even worse. I am able to get services, brokers and triggers all up, but no events are flowing through the KE stack. I haven't spent much time digging through that release, but it looks like the changes are sufficient that our existing configs are broken. I've also had some discussions with other internal teams. We have a component that also interfaces with Kafka, written in Go using the Sarama Kafka client. They have taken a look at the periods where KE is receiving …
Note also that there is the template for the topic name, which is based solely on namespace and name (lines 48 to 50 in b02817c).
It will be released shortly, once one of the jobs here succeeds: https://prow.knative.dev/?type=periodic&job=release_eventing-kafka-broker_release-1.14_periodic
I deployed 1.14.12 with sarama debug logging turned on and observed the error when triggers go down after working for a while:
which seems to indicate the Knative/Sarama client is not properly closing/refreshing connections. I have also observed the same …
I also tried with …
which resulted in the latter case: pods coming up with immediate EOF errors and eventing down.
Can you share the leases in the knative-eventing namespace to identify the leader pod we need to focus on?
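One way to list them:

```sh
# List leases and their holders to identify the current leader pod
kubectl -n knative-eventing get leases \
  -o custom-columns=NAME:.metadata.name,HOLDER:.spec.holderIdentity
```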
A few other people are encountering EOF with Sarama 1.43 and Go 1.22 here: IBM/sarama#2938
The issue I'm having is that sarama is never logging the actual error associated with https://github.com/IBM/sarama/blob/893978c87fe7af13a2b6849ba62b003493f97f25/broker.go#L1243-L1252
For now, let's keep this configuration and continue gathering logs and insights, as it helps with reducing the moving parts.
Can you share the Kafka cluster version you're connecting to?
Snapshot of …
Our Kafka clusters are version … I've tried deploying 1.14 with increased req/lim CPU resources for … The …
In cases where the connection has failed, the client should be destroyed and recreated; in this case it looks like it just continues to try to reuse a dead connection. My understanding of our internal components using Go/Sarama is that they required additional resiliency to support production-level scale. Additionally, there may need to be tweaks to the Sarama client configs (timeouts and whatnot) to handle large-scale clusters with hundreds of topics and thousands of partitions like the ones we run. On the subject of large Kafka clusters, we provisioned a brand new Kafka 3.6 cluster today and deployed 1.14.12 using it.
@mdwhitley did you disable the client pool (#4091 (comment))?
Yes. I have tried with … Over the weekend I left the test cluster on 1.14 with the client pool disabled, and all triggers are currently marked as Not Ready/InitializeOffset. One of the three kafka-controller pods is throwing constant …
Sarama error logging issue: IBM/sarama#2994
Thanks for the hints; contributions or specific suggestions are really welcome in this area.
If you disable the client pool, we create a new client each time we reconcile an object.
From IBM/sarama#2981, it seems that EOF is the error you get when there is a protocol mismatch.
That, and a SASL configuration error. If there were a misconfiguration, then I would expect no eventing to ever work, which is simply not the case in our environments. The same config is used everywhere and works, except for Knative 1.14. The exact same config works on 1.13 without any issues with triggers going unready. Beyond the Sarama pool/logging ENVs, are there any ways to influence the Kafka properties used for the Sarama client?
Release 1.14.13, which includes #4125, resolves the issues we have been experiencing. All existing triggers continue to be available/ready, and new triggers can be created successfully.
Describe the bug
Upgrading from knative=1.13.8 and kafka-extensions=1.13.11: knative-eventing related resources update and restart as expected, but the eventing-kafka extension resource upgrade fails:

1. The kafka-broker-dispatcher statefulset starts at 3 replicas and eventually scales up to 15 replicas in our dev environment. Only 3/15 replicas end up with data populated in the corresponding kafka-broker-dispatcher-X configmap resource. I see a continued cycle of the pods sitting in ContainerCreating while they wait for the CM to get created, then throwing an error when they finally start up.
2. The kafka-controller-post-install job completes and outputs no logs.

If (2) is replaced with a manual step to clean up the old dispatcher deployment, all event traffic ceases and triggers go into a failed state: recreation of any trigger does not help. Existing broker resources continue to report ready, but if any are recreated they will also go into a failed state. A describe on any trigger shows a failure on the old, no-longer-existing dispatcher deployment.
Expected behavior
To Reproduce
Steps to reproduce the behavior.
Knative release version
knative=1.13.8 and kafka-extensions=1.13.11
Additional context
Add any other context about the problem here such as proposed priority