calico-node pods are failing after upgrade from 3.22 to 3.23: felix is not ready: readiness probe reporting 503 #6442
Comments
We ran into this as well, modifying the This did not happen in a recent cluster, but it did happen when we rolled our code back about 6 months and then upgraded a cluster to our current configs. |
Sounds like this might have been introduced by this PR? #5576 Fits the code area as well as the time frame. Perhaps we're not properly handling the difference between |
CC @coutinhop - WDYT? |
@caseydavenport very possibly so, I'm trying to understand exactly how that's happening. Currently taking a look at the logs, will try to figure this out and fix it ASAP. |
@r0bj or @mikesplain could you post the output for It would also help if you could enable debug logging on felix ( I have somewhat of a theory, considering the changes from #5576 make felix decide whether IPIP and/or VXLAN should be enabled based on the encapsulation of the existing IP pools, using the values from FelixConfiguration as overrides for those. Additionally, when the encapsulations do change, felix needs to restart to apply those, and we're seeing it do that multiple times in the log you posted, so it's likely there is a bug that didn't foresee an upgrade. I'd like to see if there's any conflicting configuration that could be causing this restart loop... Thanks! |
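The theory above can be sketched as follows. This is a minimal illustration, not Felix's actual code: the `Pool` type and the `usesEncap` and `resolveEncap` helpers are hypothetical names, and the sketch assumes that an unset mode on a pool should count as "Never". Each encapsulation is enabled if any pool uses it, and an explicit override (standing in for a FelixConfiguration value) wins when present.

```go
package main

import "fmt"

// Pool is a hypothetical, simplified stand-in for an IP pool.
// Only the two encapsulation modes matter for this sketch.
type Pool struct {
	IPIPMode  string // "Never", "Always", "CrossSubnet", or "" when unset
	VXLANMode string // same value set as IPIPMode
}

// usesEncap reports whether a mode string requests encapsulation.
// Assumption: an unset ("") mode is treated the same as "Never".
func usesEncap(mode string) bool {
	return mode != "" && mode != "Never"
}

// resolveEncap derives the enabled encapsulations from the pools,
// then applies the optional overrides. A nil override means "not set".
func resolveEncap(pools []Pool, ipipOverride, vxlanOverride *bool) (ipip, vxlan bool) {
	for _, p := range pools {
		ipip = ipip || usesEncap(p.IPIPMode)
		vxlan = vxlan || usesEncap(p.VXLANMode)
	}
	if ipipOverride != nil {
		ipip = *ipipOverride
	}
	if vxlanOverride != nil {
		vxlan = *vxlanOverride
	}
	return ipip, vxlan
}

func main() {
	// A pool created before VXLANMode existed: the field is simply empty.
	legacy := []Pool{{IPIPMode: "Always"}}
	ipip, vxlan := resolveEncap(legacy, nil, nil)
	fmt.Printf("ipip=%v vxlan=%v\n", ipip, vxlan) // prints "ipip=true vxlan=false"
}
```

If the resolved values differ from what Felix is currently running with, a restart is needed to apply them. A resolver that answered differently for the same pool on each pass would therefore show up as the kind of restart loop seen in the log.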
I'm using calico with etcd datastore (installed from the manifest https://projectcalico.docs.tigera.io/manifests/calico-etcd.yaml) so there are no |
I think you can still get those with |
Sure, this is the data:
|
Thank you all for jumping in! Let me know if there's anything else we can do to help.
and:
|
Thanks @r0bj and @mikesplain! @r0bj, if it's not too much to ask, could you send the full calico-log after enabling |
@caseydavenport do you mean the problem could be that calico/felix/calc/encapsulation_resolver.go Line 201 in 8bfde6e
That line is run at felix startup (in daemon.go) on all IP pools retrieved from the client. I was led to believe that it would never be an empty string (it would be defaulted to "Never" as the comment says): calico/api/pkg/apis/projectcalico/v3/ippool.go Lines 53 to 55 in 8bfde6e
Let's look at the full debug logs to be sure, but do you think changing the check to |
@coutinhop Sure, there is calico-node log with |
Yep, if the pool was created prior to that field existing or created / edited with an older version of calicoctl that isn't aware of the field (or potentially another reason).
^ Can see it's not set here.
and
One thing I'm not sure about - does the cluster use etcd mode or k8s CRD mode? Or are we talking about two different clusters here? Do you recall what the original version of Calico that was installed on these clusters was? Like, the version used when the cluster was originally provisioned? |
Thanks @caseydavenport, so it seems like the issue is indeed that. I'll work on the fix!
@r0bj is using etcd mode and used |
In my case it's etcd mode so I used
Git history for my cluster shows that it was created 6 years ago (it's a bare metal cluster upgraded in-place) with calico in version 1.4.2 at that time. |
Confirming, we are using whatever the default was in kops now. At the time I think it may have been etcd but currently I believe it's crd mode.
Our git history shows the cluster is 3 years old, initially created with kops 1.11.1 & calico v3.3.1. It was installed with this config file: |
Perfect, so that supports the theory that these pools were created prior to VXLANMode being an option and the newest release is just not properly handling that case, so I think @coutinhop's fix for this in #6494 is probably good. @coutinhop it occurs to me that we should look at doing read-time defaulting of that field in case there is any other code that might be hit by the same issue. We should be able to handle that in the Calico client code so any users of the client see "VXLANMode: Never" even if the underlying data doesn't include the field. |
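A minimal sketch of the read-time defaulting idea, assuming simplified stand-in types rather than the real Calico API structs. The `IPPoolSpec` type here and the `defaultVXLANMode` helper are hypothetical and only illustrate the approach: a pool stored without the field is presented to callers as if it had been written with "Never".

```go
package main

import "fmt"

// VXLANMode and its constants are simplified stand-ins for the API values
// discussed in the thread.
type VXLANMode string

const (
	VXLANModeNever       VXLANMode = "Never"
	VXLANModeAlways      VXLANMode = "Always"
	VXLANModeCrossSubnet VXLANMode = "CrossSubnet"
)

// IPPoolSpec is a hypothetical, trimmed-down pool spec.
type IPPoolSpec struct {
	CIDR      string
	VXLANMode VXLANMode
}

// defaultVXLANMode returns a copy of spec in which an unset VXLANMode is
// replaced by "Never". Explicit values are left untouched. The stored data
// is not modified; only what the caller sees changes.
func defaultVXLANMode(spec IPPoolSpec) IPPoolSpec {
	if spec.VXLANMode == "" {
		spec.VXLANMode = VXLANModeNever
	}
	return spec
}

func main() {
	// Pool as it might be read back when it predates the field.
	legacy := IPPoolSpec{CIDR: "192.168.0.0/16"}
	fmt.Println(defaultVXLANMode(legacy).VXLANMode) // prints "Never"

	// An explicitly configured pool is returned unchanged.
	modern := IPPoolSpec{CIDR: "10.0.0.0/16", VXLANMode: VXLANModeCrossSubnet}
	fmt.Println(defaultVXLANMode(modern).VXLANMode) // prints "CrossSubnet"
}
```

Applying the default on every read path, rather than only in Felix, means any other consumer of the client sees a consistent value without needing its own empty-string check.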
@caseydavenport makes sense, will look into doing that! |
@coutinhop looks like we already have a good hook to do this in: calico/libcalico-go/lib/clientv3/ippool.go Lines 243 to 244 in 0bfeb0f
|
Hey, do you plan to cherry-pick the fix to previous releases as well? We just hit it on 3.23, it looks like the fix was merged only into the master branch. |
As per @caseydavenport's comment on the fix PR
|
Hey, I am getting this issue, with these logs:
|
Expected Behavior
Calico is working after upgrade to version 3.23.
Current Behavior
calico-node pods are failing after upgrade from 3.22 to 3.23:
The only logs that are not INFO:
Full log: https://gist.github.com/r0bj/1df72959f5f992efba3544fa5eb89d47
Calico manifest: https://projectcalico.docs.tigera.io/manifests/calico-etcd.yaml
Steps to Reproduce (for bugs)
Your Environment
kubernetes:
Client Version: v1.24.3
Kustomize Version: v4.5.4
Server Version: v1.24.3
Ubuntu 18.04