calico-kube-controllers pod stuck in not Ready for 13 min #3751
Comments
CC: @lwr20
@hakman do you happen to use a sticky service for the API server?
This is how the service looks on that cluster @fasaxc:
Only one endpoint? Shouldn't it have one per control-plane node?
This is a cluster with a single master.
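For reference, a minimal sketch of how one could inspect the API server Service and its endpoints being discussed here, using standard kubectl commands against the affected cluster (the default `kubernetes` Service in the `default` namespace is assumed):

```sh
# Show the Service fronting the API server and its Endpoints object.
# With a single control-plane node there will be only one endpoint address.
kubectl -n default get service kubernetes -o wide
kubectl -n default get endpoints kubernetes -o yaml

# Check whether the Service is "sticky", i.e. has session affinity enabled.
kubectl -n default get service kubernetes -o jsonpath='{.spec.sessionAffinity}'
```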
I think another user has identified the root cause here: https://github.com/projectcalico/libcalico-go/issues/1267
Nice! Thanks for the update.
I'm going to close this since we're tracking the root cause in https://github.com/projectcalico/libcalico-go/issues/1267
This link does not work: https://github.com/projectcalico/libcalico-go/issues/1267. How do I go about fixing this, @hakman?
I am also not able to see the link.
FWIW, this PR claims to fix https://github.com/projectcalico/libcalico-go/issues/1267:
Yep, the underlying issue was fixed in Calico v3.18 and should be fixed in subsequent releases as well. I can't seem to find the original GH issue link since it was migrated, but that was the fix.
I have searched in vain for that issue link since https://github.com/projectcalico/libcalico-go/issues/1267 moved to https://github.com/projectcalico/calico/issues (there is no issue 1267 there). In desperation, I am reaching out to see if you remember what the fix was ...
kubernetes/client-go#374 was the actual root cause, fixed by kubernetes/kubernetes#95981.
Thank you for your reply. I debated posting this; I am a bit embarrassed since I am new to k8s and unsure whether I would waste your time. I am experiencing the same issue.
1 master and 1 worker, Ubuntu on AWS via kubeadm. I upgraded from 1.22.1-00 to 1.23.1-00 and experienced this right after the upgrade (I have since stopped the instances twice), and have been researching a probable fix for the last few days without any success.
I am sure you won't have time, but is there a forum you could guide me to for research? Thank you in advance.
@RsheikhAii3 I don't think your problem is related to this issue, but I'm sure we'll be able to help you on our Slack: https://slack.projectcalico.org/
@fasaxc Much appreciated, sir; indeed you are correct, and I will pursue it on the Slack channel. Just for documentation purposes: on further searching, the error in the calico-node logs was "address already used, could not bind", on both the master and the worker. Thank you to all of you; I appreciate your time and knowledge when you guide newbies.
Any update on the issue?
@joeybdub As mentioned before, this IS fixed. If you have a similar issue, it's just something that looks similar, nothing more.
Thanks @hakman, there is already an issue open for what I am experiencing: Azure/AKS#2745
@joeybdub The AKS issue seems unrelated. Your best bet is still Slack, where there may be someone more familiar with AKS who can help. Good luck!
In a Kubernetes cluster created with Kops, replacing the master node(s) puts the `calico-kube-controllers` pod in a not Ready state. It recovers on its own after about 13 min, which is quite slow.
Deleting the pod creates a new one that becomes Ready instantly.
Expected Behavior
`calico-kube-controllers` should recover much faster than 13 min.

Current Behavior
`calico-kube-controllers` waits 13 min to recover.

Possible Solution
The simplest generic fix would be to add a liveness probe that automatically restarts the pod; a sketch of what that could look like is below.
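A minimal sketch of such a probe, expressed as a `kubectl patch` against the `calico-kube-controllers` Deployment. The `/usr/bin/check-status -l` health check is what recent Calico manifests use for this container, but treat the path, flags, and timings here as assumptions to verify against your Calico version before applying anything like this:

```sh
# Sketch: add an exec liveness probe so kubelet restarts the pod when the
# controller reports itself unhealthy, instead of waiting ~13 min to recover.
# Assumes the calico/kube-controllers image ships /usr/bin/check-status.
kubectl -n kube-system patch deployment calico-kube-controllers --type strategic -p '
spec:
  template:
    spec:
      containers:
      - name: calico-kube-controllers
        livenessProbe:
          exec:
            command: ["/usr/bin/check-status", "-l"]
          initialDelaySeconds: 10
          periodSeconds: 10
          failureThreshold: 6
'
```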
Steps to Reproduce (for bugs)
1. Create a cluster with `--networking=calico`. This should provide the steps: https://kops.sigs.k8s.io/getting_started/aws/.
2. `kops rolling-update cluster --yes --cloudonly --instance-group master-a --force`
3. Check the `calico-kube-controllers` pod (a watch command is sketched below).
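For step 3, a sketch of how one might watch the pod after the roll; the `k8s-app=calico-kube-controllers` label is the one used by the stock Calico manifests, so adjust it if your cluster labels the Deployment differently:

```sh
# Watch the calico-kube-controllers pod; after the master is replaced it sits
# in 0/1 Ready for roughly 13 minutes before recovering on its own.
kubectl -n kube-system get pods -l k8s-app=calico-kube-controllers -w

# Optionally inspect the events/conditions explaining why it is not Ready.
kubectl -n kube-system describe pod -l k8s-app=calico-kube-controllers
```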
Context
Kops validates the cluster based on the status of the `kube-system` pods. This issue prevents the cluster from being upgraded without manual intervention and also slows it down.

Your Environment