Node not ready after cluster upgrade to 4.7.0-0.okd-2021-08-22-163618 #878
Comments
By "log bundle", we're asking you to use the must-gather tool and post the resulting archive file for download. Details here... https://docs.okd.io/latest/support/gathering-cluster-data.html |
Hi team, I got the same issue with the same version of OKD in AWS. Here is the console output (sorry, I can't find a way to extract the full log from the AWS console):
[console output screenshot]
A bit more console log. It seems the kernel panic is the reason for the reboots:
[console log screenshot]
After the reboot, the node shows "no default route" messages on the console:
[console screenshot]
I have rebooted the node from the AWS Console. It booted fine, so now I'm waiting for the next kernel panic ;)
How do we know it's the same issue? In any case, kernel panics should be reported to the Fedora Bugzilla.
OK, it may be a different case, but with the same symptoms. The difference is that mine is a clean OKD 4.7.0-0.okd-2021-08-22-163618 cluster, not one that came from an upgrade. The main question for me now: is it possible to downgrade the cluster to the previous version? @JaimeMagiera @vrutkovs, should I upload the must-gather here? It's 125 MB of logs.
Yes, please upload a must-gather. Also, please file a Bugzilla, since this is a kernel panic (how-to).
As I found out, it seems related to the following issue from the Fedora CoreOS project. I have replaced the kernel version on all worker nodes of one of our clusters and will keep an eye on it.
This is a critical issue which seems to have impacted many users. I think all of these refer to the same issue?
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close
@openshift-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@gialloguitar how did you replace the kernel version on all worker nodes? Is it using a MachineConfig?
Yes, all new nodes apply the MachineConfig when provisioning.
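The comment above applies the change via a MachineConfig at provisioning time. As a rough illustration of what that change boils down to, here is a minimal sketch of pinning a different kernel on a single Fedora CoreOS node by hand; the kernel version, release, and node name are placeholders, not values taken from this thread:

```sh
# Run from a shell on the affected node (e.g. `oc debug node/<node>` then `chroot /host`).
# KVER and KREL are placeholders; substitute the kernel build you actually want.
KVER=5.13.4       # placeholder version
KREL=200.fc34     # placeholder release
ARCH=x86_64

# Download the kernel packages from the Fedora build system
cd /var/tmp
for pkg in kernel kernel-core kernel-modules; do
  curl -LO "https://kojipkgs.fedoraproject.org/packages/kernel/${KVER}/${KREL}/${ARCH}/${pkg}-${KVER}-${KREL}.${ARCH}.rpm"
done

# Replace the kernel shipped with the current Fedora CoreOS deployment
rpm-ostree override replace \
  "./kernel-${KVER}-${KREL}.${ARCH}.rpm" \
  "./kernel-core-${KVER}-${KREL}.${ARCH}.rpm" \
  "./kernel-modules-${KVER}-${KREL}.${ARCH}.rpm"

# Reboot into the new deployment
systemctl reboot
```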
Describe the bug
After the latest cluster upgrade, some worker nodes unexpectedly become NotReady, and only a hard reboot restores them. The affected node becomes unreachable over the network, and all pods from that node get stuck in the Terminating state.
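To check whether you are seeing the same symptom, a minimal sketch of the relevant commands (the node name is a placeholder):

```sh
# List nodes and their readiness; affected workers show NotReady
oc get nodes -o wide

# Pods scheduled on the affected node; they sit in Terminating
oc get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>

# Inspect the node's conditions and recent events
oc describe node <node-name>
```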
Version
OKD 4.7.0-0.okd-2021-08-22-163618
IPI installation with the OpenStack provider
Fedora CoreOS image 34.20210808.3.0
How reproducible
It happens randomly, with one or two nodes at a time. All nodes run on dedicated hypervisors without compute resource overcommitting.
Log bundle
ClusterID: 47c3fb43-737e-4d4f-a992-e304ba338430
ClusterVersion: Stable at "4.7.0-0.okd-2021-08-22-163618"
ClusterOperators:
All healthy and stable
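For reference, this summary can be reproduced directly against the cluster; a minimal sketch:

```sh
# Cluster version, update channel, and history
oc get clusterversion

# Per-operator health, matching the "All healthy and stable" summary above
oc get clusteroperators
```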