
Cluster operator network Degraded is True with RolloutHung: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-7j854 is in CrashLoopBackOff State #7265

Closed
srinivasabb opened this issue Jun 21, 2023 · 11 comments

Comments

@srinivasabb

Version: release 4.13 branch

Environment:
Azure VM

Error from Console:
[root@sriniedgeonokdbootstrap bin]# ./openshift-install create cluster
? Platform azure
INFO Credentials loaded from file "/root/.azure/osServicePrincipal.json"
? Region westeurope
? Base Domain edgeaiokd.clusters.openshiftcorp.com
? Cluster Name edgeclusterokd
? Pull Secret [? for help] ******************************
INFO Creating infrastructure resources...
INFO Waiting up to 20m0s (until 1:52AM) for the Kubernetes API at https://api.edgeclusterokd.edgeaiokd.clusters.openshiftcorp.com:6443...
INFO API v1.26.4-2835+7d221229dc9796-dirty up
INFO Waiting up to 30m0s (until 2:09AM) for bootstrapping to complete...
INFO Pulling VM console logs
INFO Pulling debug logs from the bootstrap machine
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
ERROR Cluster operator network Degraded is True with RolloutHung: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-7j854 is in CrashLoopBackOff State
ERROR DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-kllcn is in CrashLoopBackOff State
ERROR DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-xw7hg is in CrashLoopBackOff State
ERROR DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2023-06-21T05:39:21Z
INFO Cluster operator network ManagementStateDegraded is False with :
INFO Cluster operator network Progressing is True with Deploying: DaemonSet "/openshift-multus/network-metrics-daemon" is waiting for other operators to become ready
INFO DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 3 nodes)
INFO DaemonSet "/openshift-network-diagnostics/network-check-target" is waiting for other operators to become ready
INFO Deployment "/openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
INFO Deployment "/openshift-cloud-network-config-controller/cloud-network-config-controller" is waiting for other operators to become ready
INFO Deployment "/openshift-multus/multus-admission-controller" is waiting for other operators to become ready
ERROR Bootstrap failed to complete: timed out waiting for the condition
ERROR Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
WARNING The bootstrap machine is unable to resolve API and/or API-Int Server URLs
INFO Error: error while checking pod status: timed out waiting for the condition
INFO Using /opt/openshift/auth/kubeconfig as KUBECONFIG
INFO Gathering cluster resources ...
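
For reference, a minimal set of commands to inspect the hung rollout from the bootstrap host. This assumes the kubeconfig path shown above; the pod name is taken from the error output and will differ per cluster:

export KUBECONFIG=/opt/openshift/auth/kubeconfig

# Check the degraded cluster operator and the hung DaemonSet
oc get clusteroperator network
oc -n openshift-ovn-kubernetes get daemonset ovnkube-node
oc -n openshift-ovn-kubernetes get pods -o wide

# See why the pod keeps crashing (events plus logs from the previous attempt)
oc -n openshift-ovn-kubernetes describe pod ovnkube-node-7j854
oc -n openshift-ovn-kubernetes logs ovnkube-node-7j854 --all-containers --previous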


OVN Pod Logs

[root@sriniedgeonokdbootstrap auth]# kubectl logs ovnkube-node-7j854 -n openshift-ovn-kubernetes -f
Defaulted container "ovn-controller" out of: ovn-controller, ovn-acl-logging, kube-rbac-proxy, kube-rbac-proxy-ovn-metrics, ovnkube-node, drop-icmp
2023-06-21T05:39:34+00:00 - starting ovn-controller
2023-06-21T05:39:34Z|00001|vlog|INFO|opened log file /var/log/ovn/acl-audit-log.log
2023-06-21T05:39:34.525Z|00002|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2023-06-21T05:39:34.525Z|00003|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)
2023-06-21T05:39:35.527Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2023-06-21T05:39:35.527Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)
2023-06-21T05:39:35.527Z|00006|reconnect|INFO|unix:/var/run/openvswitch/db.sock: waiting 2 seconds before reconnect
2023-06-21T05:39:37.529Z|00007|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2023-06-21T05:39:37.529Z|00008|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)
2023-06-21T05:39:37.529Z|00009|reconnect|INFO|unix:/var/run/openvswitch/db.sock: waiting 4 seconds before reconnect
2023-06-21T05:39:41.531Z|00010|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2023-06-21T05:39:41.531Z|00011|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)
2023-06-21T05:39:41.531Z|00012|reconnect|INFO|unix:/var/run/openvswitch/db.sock: continuing to reconnect in the background but suppressing further logging
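
The repeated "No such file or directory" on /var/run/openvswitch/db.sock suggests ovsdb-server (and with it ovs-vswitchd) never came up on the node. A rough way to confirm that from the bootstrap host, assuming oc debug access to the node (the node name is a placeholder, and the unit names are the usual RHCOS ones):

oc debug node/<node-name> -- chroot /host /bin/bash -c \
  'systemctl --no-pager status openvswitch ovsdb-server ovs-vswitchd; \
   journalctl --no-pager -u ovsdb-server -u ovs-vswitchd | tail -n 50'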

Please find the tar file with the bootstrap logs attached:

log-bundle-20230620151915.tar.gz

@srinivasabb
Author

Hi, any update on this issue?

@ArthurVardevanyan

Error:

ovs-ctl[1473]: id: 'openvswitch': no such user

I was able to finish the installation by following:
https://access.redhat.com/solutions/3494661

I had to do this on the masters and workers.

I also had to override the MCP config annotations on the three masters in order to finish the installation.
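
For readers who cannot open the linked KB article: the "id: 'openvswitch': no such user" failure means the openvswitch system account was never created, so this class of workaround boils down to creating it manually on each affected node. A rough sketch only; the supplementary hugetlbfs group and the service restart are assumptions here, and the exact accounts should match what systemd-sysusers would have assigned on your image:

# oc debug node/<node-name>, then: chroot /host
groupadd -r hugetlbfs 2>/dev/null || true
groupadd -r openvswitch
useradd -r -g openvswitch -G hugetlbfs -d / -s /sbin/nologin \
        -c "Open vSwitch Daemons" openvswitch
systemctl restart openvswitch

The MCP annotation override mentioned above is a separate, environment-specific step and isn't covered by this sketch.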

@srinivasabb
Author

srinivasabb commented Jun 27, 2023

Hi @ArthurVardevanyan, thanks for the comment. Can you let me know the steps you followed to solve this? The link [https://access.redhat.com/solutions/3494661] doesn't show any fix. Did you face this on 4.13? Also, my worker nodes didn't come up at all.

@knthm

knthm commented Aug 30, 2023

I have this problem deploying RHCOS 4.14-9.2 on libvirt.

The root cause seems to be the systemd-sysusers service configuration:

Aug 30 08:59:12 clust-hxhnn-master-0 systemd-sysusers[735]: Creating group 'hugetlbfs' with GID 978.
Aug 30 08:59:12 clust-hxhnn-master-0 systemd-sysusers[735]: Creating group 'openvswitch' with GID 977.
Aug 30 08:59:12 clust-hxhnn-master-0 systemd-sysusers[735]: Creating group 'unbound' with GID 976.
Aug 30 08:59:12 clust-hxhnn-master-0 systemd-sysusers[735]: Creating user 'openvswitch' (Open vSwitch Daemons) with UID 977 and GID 977.
Aug 30 08:59:12 clust-hxhnn-master-0 systemd-sysusers[735]: Creating user 'unbound' (Unbound DNS resolver) with UID 976 and GID 976.
Aug 30 08:59:12 clust-hxhnn-master-0 systemd-sysusers[735]: /etc/gshadow: Group "unbound" already exists.
Aug 30 08:59:12 clust-hxhnn-master-0 systemd[1]: systemd-sysusers.service: Main process exited, code=exited, status=1/FAILURE
Aug 30 08:59:12 clust-hxhnn-master-0 systemd[1]: systemd-sysusers.service: Failed with result 'exit-code'.
Aug 30 08:59:12 clust-hxhnn-master-0 systemd[1]: Failed to start systemd-sysusers.service - Create System Users.

There are duplicate entries in /usr/lib/sysusers.d for the unbound and openvswitch services, causing the service to terminate before it was able to create the necessary users and groups.

I'm guessing this is a regression from RHCOS moving to the current Fedora 38 base: the RHCOS rpm-ostree configuration and the Fedora 38 openvswitch and unbound package configurations are clashing.
This will most likely need to be resolved upstream in the RHCOS build configuration.
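
If someone wants to verify this on a node, a quick way to spot the clashing fragments and re-check the unit (run from a debug shell on the host; exact file names will vary by image):

# Which sysusers fragments declare the clashing accounts?
grep -rl -e openvswitch -e unbound /usr/lib/sysusers.d/

# The merged configuration systemd-sysusers would apply, filtered to the offenders
systemd-sysusers --cat-config | grep -E 'openvswitch|unbound|hugetlbfs'

# State of the failed unit after cleaning up duplicates
systemctl --no-pager status systemd-sysusers.service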

@zhengxiaomei123

(Quoting @knthm's comment above about the duplicate /usr/lib/sysusers.d entries breaking systemd-sysusers.)

I hit the same error:

Oct 19 03:29:26 test1-mkqrw-master-0 systemd-sysusers[738]: Creating group 'hugetlbfs' with GID 978.
Oct 19 03:29:26 test1-mkqrw-master-0 systemd-sysusers[738]: Creating group 'openvswitch' with GID 977.
Oct 19 03:29:26 test1-mkqrw-master-0 systemd-sysusers[738]: Creating group 'unbound' with GID 976.
Oct 19 03:29:26 test1-mkqrw-master-0 systemd-sysusers[738]: Creating user 'openvswitch' (Open vSwitch Daemons) with UID 977 and GID 977.
Oct 19 03:29:26 test1-mkqrw-master-0 systemd-sysusers[738]: Creating user 'unbound' (Unbound DNS resolver) with UID 976 and GID 976.
Oct 19 03:29:26 test1-mkqrw-master-0 systemd-sysusers[738]: /etc/gshadow: Group "openvswitch" already exists.
Oct 19 03:29:26 test1-mkqrw-master-0 systemd[1]: systemd-sysusers.service: Main process exited, code=exited, status=1/FAILURE
Oct 19 03:29:26 test1-mkqrw-master-0 systemd[1]: systemd-sysusers.service: Failed with result 'exit-code'.
Oct 19 03:29:26 test1-mkqrw-master-0 systemd[1]: Failed to start systemd-sysusers.service - Create System Users.

I saw your comment here: #7265 (comment)
Any workaround for this? Thanks.

@knthm

knthm commented Oct 20, 2023

@zhengxiaomei123 OpenShift libvirt IPI seems to be more or less unmaintained. This unfortunately isn't reflected in the documentation within this repo.

I ended up installing OpenShift following the bare-metal UPI instructions inside my libvirt environment.

@zhengxiaomei123

(Quoting @knthm's reply above about libvirt IPI being unmaintained and switching to the bare-metal UPI instructions.)

Thanks very much. I will try the bare-metal UPI way too.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the lifecycle/stale label on Jan 21, 2024.
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci bot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 21, 2024.
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci bot closed this as completed on Mar 22, 2024.
Contributor

openshift-ci bot commented Mar 22, 2024

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
