
Cluster operator network Degraded is True with RolloutHung: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-7j854 is in CrashLoopBackOff State #7265

Closed
srinivasabb opened this issue Jun 21, 2023 · 11 comments

Comments

@srinivasabb

Version: release 4.13 branch

Environment:
Azure VM

Error from Console:
[root@sriniedgeonokdbootstrap bin]# ./openshift-install create cluster
? Platform azure
INFO Credentials loaded from file "/root/.azure/osServicePrincipal.json"
? Region westeurope
? Base Domain edgeaiokd.clusters.openshiftcorp.com
? Cluster Name edgeclusterokd
? Pull Secret [? for help] ******************************
INFO Creating infrastructure resources...
INFO Waiting up to 20m0s (until 1:52AM) for the Kubernetes API at https://api.edgeclusterokd.edgeaiokd.clusters.openshiftcorp.com:6443...
INFO API v1.26.4-2835+7d221229dc9796-dirty up
INFO Waiting up to 30m0s (until 2:09AM) for bootstrapping to complete...
INFO Pulling VM console logs
INFO Pulling debug logs from the bootstrap machine
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
ERROR Cluster operator network Degraded is True with RolloutHung: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-7j854 is in CrashLoopBackOff State
ERROR DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-kllcn is in CrashLoopBackOff State
ERROR DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-xw7hg is in CrashLoopBackOff State
ERROR DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2023-06-21T05:39:21Z
INFO Cluster operator network ManagementStateDegraded is False with :
INFO Cluster operator network Progressing is True with Deploying: DaemonSet "/openshift-multus/network-metrics-daemon" is waiting for other operators to become ready
INFO DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 3 nodes)
INFO DaemonSet "/openshift-network-diagnostics/network-check-target" is waiting for other operators to become ready
INFO Deployment "/openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
INFO Deployment "/openshift-cloud-network-config-controller/cloud-network-config-controller" is waiting for other operators to become ready
INFO Deployment "/openshift-multus/multus-admission-controller" is waiting for other operators to become ready
ERROR Bootstrap failed to complete: timed out waiting for the condition
ERROR Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
WARNING The bootstrap machine is unable to resolve API and/or API-Int Server URLs
INFO Error: error while checking pod status: timed out waiting for the condition
INFO Using /opt/openshift/auth/kubeconfig as KUBECONFIG
INFO Gathering cluster resources ...
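
For reference, a minimal set of commands to inspect the hung rollout from the bootstrap host. This assumes the kubeconfig path shown above; the pod name is taken from the error output and will differ per cluster:

export KUBECONFIG=/opt/openshift/auth/kubeconfig

# Check the degraded cluster operator and the hung DaemonSet
oc get clusteroperator network
oc -n openshift-ovn-kubernetes get daemonset ovnkube-node
oc -n openshift-ovn-kubernetes get pods -o wide

# See why the pod keeps crashing (events plus logs from the previous attempt)
oc -n openshift-ovn-kubernetes describe pod ovnkube-node-7j854
oc -n openshift-ovn-kubernetes logs ovnkube-node-7j854 --all-containers --previous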


OVN Pod Logs

[root@sriniedgeonokdbootstrap auth]# kubectl logs ovnkube-node-7j854 -n openshift-ovn-kubernetes -f
Defaulted container "ovn-controller" out of: ovn-controller, ovn-acl-logging, kube-rbac-proxy, kube-rbac-proxy-ovn-metrics, ovnkube-node, drop-icmp
2023-06-21T05:39:34+00:00 - starting ovn-controller
2023-06-21T05:39:34Z|00001|vlog|INFO|opened log file /var/log/ovn/acl-audit-log.log
2023-06-21T05:39:34.525Z|00002|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2023-06-21T05:39:34.525Z|00003|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)
2023-06-21T05:39:35.527Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2023-06-21T05:39:35.527Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)
2023-06-21T05:39:35.527Z|00006|reconnect|INFO|unix:/var/run/openvswitch/db.sock: waiting 2 seconds before reconnect
2023-06-21T05:39:37.529Z|00007|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2023-06-21T05:39:37.529Z|00008|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)
2023-06-21T05:39:37.529Z|00009|reconnect|INFO|unix:/var/run/openvswitch/db.sock: waiting 4 seconds before reconnect
2023-06-21T05:39:41.531Z|00010|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2023-06-21T05:39:41.531Z|00011|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)
2023-06-21T05:39:41.531Z|00012|reconnect|INFO|unix:/var/run/openvswitch/db.sock: continuing to reconnect in the background but suppressing further logging
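
The repeated "No such file or directory" on /var/run/openvswitch/db.sock suggests ovsdb-server (and with it ovs-vswitchd) never came up on the node. A rough way to confirm that from the bootstrap host, assuming oc debug access to the node (the node name is a placeholder, and the unit names are the usual RHCOS ones):

oc debug node/<node-name> -- chroot /host /bin/bash -c \
  'systemctl --no-pager status openvswitch ovsdb-server ovs-vswitchd; \
   journalctl --no-pager -u ovsdb-server -u ovs-vswitchd | tail -n 50'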

Please find the tar file with the bootstrap logs attached:

log-bundle-20230620151915.tar.gz

@srinivasabb
Author

Hi, any update on this issue?

@ArthurVardevanyan

Error:

ovs-ctl[1473]: id: 'openvswitch': no such user

I was able to finish the installation by following:
https://access.redhat.com/solutions/3494661

I had to do this on the masters and workers.

I also had to override the MCP config annotations on the three masters in order to finish the installation.
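
For readers who cannot open the linked KB article: the "id: 'openvswitch': no such user" failure means the openvswitch system account was never created, so this class of workaround boils down to creating it manually on each affected node. A rough sketch only; the supplementary hugetlbfs group and the service restart are assumptions here, and the exact accounts should match what systemd-sysusers would have assigned on your image:

# oc debug node/<node-name>, then: chroot /host
groupadd -r hugetlbfs 2>/dev/null || true
groupadd -r openvswitch
useradd -r -g openvswitch -G hugetlbfs -d / -s /sbin/nologin \
        -c "Open vSwitch Daemons" openvswitch
systemctl restart openvswitch

The MCP annotation override mentioned above is a separate, environment-specific step and isn't covered by this sketch.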

@srinivasabb
Author

srinivasabb commented Jun 27, 2023

Hi @ArthurVardevanyan, thanks for the comment. Can you let me know the steps you followed to solve this? The link [https://access.redhat.com/solutions/3494661] doesn't show any fix. Did you face this on 4.13? Also, my worker nodes didn't come up at all.

@knthm

knthm commented Aug 30, 2023

I have this problem deploying RHCOS 4.14-9.2 on libvirt.

The root cause seems to be the systemd-sysusers service configuration:

Aug 30 08:59:12 clust-hxhnn-master-0 systemd-sysusers[735]: Creating group 'hugetlbfs' with GID 978.
Aug 30 08:59:12 clust-hxhnn-master-0 systemd-sysusers[735]: Creating group 'openvswitch' with GID 977.
Aug 30 08:59:12 clust-hxhnn-master-0 systemd-sysusers[735]: Creating group 'unbound' with GID 976.
Aug 30 08:59:12 clust-hxhnn-master-0 systemd-sysusers[735]: Creating user 'openvswitch' (Open vSwitch Daemons) with UID 977 and GID 977.
Aug 30 08:59:12 clust-hxhnn-master-0 systemd-sysusers[735]: Creating user 'unbound' (Unbound DNS resolver) with UID 976 and GID 976.
Aug 30 08:59:12 clust-hxhnn-master-0 systemd-sysusers[735]: /etc/gshadow: Group "unbound" already exists.
Aug 30 08:59:12 clust-hxhnn-master-0 systemd[1]: systemd-sysusers.service: Main process exited, code=exited, status=1/FAILURE
Aug 30 08:59:12 clust-hxhnn-master-0 systemd[1]: systemd-sysusers.service: Failed with result 'exit-code'.
Aug 30 08:59:12 clust-hxhnn-master-0 systemd[1]: Failed to start systemd-sysusers.service - Create System Users.

There are duplicate entries in /usr/lib/sysusers.d for the unbound and openvswitch services, causing the service to terminate before it was able to create the necessary users and groups.

I'm guessing this is a regression from RHCOS moving to the current Fedora 38 base: the RHCOS rpm-ostree configuration and the Fedora 38 openvswitch and unbound package configurations are clashing.
This will most likely need to be resolved upstream in the RHCOS build configuration.
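
If someone wants to verify this on a node, a quick way to spot the clashing fragments and re-check the unit (run from a debug shell on the host; exact file names will vary by image):

# Which sysusers fragments declare the clashing accounts?
grep -rl -e openvswitch -e unbound /usr/lib/sysusers.d/

# The merged configuration systemd-sysusers would apply, filtered to the offenders
systemd-sysusers --cat-config | grep -E 'openvswitch|unbound|hugetlbfs'

# State of the failed unit after cleaning up duplicates
systemctl --no-pager status systemd-sysusers.service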

@zhengxiaomei123

(Quoting @knthm's comment above about the duplicate /usr/lib/sysusers.d entries breaking systemd-sysusers.)

I hit the same error:

Oct 19 03:29:26 test1-mkqrw-master-0 systemd-sysusers[738]: Creating group 'hugetlbfs' with GID 978.
Oct 19 03:29:26 test1-mkqrw-master-0 systemd-sysusers[738]: Creating group 'openvswitch' with GID 977.
Oct 19 03:29:26 test1-mkqrw-master-0 systemd-sysusers[738]: Creating group 'unbound' with GID 976.
Oct 19 03:29:26 test1-mkqrw-master-0 systemd-sysusers[738]: Creating user 'openvswitch' (Open vSwitch Daemons) with UID 977 and GID 977.
Oct 19 03:29:26 test1-mkqrw-master-0 systemd-sysusers[738]: Creating user 'unbound' (Unbound DNS resolver) with UID 976 and GID 976.
Oct 19 03:29:26 test1-mkqrw-master-0 systemd-sysusers[738]: /etc/gshadow: Group "openvswitch" already exists.
Oct 19 03:29:26 test1-mkqrw-master-0 systemd[1]: systemd-sysusers.service: Main process exited, code=exited, status=1/FAILURE
Oct 19 03:29:26 test1-mkqrw-master-0 systemd[1]: systemd-sysusers.service: Failed with result 'exit-code'.
Oct 19 03:29:26 test1-mkqrw-master-0 systemd[1]: Failed to start systemd-sysusers.service - Create System Users.

I saw your comment here: #7265 (comment)
Any workaround for this? Thanks.

@knthm

knthm commented Oct 20, 2023

@zhengxiaomei123 OpenShift libvirt IPI seems to be more or less unmaintained. This unfortunately isn't reflected in the documentation within this repo.

I ended up installing OpenShift following the bare-metal UPI instructions inside my libvirt environment.

@zhengxiaomei123

(Quoting @knthm's reply above about libvirt IPI being unmaintained and switching to the bare-metal UPI instructions.)

Thanks very much. I will try the bare-metal UPI way too.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the lifecycle/stale label on Jan 21, 2024.
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci bot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 21, 2024.
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci bot closed this as completed on Mar 22, 2024.
Contributor

openshift-ci bot commented Mar 22, 2024

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
