
Race issue after node reboot #1221

Open
SchSeba opened this issue Feb 1, 2024 · 25 comments · May be fixed by #1213

Comments

@SchSeba
Contributor

SchSeba commented Feb 1, 2024

Hi, it looks like there is a race in Multus after a node reboot that can prevent pods from starting:

kubectl -n kube-system logs -f kube-multus-ds-ml62q -c install-multus-binary
cp: cannot create regular file '/host/opt/cni/bin/multus-shim': Text file busy

The problem appears mainly after a reboot: crio calls multus-shim to start pods, but the Multus pod itself cannot start because its init container fails to cp the shim.
The copy fails because crio has already invoked the existing shim, which is stuck waiting to communicate with the Multus daemon pod.

[root@virtual-worker-0 centos]# lsof /opt/cni/bin/multus-shim
COMMAND    PID USER  FD   TYPE DEVICE SIZE/OFF     NODE NAME
multus-sh 8682 root txt    REG  252,1 46760102 46241656 /opt/cni/bin/multus-shim
[root@virtual-worker-0 centos]# ps -ef | grep mult
root        8682     936  0 16:27 ?        00:00:00 /opt/cni/bin/multus-shim
root        9082    7247  0 16:28 pts/0    00:00:00 grep --color=auto mult
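
For context on why the plain cp fails: Linux returns ETXTBSY ("Text file busy") when a file is opened for writing while some process is executing it. Below is a minimal sketch reproducing that outside Kubernetes (the /tmp paths and the use of sleep as a stand-in binary are made up for illustration). cp -f sidesteps the error because it removes the destination and creates a fresh file, and a rename-based copy avoids it entirely:

cp /bin/sleep /tmp/demo-shim
/tmp/demo-shim 600 &                       # simulate the shim that crio is already running

cp /bin/sleep /tmp/demo-shim               # fails: cannot create regular file ... Text file busy
cp -f /bin/sleep /tmp/demo-shim            # works: -f unlinks the busy file, then creates a new one

cp /bin/sleep /tmp/demo-shim.tmp           # alternatively, write to a temp name on the same filesystem
mv -f /tmp/demo-shim.tmp /tmp/demo-shim    # and rename over the busy file (never hits ETXTBSY)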
@SchSeba
Contributor Author

SchSeba commented Feb 1, 2024

[root@virtual-worker-0 centos]# ps -ef | grep 942
root         942       1  5 17:07 ?        00:00:00 /usr/bin/crio
root        1246     942  0 17:07 ?        00:00:00 /opt/cni/bin/multus-shim
root        2745    2395  0 17:08 pts/0    00:00:00 grep --color=auto 942

from crio:

from CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (delete): netplugin failed with no error message: signal: killed

@SchSeba
Contributor Author

SchSeba commented Feb 1, 2024

Just an update: adding -f to the copy command looks like it fixes the issue.

@rrpolanco

rrpolanco commented Feb 2, 2024

Coincidentally, we also saw this error crop up yesterday with one of our edge clusters after rebooting.

@adrianchiris
Contributor

adrianchiris commented Feb 4, 2024

As an FYI, I see that different deployment YAMLs use different ways to copy the CNI binary in the init container:

The first one [1] uses install_multus, which copies files in an atomic manner; the latter [2] just uses cp.
(install_multus supports both thick and thin plugin types.)

I'm not sure, though, that copying the file atomically will solve the above issue.

see:
[1]

command: ["/install_multus"]

and
[2]
https://github.com/k8snetworkplumbingwg/multus-cni/blob/8e5060b9a7612044b7bf927365bbdbb8f6cde451/deployments/multus-daemonset-thick.yml#L199C9-L204C46

Also, deployments/multus-daemonset-crio.yml does not use an init container.
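
For reference, the "atomic manner" boils down to writing the new binary under a temporary name on the same filesystem and then renaming it over the destination. A rough shell sketch of the idea (not the actual install_multus code; paths taken from the thick-plugin daemonset):

SRC=/usr/src/multus-cni/bin/multus-shim
DST=/host/opt/cni/bin/multus-shim

cp "$SRC" "$DST.tmp"       # write the new shim next to the old one
mv -f "$DST.tmp" "$DST"    # rename(2) swaps the directory entry atomically

Since the rename never opens the existing file for writing, a shim that crio is already executing keeps its old inode, so the rename itself does not produce "Text file busy".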

@dougbtv
Member

dougbtv commented Feb 15, 2024

This should hopefully be addressed with #1213

@kfox1111

Saw this in minikube today. No rebooting, just starting up a new minikube cluster.

@dougbtv
Member

dougbtv commented Apr 2, 2024

I also got a reproduction after rebooting a node and having multus restart.

I mitigated it by deleting /opt/cni/bin/multus-shim, but, yeah, I'll retest with the above patch

[fedora@labkubedualhost-master-1 whereabouts]$ watch -n1 kubectl get pods -A -o wide
[fedora@labkubedualhost-master-1 whereabouts]$ kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/master/deployments/multus-daemonset-thick.yml
customresourcedefinition.apiextensions.k8s.io/network-attachment-definitions.k8s.cni.cncf.io created
clusterrole.rbac.authorization.k8s.io/multus created
clusterrolebinding.rbac.authorization.k8s.io/multus created
serviceaccount/multus created
configmap/multus-daemon-config created
daemonset.apps/kube-multus-ds created
[fedora@labkubedualhost-master-1 whereabouts]$ watch -n1 kubectl get pods -A -o wide
[fedora@labkubedualhost-master-1 whereabouts]$ kubectl logs kube-multus-ds-fzdcr -n kube-system
Defaulted container "kube-multus" out of: kube-multus, install-multus-binary (init)
Error from server (BadRequest): container "kube-multus" in pod "kube-multus-ds-fzdcr" is waiting to start: PodInitializing

@dustinrouillard

It seems I can make this happen any time I ungracefully restart a node, worker or master: it creates this error and completely stops pod network sandbox recreation on that node.

The fix mentioned above does work, but it means a node power outage will require manual intervention that would not be needed without Multus; this error should be handled properly.

@kfox1111

+1. This seems like a pretty serious issue. Can we get a fix merged for it soon, please?

@tomroffe

tomroffe commented Jun 7, 2024

I can additionally confirm this behavior. As @dougbtv mentioned, removing /opt/cni/bin/multus-shim works as a workaround.

@ulbi

ulbi commented Jun 15, 2024

+1, happened to me as well; the cluster did not come up. Any chance of fixing this soon?

@stefb69

stefb69 commented Jun 18, 2024

Same here, on a Kubespray 1.29 cluster.

@javen-yan

This certainly needs to be fixed right away.

@haiwu

haiwu commented Aug 16, 2024

@dougbtv: Hit exactly the same issue. Deleting /opt/cni/bin/multus-shim helps.

When could this be fixed?

@reski-rukmantiyo

Hit the same issue with kube-ovn; already posted it there (kubeovn/kube-ovn#4470).
It only happens when I force delete the kube-ovn pod.

@adampetrovic

Also hit me today on a node that crashed.

Any indication that this fix is going to be picked up any time soon?

@iSenne

iSenne commented Oct 6, 2024

Had the same problem today on a Talos Kubernetes cluster. I modified the kube-multus-ds init container to check for an existing multus-shim file.

Original command

command:
  - cp
  - /usr/src/multus-cni/bin/multus-shim
  - /host/opt/cni/bin/multus-shim

New command

command:
 - sh
 - -c
 - |
   if [ ! -f /host/opt/cni/bin/multus-shim ]; then
     cp /usr/src/multus-cni/bin/multus-shim /host/opt/cni/bin/multus-shim;
   fi

This worked for me 👍

@reski-rukmantiyo

Thanks @iSenne, I'm already using this in my Kubernetes cluster.
Hopefully it will fix the issue.

@adampetrovic

adampetrovic commented Oct 6, 2024

> Had the same problem today on a Talos Kubernetes cluster. I modified the kube-multus-ds init container to check for an existing multus-shim file [...] This worked for me 👍

cp -f is arguably more correct.

Upgrading Multus would leave you with an old shim file if you only check for existence.
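
A sketch of an init container command that takes both points into account (paths as in the thick-plugin daemonset; the temp-file name is made up): it always overwrites, so an image upgrade never leaves a stale shim behind, and it replaces the file via an atomic rename, so a shim that crio is already executing never causes "Text file busy".

command:
  - sh
  - -c
  - |
    cp /usr/src/multus-cni/bin/multus-shim /host/opt/cni/bin/multus-shim.tmp
    mv -f /host/opt/cni/bin/multus-shim.tmp /host/opt/cni/bin/multus-shim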

@oujonny

oujonny commented Oct 8, 2024

Hey, we are also really blocked by this issue. What can we do to push this forward?

@adampetrovic

An immediate mitigation that will get Multus running temporarily is to edit the DaemonSet directly and modify the cp command to add -f.

kubectl edit DaemonSet/multus -n <namespace>

1. Scroll to the multus-installer initContainer
2. Edit the cp command and add -f
3. Cycle the pods
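
A non-interactive equivalent of that edit, as a sketch (it assumes the upstream thick-plugin manifest, where the DaemonSet is kube-multus-ds in kube-system and install-multus-binary is the first init container; adjust the names for your install):

kubectl -n kube-system patch daemonset kube-multus-ds --type=json -p='[
  {"op": "replace",
   "path": "/spec/template/spec/initContainers/0/command",
   "value": ["cp", "-f", "/usr/src/multus-cni/bin/multus-shim", "/host/opt/cni/bin/multus-shim"]}
]'

Changing the pod template triggers a rolling update, so the pods are cycled automatically; only a pod already stuck in Init may still need to be deleted by hand.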

@kfox1111

kfox1111 commented Oct 8, 2024

I think the concern at this point for folks wanting to use Multus is not about having a "workaround", but the seeming inability to get a fix upstreamed, which leads to questions about the health of the Multus project.

#1213 for example, has been open since Jan 18, and hasn't gotten any comment since Aug 12.

Please don't see this comment as knocking the devs' hard work. It is very much appreciated, really. I'm just trying to gauge the health of the project.

@dustinrouillard

So crazy this has been ignored by maintainers this long. 🙄

@tjwallace

tjwallace commented Oct 15, 2024

FYI: I made a PR to add the -f flag to the angelnu/multus chart

@kub3let

kub3let commented Nov 24, 2024

This has been bothering me for quite some time. Whenever I do node maintenance, the whole cluster does not come up and I have to:

# on each node
rm /opt/cni/bin/multus-shim

# afterwards delete the multus pods so they get redeployed
# (assuming the default app=multus label from the upstream daemonset)
kubectl -n kube-system delete pod -l app=multus
