[BUG]: CSI PODMON is tainting the worker node #765
Labels
area/csm-resiliency
type/bug
Bug Description
With CSI driver version v2.5.0-1-7e431865, Podmon intermittently applies a taint to worker nodes. The log reports a loss of connectivity for the tainted node, but when we check the storage network we find no loss of connectivity. The problem began with CSI version 2.5.0 and is affecting all of our critical applications.
Logs
logs.zip
Screenshots
More details on what the logs show when the issue occurs.
For example, node wn029 has the taint.
(poaccdcaf01)[ccdadmin@POAcCCDADM01 poaccdcaf01]$ kubectl describe node caf-pool06-poaccdcaf01-wn029 | grep Taint
Taints: vxflexos.podmon.storage.dell.com:NoSchedule
The logs are filled with messages such as:
"csi-vxflexos.dellemc.com":"578CD2B2-EFCB-4332-B7C3-320CB50B6055","topolvm.cybozu.com":"caf-pool06-poaccdcaf01-wn029"}"
time="2023-04-15T03:58:36Z" level=info msg="ValidateVolumeHostConnectivity Node caf-pool06-poaccdcaf01-wn029 NodeId 578CD2B2-EFCB-4332-B7C3-320CB50B6055 Connected false"
time="2023-04-15T03:58:36Z" level=info msg="Pod polcafgp2/eric-data-object-storage-mn-mgt-595f7d74c4-skw9x node caf-pool06-poaccdcaf01-wn029 has no connectivity to arrayID default"
time="2023-04-15T03:58:37Z" level=info msg="Tainting node caf-pool06-poaccdcaf01-wn029 because of connectivity loss"
time="2023-04-15T03:58:37Z" level=info msg="Calling to tainting caf-pool06-poaccdcaf01-wn029 with vxflexos.podmon.storage.dell.com NoSchedule (remove = false)"
time="2023-04-15T03:58:37Z" level=info msg="TaintAlreadyExists : vxflexos.podmon.storage.dell.com on node caf-pool06-poaccdcaf01-wn029"
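To see which nodes Podmon reports as disconnected, and how often, log lines like the ones above can be summarized with a short script. This is a minimal sketch that assumes the log format shown in this report; the function name `disconnected_nodes` and the regular expression are illustrative and not part of Podmon.

```python
import re
from collections import Counter

# Pattern derived only from the "ValidateVolumeHostConnectivity" sample
# lines in this report; the "true" alternative is an assumption.
CONN_RE = re.compile(
    r"ValidateVolumeHostConnectivity Node (?P<node>\S+) "
    r"NodeId (?P<node_id>\S+) Connected (?P<connected>true|false)"
)


def disconnected_nodes(lines):
    """Count how many times each node is logged with 'Connected false'."""
    counts = Counter()
    for line in lines:
        match = CONN_RE.search(line)
        if match and match.group("connected") == "false":
            counts[match.group("node")] += 1
    return counts


sample = [
    'time="2023-04-15T03:58:36Z" level=info msg="ValidateVolumeHostConnectivity '
    'Node caf-pool06-poaccdcaf01-wn029 NodeId 578CD2B2-EFCB-4332-B7C3-320CB50B6055 '
    'Connected false"',
    'time="2023-04-15T03:58:37Z" level=info msg="Tainting node '
    'caf-pool06-poaccdcaf01-wn029 because of connectivity loss"',
]

for node, count in disconnected_nodes(sample).items():
    print(node, count)
```

Running this over the full Podmon log would show whether the false "Connected false" reports cluster on particular nodes or time windows, for example right after a reboot.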
When we go to the node, there is no connectivity issue (validated by pinging the MDM and gateway IPs).
eccd@caf-pool06-poaccdcaf01-wn029:~> ip a s ccd_stgfe
26: ccd_stgfe@bond_ccdstgfe: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
link/ether 5e:a7:f6:b0:00:3f brd ff:ff:ff:ff:ff:ff
inet 172.30.0.69/23 brd x.x.x.x scope global noprefixroute ccd_stgfe
valid_lft forever preferred_lft forever
eccd@caf-pool06-poaccdcaf01-wn029:~> ping x.x.x.x
PING 172.30.0.14 (x.x.x.x) 56(84) bytes of data.
64 bytes from x.x.x.x: icmp_seq=1 ttl=64 time=0.215 ms
^C
--- 172.30.0.14 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.215/0.215/0.215/0.000 ms
eccd@caf-pool06-poaccdcaf01-wn029:~> ping 172.30.0.15
PING 172.30.0.15 (172.30.0.15) 56(84) bytes of data.
64 bytes from 172.30.0.15: icmp_seq=1 ttl=64 time=0.125 ms
64 bytes from 172.30.0.15: icmp_seq=2 ttl=64 time=0.122 ms
^C
--- 172.30.0.15 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1011ms
rtt min/avg/max/mdev = 0.122/0.123/0.125/0.001 ms
caf-pool06-poaccdcaf01-wn029:~ # /bin/emc/scaleio/drv_cfg --query_mdms
Retrieved 1 mdm(s)
MDM-ID 31f837752eaaa40f SDC ID 8dfe8e6700000031 INSTALLATION ID 5790362551a7f62c IPs [0]-172.30.0.15
After the taint is removed manually, it does not come back.
Sometimes we also see the errors below (they should be part of the full logs provided earlier).
time="2023-04-13T17:19:41Z" level=error msg="could not unmarshal driver path key from nodeid annotation {"topolvm.cybozu.com":"pool08-riccdcaf01-wn042"}: to json: unexpected end of JSON input"
time="2023-04-13T17:19:44Z" level=error msg="could not unmarshal driver path key from nodeid annotation {"topolvm.cybozu.com":"-ricaf01-wn038"}: to json: unexpected end of JSON input"
time="2023-04-13T17:19:47Z" level=error msg="could not unmarshal driver path key from nodeid annotation {"topolvm.cybozu.com":"-pool09-ricaf01-wn047"}: to json: unexpected end of JSON input"
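The annotations quoted in these errors contain only the `topolvm.cybozu.com` key, whereas the fragment in the earlier log excerpt also carries a `csi-vxflexos.dellemc.com` entry. Whether that difference causes the unmarshal error is not confirmed, but it is easy to check for. The sketch below is an illustration under assumptions: the helper name `driver_node_id` is hypothetical, and it does not reproduce Podmon's actual parsing code.

```python
import json

# Driver name as it appears in the log excerpts in this report.
DRIVER_KEY = "csi-vxflexos.dellemc.com"


def driver_node_id(annotation, driver_key=DRIVER_KEY):
    """Return the node ID stored under driver_key, or None if it is absent.

    `annotation` is the JSON string value of a node's nodeid annotation.
    """
    try:
        entries = json.loads(annotation)
    except json.JSONDecodeError:
        return None  # empty or malformed annotation
    if not isinstance(entries, dict):
        return None
    return entries.get(driver_key)


# Annotation value copied from the first error message above.
only_topolvm = '{"topolvm.cybozu.com":"pool08-riccdcaf01-wn042"}'

# Hypothetical complete annotation, assembled from the truncated fragment
# in the earlier log excerpt; the real value may differ.
both_drivers = (
    '{"csi-vxflexos.dellemc.com":"578CD2B2-EFCB-4332-B7C3-320CB50B6055",'
    '"topolvm.cybozu.com":"caf-pool06-poaccdcaf01-wn029"}'
)

print(driver_node_id(only_topolvm))
print(driver_node_id(both_drivers))
```

A node for which this returns `None` has no PowerFlex driver entry in its annotation at that moment, which would be worth correlating with the timestamps of the taint events.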
Additional Environment Information
No response
Steps to Reproduce
a. Drain the node and reboot it multiple times.
b. Restarting the CSI pods also sometimes triggers it.
Expected Behavior
CSI driver shouldn’t taint the node if it is healthy.
CSM Driver(s)
mdm_config:v1.2.0-19-98abcff9
podmon:v1.4.0-4-7e431865
csi-attacher:v3.5.0-50-98abcff9
csi-provisioner:v3.2.1-50-98abcff9
csi-snapshotter:v6.0.1-50-98abcff9
csi-resizer:v1.5.0-50-98abcff9
csi-vxflexos:v2.5.0-1-7e431865
Installation Type
Operator (not Helm)
Container Storage Modules Enabled
Podmon (v1.4.0-4-7e431865) is enabled
Container Orchestrator
Ericsson Cloud Container Distribution (ECCD), 2.24.1
Operating System
SUSE Linux Enterprise Server 15 SP4