[BUG]: CSI PODMON is tainting the worker node #765

kumarnir · 2023-04-17T15:46:05Z

Bug Description

With CSI driver version v2.5.0-1-7e431865, Podmon is randomly tainting worker nodes. The log reports a loss of connectivity for the tainted node, but when we examine the storage network we find no disconnection. The problem began with CSI version 2.5.0 and is affecting all of our critical applications.

Logs

logs.zip

Screenshots

More details on what the logs show when the issue occurs.

For example, node wn029 has the taint:

(poaccdcaf01)[ccdadmin@POAcCCDADM01 poaccdcaf01]$ kubectl describe node caf-pool06-poaccdcaf01-wn029 | grep Taint
Taints: vxflexos.podmon.storage.dell.com:NoSchedule

The logs are filled with messages such as:
"csi-vxflexos.dellemc.com":"578CD2B2-EFCB-4332-B7C3-320CB50B6055","topolvm.cybozu.com":"caf-pool06-poaccdcaf01-wn029"}"
time="2023-04-15T03:58:36Z" level=info msg="ValidateVolumeHostConnectivity Node caf-pool06-poaccdcaf01-wn029 NodeId 578CD2B2-EFCB-4332-B7C3-320CB50B6055 Connected false"
time="2023-04-15T03:58:36Z" level=info msg="Pod polcafgp2/eric-data-object-storage-mn-mgt-595f7d74c4-skw9x node caf-pool06-poaccdcaf01-wn029 has no connectivity to arrayID default"
time="2023-04-15T03:58:37Z" level=info msg="Tainting node caf-pool06-poaccdcaf01-wn029 because of connectivity loss"
time="2023-04-15T03:58:37Z" level=info msg="Calling to tainting caf-pool06-poaccdcaf01-wn029 with vxflexos.podmon.storage.dell.com NoSchedule (remove = false)"
time="2023-04-15T03:58:37Z" level=info msg="TaintAlreadyExists : vxflexos.podmon.storage.dell.com on node caf-pool06-poaccdcaf01-wn029"
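To see how often each node fails the connectivity check, the attached logs could be tallied with a small script. The following is only a sketch: the regular expression is inferred from the sample lines quoted above, and the function name `failed_connectivity_checks` is hypothetical, not part of Podmon.

```python
import re
from collections import Counter

# Field layout inferred from the sample podmon log lines in this report;
# other podmon versions may format these messages differently.
CONNECTIVITY_RE = re.compile(
    r'time="(?P<time>[^"]+)".*ValidateVolumeHostConnectivity Node (?P<node>\S+) '
    r'NodeId (?P<node_id>\S+) Connected (?P<connected>true|false)'
)


def failed_connectivity_checks(lines):
    """Count 'Connected false' results per node name."""
    counts = Counter()
    for line in lines:
        match = CONNECTIVITY_RE.search(line)
        if match and match.group("connected") == "false":
            counts[match.group("node")] += 1
    return counts


# Two of the lines quoted in this report, used as sample input.
sample = [
    'time="2023-04-15T03:58:36Z" level=info msg="ValidateVolumeHostConnectivity '
    'Node caf-pool06-poaccdcaf01-wn029 NodeId 578CD2B2-EFCB-4332-B7C3-320CB50B6055 '
    'Connected false"',
    'time="2023-04-15T03:58:37Z" level=info msg="Tainting node '
    'caf-pool06-poaccdcaf01-wn029 because of connectivity loss"',
]

if __name__ == "__main__":
    for node, count in failed_connectivity_checks(sample).items():
        print(node, count)
```

Comparing the timestamps of the failed checks against node reboots or CSI pod restarts may help confirm whether the taints follow those events.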

When we go to the node, there is no connectivity issue (validated by pinging the MDM and gateway IPs).

eccd@caf-pool06-poaccdcaf01-wn029:> ip a s ccd_stgfe
26: ccd_stgfe@bond_ccdstgfe: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
link/ether 5e:a7:f6:b0:00:3f brd ff:ff:ff:ff:ff:ff
inet 172.30.0.69/23 brd x.x.x.x scope global noprefixroute ccd_stgfe
valid_lft forever preferred_lft forever
eccd@caf-pool06-poaccdcaf01-wn029:> ping x.x.x.x
PING 172.30.0.14 (x.x.x.x) 56(84) bytes of data.
64 bytes from x.x.x.x: icmp_seq=1 ttl=64 time=0.215 ms
^C
--- 172.30.0.14 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.215/0.215/0.215/0.000 ms
eccd@caf-pool06-poaccdcaf01-wn029:~> ping 172.30.0.15
PING 172.30.0.15 (172.30.0.15) 56(84) bytes of data.
64 bytes from 172.30.0.15: icmp_seq=1 ttl=64 time=0.125 ms
64 bytes from 172.30.0.15: icmp_seq=2 ttl=64 time=0.122 ms
^C
--- 172.30.0.15 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1011ms
rtt min/avg/max/mdev = 0.122/0.123/0.125/0.001 ms

caf-pool06-poaccdcaf01-wn029:~ # /bin/emc/scaleio/drv_cfg --query_mdms
Retrieved 1 mdm(s)
MDM-ID 31f837752eaaa40f SDC ID 8dfe8e6700000031 INSTALLATION ID 5790362551a7f62c IPs [0]-172.30.0.15

 Removing the taint manually doesn’t bring it back.
Sometimes we also see the errors below (they should be part of the full logs provided earlier).

time="2023-04-13T17:19:41Z" level=error msg="could not unmarshal driver path key from nodeid annotation {"topolvm.cybozu.com":"pool08-riccdcaf01-wn042"}: to json: unexpected end of JSON input"
time="2023-04-13T17:19:44Z" level=error msg="could not unmarshal driver path key from nodeid annotation {"topolvm.cybozu.com":"-ricaf01-wn038"}: to json: unexpected end of JSON input"
time="2023-04-13T17:19:47Z" level=error msg="could not unmarshal driver path key from nodeid annotation {"topolvm.cybozu.com":"-pool09-ricaf01-wn047"}: to json: unexpected end of JSON input"
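In these error lines the nodeid annotation contains only the `topolvm.cybozu.com` key, whereas the fragment quoted earlier for wn029 also has a `csi-vxflexos.dellemc.com` entry. A check like the one below could flag nodes whose annotation lacks the driver entry. It is a sketch under assumptions: the annotation is taken to be the standard Kubernetes `csi.volume.kubernetes.io/nodeid` annotation, the function name `driver_node_id` is hypothetical, and this is not how Podmon itself reads the annotation.

```python
import json

# Standard Kubernetes node annotation that maps CSI driver names to node IDs.
NODEID_ANNOTATION = "csi.volume.kubernetes.io/nodeid"
# Driver key as it appears in the log fragment quoted in this report.
DRIVER_NAME = "csi-vxflexos.dellemc.com"


def driver_node_id(annotations, driver=DRIVER_NAME):
    """Return the driver's node ID from a node's annotations, or None.

    `annotations` is the node's metadata.annotations mapping. None is
    returned when the annotation is missing, is not a JSON object, or has
    no entry for `driver`.
    """
    raw = annotations.get(NODEID_ANNOTATION)
    if not raw:
        return None
    try:
        entries = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(entries, dict):
        return None
    return entries.get(driver)


# Annotation values copied from the log lines in this report.
healthy = {
    NODEID_ANNOTATION: '{"csi-vxflexos.dellemc.com":"578CD2B2-EFCB-4332-B7C3-320CB50B6055",'
                       '"topolvm.cybozu.com":"caf-pool06-poaccdcaf01-wn029"}'
}
missing = {NODEID_ANNOTATION: '{"topolvm.cybozu.com":"pool08-riccdcaf01-wn042"}'}

if __name__ == "__main__":
    print(driver_node_id(healthy))
    print(driver_node_id(missing))
```

The annotations for a node can be obtained from `kubectl get node <name> -o json` under `metadata.annotations` and passed to the function as a dictionary.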

Additional Environment Information

No response

Steps to Reproduce

a. Drain and multiple reboots.
b. Sometimes restarting csi pods also caused it.

Expected Behavior

CSI driver shouldn’t taint the node if it is healthy.

CSM Driver(s)

mdm_config:v1.2.0-19-98abcff9
podmon:v1.4.0-4-7e431865
csi-attacher:v3.5.0-50-98abcff9
csi-provisioner:v3.2.1-50-98abcff9
csi-snapshotter:v6.0.1-50-98abcff9
csi-resizer:v1.5.0-50-98abcff9
csi-vxflexos:v2.5.0-1-7e431865

Installation Type

Operator (not Helm)

Container Storage Modules Enabled

Podmon (v1.4.0-4-7e431865) is enabled

Container Orchestrator

Ericsson Cloud Container Distribution (ECCD), 2.24.1

Operating System

SUSE Linux Enterprise Server 15 SP4

csmbot · 2023-04-17T15:46:51Z

@kumarnir: Thank you for submitting this issue!

The issue is currently awaiting triage. Please make sure you have given us as much context as possible.

If the maintainers determine this is a relevant issue, they will remove the needs-triage label and assign an appropriate priority label.

We want your feedback! If you have any questions or suggestions regarding our contributing process/workflow, please reach out to us at container.storage.modules@dell.com.

shaynafinocchiaro · 2023-05-31T13:19:25Z

Closing issue as a fix has been provided. Please reopen if the issue persists.

kumarnir added the needs-triage and type/bug labels Apr 17, 2023

sharmilarama added the area/csm-resiliency label Apr 19, 2023

sharmilarama assigned alikdell Apr 19, 2023

alikdell removed the needs-triage label Apr 28, 2023

alikdell mentioned this issue May 17, 2023

Monitor node events for addition-deletion and update pod detail on modify event of pod dell/karavi-resiliency#160

Merged

10 tasks

shaynafinocchiaro added this to the v1.7.0 milestone May 17, 2023

shaynafinocchiaro closed this as completed May 31, 2023
