[BUG]: CSI PODMON is tainting the worker node #765

Closed
kumarnir opened this issue Apr 17, 2023 · 2 comments
Labels: area/csm-resiliency (Issue pertains to the CSM Resiliency module), type/bug (Something isn't working. This is the default label associated with a bug issue.)

@kumarnir

Bug Description

With CSI driver version v2.5.0-1-7e431865, Podmon randomly applies a taint to worker nodes. The log reports that the tainted node has lost connectivity, but when we check the storage network we find no loss of connectivity. The problem began with CSI version 2.5.0 and is affecting all of our critical applications.

Logs

logs.zip

Screenshots

Some more details on what the logs show when the issue occurs.

For example, wn029 has the taint.

(poaccdcaf01)[ccdadmin@POAcCCDADM01 poaccdcaf01]$ kubectl describe node caf-pool06-poaccdcaf01-wn029 | grep Taint
Taints: vxflexos.podmon.storage.dell.com:NoSchedule

The log is filled with messages such as:
"csi-vxflexos.dellemc.com":"578CD2B2-EFCB-4332-B7C3-320CB50B6055","topolvm.cybozu.com":"caf-pool06-poaccdcaf01-wn029"}"
time="2023-04-15T03:58:36Z" level=info msg="ValidateVolumeHostConnectivity Node caf-pool06-poaccdcaf01-wn029 NodeId 578CD2B2-EFCB-4332-B7C3-320CB50B6055 Connected false"
time="2023-04-15T03:58:36Z" level=info msg="Pod polcafgp2/eric-data-object-storage-mn-mgt-595f7d74c4-skw9x node caf-pool06-poaccdcaf01-wn029 has no connectivity to arrayID default"
time="2023-04-15T03:58:37Z" level=info msg="Tainting node caf-pool06-poaccdcaf01-wn029 because of connectivity loss"
time="2023-04-15T03:58:37Z" level=info msg="Calling to tainting caf-pool06-poaccdcaf01-wn029 with vxflexos.podmon.storage.dell.com NoSchedule (remove = false)"
time="2023-04-15T03:58:37Z" level=info msg="TaintAlreadyExists : vxflexos.podmon.storage.dell.com on node caf-pool06-poaccdcaf01-wn029"
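To see how often each node is reported as disconnected, the attached logs can be scanned for the `ValidateVolumeHostConnectivity` lines. The following is a minimal sketch, assuming the log format is exactly as shown above; `ConnEvent` and `ParseConnEvents` are hypothetical names introduced for illustration and are not part of podmon.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// connLine matches the ValidateVolumeHostConnectivity message format seen in
// the log excerpt above (assumed stable; adjust if the format differs).
var connLine = regexp.MustCompile(
	`time="([^"]+)" level=\w+ msg="ValidateVolumeHostConnectivity Node (\S+) NodeId (\S+) Connected (true|false)"`)

// ConnEvent is one parsed connectivity report (hypothetical type).
type ConnEvent struct {
	Time      string
	Node      string
	NodeID    string
	Connected bool
}

// ParseConnEvents extracts connectivity reports from raw podmon log text.
// Lines that do not match the expected format are skipped.
func ParseConnEvents(logText string) []ConnEvent {
	var events []ConnEvent
	for _, line := range strings.Split(logText, "\n") {
		m := connLine.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		events = append(events, ConnEvent{
			Time:      m[1],
			Node:      m[2],
			NodeID:    m[3],
			Connected: m[4] == "true",
		})
	}
	return events
}

func main() {
	sample := `time="2023-04-15T03:58:36Z" level=info msg="ValidateVolumeHostConnectivity Node caf-pool06-poaccdcaf01-wn029 NodeId 578CD2B2-EFCB-4332-B7C3-320CB50B6055 Connected false"
time="2023-04-15T03:58:37Z" level=info msg="Tainting node caf-pool06-poaccdcaf01-wn029 because of connectivity loss"`
	for _, e := range ParseConnEvents(sample) {
		if !e.Connected {
			fmt.Printf("%s %s (%s) reported disconnected\n", e.Time, e.Node, e.NodeID)
		}
	}
}
```

Correlating the timestamps of the `Connected false` events with reboots or CSI pod restarts may help narrow down when the false reports begin.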

When we go to the node, there is no connectivity issue (validated by pinging the MDM and gateway IPs).

eccd@caf-pool06-poaccdcaf01-wn029:> ip a s ccd_stgfe
26: ccd_stgfe@bond_ccdstgfe: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
link/ether 5e:a7:f6:b0:00:3f brd ff:ff:ff:ff:ff:ff
inet 172.30.0.69/23 brd x.x.x.x scope global noprefixroute ccd_stgfe
valid_lft forever preferred_lft forever
eccd@caf-pool06-poaccdcaf01-wn029:~> ping x.x.x.x
PING 172.30.0.14 (x.x.x.x) 56(84) bytes of data.
64 bytes from x.x.x.x: icmp_seq=1 ttl=64 time=0.215 ms
^C
--- 172.30.0.14 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.215/0.215/0.215/0.000 ms
eccd@caf-pool06-poaccdcaf01-wn029:~> ping 172.30.0.15
PING 172.30.0.15 (172.30.0.15) 56(84) bytes of data.
64 bytes from 172.30.0.15: icmp_seq=1 ttl=64 time=0.125 ms
64 bytes from 172.30.0.15: icmp_seq=2 ttl=64 time=0.122 ms
^C
--- 172.30.0.15 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1011ms
rtt min/avg/max/mdev = 0.122/0.123/0.125/0.001 ms

caf-pool06-poaccdcaf01-wn029:~ # /bin/emc/scaleio/drv_cfg --query_mdms
Retrieved 1 mdm(s)
MDM-ID 31f837752eaaa40f SDC ID 8dfe8e6700000031 INSTALLATION ID 5790362551a7f62c IPs [0]-172.30.0.15

Removing the taint manually doesn’t bring it back.
Sometimes we also notice the errors below (they should be part of the full logs provided earlier).

time="2023-04-13T17:19:41Z" level=error msg="could not unmarshal driver path key from nodeid annotation {"topolvm.cybozu.com":"pool08-riccdcaf01-wn042"}: to json: unexpected end of JSON input"
time="2023-04-13T17:19:44Z" level=error msg="could not unmarshal driver path key from nodeid annotation {"topolvm.cybozu.com":"-ricaf01-wn038"}: to json: unexpected end of JSON input"
time="2023-04-13T17:19:47Z" level=error msg="could not unmarshal driver path key from nodeid annotation {"topolvm.cybozu.com":"-pool09-ricaf01-wn047"}: to json: unexpected end of JSON input"
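One possible reading of these errors: the annotations printed in them contain only the `topolvm.cybozu.com` key, whereas the annotation in the earlier log excerpt also carries `csi-vxflexos.dellemc.com`. In Go, decoding an empty value with `encoding/json` fails with exactly "unexpected end of JSON input", which is what a lookup of a missing key would produce. The sketch below reproduces that behavior; it is an assumption about the failure mode, not podmon's actual code, and `NodeIDFromAnnotation` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// driverKey is the key present in the healthy annotation shown earlier.
const driverKey = "csi-vxflexos.dellemc.com"

// NodeIDFromAnnotation looks up driverKey in a nodeid annotation (a JSON
// object mapping driver name to node ID) using a two-step decode.
// Hypothetical helper for illustration only.
func NodeIDFromAnnotation(annotation string) (string, error) {
	var byDriver map[string]json.RawMessage
	if err := json.Unmarshal([]byte(annotation), &byDriver); err != nil {
		return "", fmt.Errorf("annotation is not a JSON object: %w", err)
	}
	// A missing key yields a nil RawMessage; decoding empty input fails
	// with "unexpected end of JSON input".
	var nodeID string
	if err := json.Unmarshal(byDriver[driverKey], &nodeID); err != nil {
		return "", fmt.Errorf("could not unmarshal driver key %q: %w", driverKey, err)
	}
	return nodeID, nil
}

func main() {
	// Annotation as printed in the error log: the driver key is absent.
	missing := `{"topolvm.cybozu.com":"pool08-riccdcaf01-wn042"}`
	_, err := NodeIDFromAnnotation(missing)
	fmt.Println(err)

	// Annotation as printed in the earlier excerpt: the driver key is present.
	present := `{"csi-vxflexos.dellemc.com":"578CD2B2-EFCB-4332-B7C3-320CB50B6055","topolvm.cybozu.com":"caf-pool06-poaccdcaf01-wn029"}`
	id, err := NodeIDFromAnnotation(present)
	fmt.Println(id, err)
}
```

If this reading is right, the errors would indicate that the driver had not (yet) registered its node ID on those nodes when Podmon read the annotation, for example shortly after a reboot or a CSI pod restart.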

Additional Environment Information

No response

Steps to Reproduce

a. Drain the node and reboot it multiple times.
b. Restarting the CSI pods sometimes also triggers it.

Expected Behavior

CSI driver shouldn’t taint the node if it is healthy.

CSM Driver(s)

mdm_config:v1.2.0-19-98abcff9
podmon:v1.4.0-4-7e431865
csi-attacher:v3.5.0-50-98abcff9
csi-provisioner:v3.2.1-50-98abcff9
csi-snapshotter:v6.0.1-50-98abcff9
csi-resizer:v1.5.0-50-98abcff9
csi-vxflexos:v2.5.0-1-7e431865

Installation Type

Operator (not Helm)

Container Storage Modules Enabled

Podmon (v1.4.0-4-7e431865) is enabled

Container Orchestrator

Ericsson Cloud Container Distribution (ECCD), 2.24.1

Operating System

SUSE Linux Enterprise Server 15 SP4

@kumarnir added the needs-triage (Issue requires triage.) and type/bug (Something isn't working. This is the default label associated with a bug issue.) labels on Apr 17, 2023
@csmbot
Collaborator

csmbot commented Apr 17, 2023

@kumarnir: Thank you for submitting this issue!

The issue is currently awaiting triage. Please make sure you have given us as much context as possible.

If the maintainers determine this is a relevant issue, they will remove the needs-triage label and assign an appropriate priority label.


We want your feedback! If you have any questions or suggestions regarding our contributing process/workflow, please reach out to us at container.storage.modules@dell.com.

@shaynafinocchiaro
Collaborator

Closing issue as a fix has been provided. Please reopen if the issue persists.
