[BUG]: CSI node pod crash after replacing OCP ingress certificate or restarting kubectl service #1310
Labels
area/csi-powerflex
Issue pertains to the CSI Driver for Dell EMC PowerFlex
type/bug
Something isn't working. This is the default label associated with a bug issue.
Milestone
Bug Description
CSI node pod crash after replacing OCP ingress certificate or restarting kubectl after 15-16 mins of driver installation
Logs
we install powerflex CSI v2.9.2 on OCP 4.14.x. When we replaced the OCP ingress certificate following this RH document https://docs.openshift.com/container-platform/4.14/security/certificates/replacing-default-ingress-certificate.html,
the CSI node pods (Daemonset) crashed with "CrashLoopBackOff" error, these pods run normally before OCP ingress certificate replacement. And the recovery action to fix this is to restart CSI node pods.
$ oc logs csi-vxflexos-node-46p5t -c registrar
I0402 09:22:26.873562 1 main.go:135] Version: v2.9.1
I0402 09:22:26.873620 1 main.go:136] Running node-driver-registrar in mode=
I0402 09:22:26.873627 1 main.go:157] Attempting to open a gRPC connection with: "/csi/csi_sock"
I0402 09:22:26.873647 1 connection.go:213] Connecting to unix:///csi/csi_sock
I0402 09:22:26.874326 1 main.go:164] Calling CSI driver to discover driver name
I0402 09:22:26.874355 1 connection.go:242] GRPC call: /csi.v1.Identity/GetPluginInfo
I0402 09:22:26.874359 1 connection.go:243] GRPC request: {}
I0402 09:22:26.877563 1 connection.go:249] GRPC response: {"manifest":
{"commit":"dd32cdb24ec41d57fd6011fb35c9c2ac325e1128","formed":"Thu, 22 Feb 2024 07:36:58 UTC","semver":"2.9.2","url":" http://github.com/dell/csi-vxflexos"}
,"name":"csi-vxflexos.dellemc.com","vendor_version":"2.9.2"}
I0402 09:22:26.877588 1 connection.go:250] GRPC error:
I0402 09:22:26.877596 1 main.go:173] CSI driver name: "csi-vxflexos.dellemc.com"
I0402 09:22:26.877642 1 node_register.go:55] Starting Registration Server at: /registration/csi-vxflexos.dellemc.com-reg.sock
I0402 09:22:26.877821 1 node_register.go:64] Registration Server started at: /registration/csi-vxflexos.dellemc.com-reg.sock
I0402 09:22:26.877877 1 node_register.go:88] Skipping HTTP server because endpoint is set to: ""
I0402 09:22:28.840644 1 main.go:90] Received GetInfo call: &InfoRequest{}
I0402 09:22:28.856229 1 main.go:101] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error – plugin registration failed with err: rpc error: code = Unknown desc = 401 Unauthorized,}
E0402 09:22:28.856259 1 main.go:103] Registration process failed with error: RegisterPlugin error – plugin registration failed with err: rpc error: code = Unknown desc = 401 Unauthorized, restarting registration container.
And recently I have some new founds. It's easily to be reproduced by restart crio or kubelet service on own node of CSI node pod. And replacing OCP Ingress certificate will trigger either actions so it exposed this issue. I set these 2 ENV in CSI driver container to debug this issue:
and got this log:
time="2024-05-14T06:25:52Z" level=info msg="/csi.v1.Node/NodeGetInfo: REQ 0007: XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2024-05-14T06:25:52Z" level=info msg="Probing all arrays. Number of arrays: 1"
time="2024-05-14T06:25:52Z" level=info msg="default array is set to array ID: 3xxxxxxxxxxxxx50f"
time="2024-05-14T06:25:52Z" level=info msg="3cxxxxxxxxxxxf is the default array, skipping VolumePrefixToSystems map update. \n"
time="2024-05-14T06:25:52Z" level=info msg="array 3xxxxxxxxxxxxx50f probed successfully"
time="2024-05-14T06:25:52Z" level=debug msg="\n -------------------------- GOSCALEIO HTTP REQUEST -------------------------\n GET /api/version HTTP/1.1\n Host: 2xxxxxxx4\n \n"
time="2024-05-14T06:25:52Z" level=debug msg="\n -------------------------- GOSCALEIO HTTP RESPONSE -------------------------\n HTTP/1.1 401 Unauthorized\n Content-Length: 172\n Connection: keep-alive\n Content-Type: text/html\n Date: Tue, 14 May 2024 06:25:52 GMT\n Strict-Transport-Security: max-age=15724800; includeSubDomains\n \n \n <title>401 Authorization Required</title>\n \n
401 Authorization Required
\nnginx\n \n "
time="2024-05-14T06:25:52Z" level=info msg="/csi.v1.Node/NodeGetInfo: REP 0007: 401 Unauthorized"
so it seems hit error in this line:
https://github.com/dell/csi-powerflex/blob/dd32cdb24ec41d57fd6011fb35c9c2ac325e1128/service/service.go#L497
I tried to install vxflexos CSI v2.10.0 and still hit the same issue. It seems either CSI driver or goscaleio code needs to handle this special case. Please help on this. Thanks.
Screenshots
No response
Additional Environment Information
No response
Steps to Reproduce
Restart kubectl after 15-16 mins FROM driver installation
Expected Behavior
Driver should be in running state
CSM Driver(s)
csi-powerflex
Installation Type
helm
Container Storage Modules Enabled
No response
Container Orchestrator
K8, ocp 4.14
Operating System
rhel, core os
The text was updated successfully, but these errors were encountered: