Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: CSI node pod crash after replacing OCP ingress certificate or restarting kubectl service #1310

Closed
adarsh-dell opened this issue May 28, 2024 · 1 comment
Assignees
Labels
area/csi-powerflex Issue pertains to the CSI Driver for Dell EMC PowerFlex type/bug Something isn't working. This is the default label associated with a bug issue.
Milestone

Comments

@adarsh-dell
Copy link
Contributor

Bug Description

CSI node pod crash after replacing OCP ingress certificate or restarting kubectl after 15-16 mins of driver installation

Logs

we install powerflex CSI v2.9.2 on OCP 4.14.x. When we replaced the OCP ingress certificate following this RH document https://docs.openshift.com/container-platform/4.14/security/certificates/replacing-default-ingress-certificate.html,
the CSI node pods (Daemonset) crashed with "CrashLoopBackOff" error, these pods run normally before OCP ingress certificate replacement. And the recovery action to fix this is to restart CSI node pods.

$ oc logs csi-vxflexos-node-46p5t -c registrar
I0402 09:22:26.873562 1 main.go:135] Version: v2.9.1
I0402 09:22:26.873620 1 main.go:136] Running node-driver-registrar in mode=
I0402 09:22:26.873627 1 main.go:157] Attempting to open a gRPC connection with: "/csi/csi_sock"
I0402 09:22:26.873647 1 connection.go:213] Connecting to unix:///csi/csi_sock
I0402 09:22:26.874326 1 main.go:164] Calling CSI driver to discover driver name
I0402 09:22:26.874355 1 connection.go:242] GRPC call: /csi.v1.Identity/GetPluginInfo
I0402 09:22:26.874359 1 connection.go:243] GRPC request: {}
I0402 09:22:26.877563 1 connection.go:249] GRPC response: {"manifest":

{"commit":"dd32cdb24ec41d57fd6011fb35c9c2ac325e1128","formed":"Thu, 22 Feb 2024 07:36:58 UTC","semver":"2.9.2","url":" http://github.com/dell/csi-vxflexos"}
,"name":"csi-vxflexos.dellemc.com","vendor_version":"2.9.2"}
I0402 09:22:26.877588 1 connection.go:250] GRPC error:
I0402 09:22:26.877596 1 main.go:173] CSI driver name: "csi-vxflexos.dellemc.com"
I0402 09:22:26.877642 1 node_register.go:55] Starting Registration Server at: /registration/csi-vxflexos.dellemc.com-reg.sock
I0402 09:22:26.877821 1 node_register.go:64] Registration Server started at: /registration/csi-vxflexos.dellemc.com-reg.sock
I0402 09:22:26.877877 1 node_register.go:88] Skipping HTTP server because endpoint is set to: ""
I0402 09:22:28.840644 1 main.go:90] Received GetInfo call: &InfoRequest{}
I0402 09:22:28.856229 1 main.go:101] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error – plugin registration failed with err: rpc error: code = Unknown desc = 401 Unauthorized,}
E0402 09:22:28.856259 1 main.go:103] Registration process failed with error: RegisterPlugin error – plugin registration failed with err: rpc error: code = Unknown desc = 401 Unauthorized, restarting registration container.

And recently I have some new founds. It's easily to be reproduced by restart crio or kubelet service on own node of CSI node pod. And replacing OCP Ingress certificate will trigger either actions so it exposed this issue. I set these 2 ENV in CSI driver container to debug this issue:

  - name: GOSCALEIO_DEBUG
      value: "true"
    - name: GOSCALEIO_SHOWHTTP
      value: "true"

and got this log:

time="2024-05-14T06:25:52Z" level=info msg="/csi.v1.Node/NodeGetInfo: REQ 0007: XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2024-05-14T06:25:52Z" level=info msg="Probing all arrays. Number of arrays: 1"
time="2024-05-14T06:25:52Z" level=info msg="default array is set to array ID: 3xxxxxxxxxxxxx50f"
time="2024-05-14T06:25:52Z" level=info msg="3cxxxxxxxxxxxf is the default array, skipping VolumePrefixToSystems map update. \n"
time="2024-05-14T06:25:52Z" level=info msg="array 3xxxxxxxxxxxxx50f probed successfully"
time="2024-05-14T06:25:52Z" level=debug msg="\n -------------------------- GOSCALEIO HTTP REQUEST -------------------------\n GET /api/version HTTP/1.1\n Host: 2xxxxxxx4\n \n"
time="2024-05-14T06:25:52Z" level=debug msg="\n -------------------------- GOSCALEIO HTTP RESPONSE -------------------------\n HTTP/1.1 401 Unauthorized\n Content-Length: 172\n Connection: keep-alive\n Content-Type: text/html\n Date: Tue, 14 May 2024 06:25:52 GMT\n Strict-Transport-Security: max-age=15724800; includeSubDomains\n \n \n <title>401 Authorization Required</title>\n \n

401 Authorization Required

\n
nginx\n \n "
time="2024-05-14T06:25:52Z" level=info msg="/csi.v1.Node/NodeGetInfo: REP 0007: 401 Unauthorized"

so it seems hit error in this line:
https://github.com/dell/csi-powerflex/blob/dd32cdb24ec41d57fd6011fb35c9c2ac325e1128/service/service.go#L497

I tried to install vxflexos CSI v2.10.0 and still hit the same issue. It seems either CSI driver or goscaleio code needs to handle this special case. Please help on this. Thanks.

Screenshots

No response

Additional Environment Information

No response

Steps to Reproduce

Restart kubectl after 15-16 mins FROM driver installation

Expected Behavior

Driver should be in running state

CSM Driver(s)

csi-powerflex

Installation Type

helm

Container Storage Modules Enabled

No response

Container Orchestrator

K8, ocp 4.14

Operating System

rhel, core os

@adarsh-dell adarsh-dell added needs-triage Issue requires triage. type/bug Something isn't working. This is the default label associated with a bug issue. area/csi-powerflex Issue pertains to the CSI Driver for Dell EMC PowerFlex and removed needs-triage Issue requires triage. labels May 28, 2024
@adarsh-dell adarsh-dell added this to the v1.11.0 milestone May 28, 2024
@adarsh-dell
Copy link
Contributor Author

Will be available in CSM 1.11.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/csi-powerflex Issue pertains to the CSI Driver for Dell EMC PowerFlex type/bug Something isn't working. This is the default label associated with a bug issue.
Projects
None yet
Development

No branches or pull requests

2 participants