Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node driver registrar says "Lost connection" and do nothing #139

Open
jim3ma opened this issue Mar 26, 2021 · 20 comments
Open

Node driver registrar says "Lost connection" and do nothing #139

jim3ma opened this issue Mar 26, 2021 · 20 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@jim3ma
Copy link

jim3ma commented Mar 26, 2021

Log:

I0323 21:20:55.915743       1 main.go:110] Version: unknown
I0323 21:20:55.915816       1 main.go:120] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0323 21:20:55.915833       1 connection.go:151] Connecting to unix:///csi/csi.sock
I0323 21:20:55.916332       1 main.go:127] Calling CSI driver to discover driver name
I0323 21:20:55.916348       1 connection.go:180] GRPC call: /csi.v1.Identity/GetPluginInfo
I0323 21:20:55.916354       1 connection.go:181] GRPC request: {}
I0323 21:20:55.917907       1 connection.go:183] GRPC response: {"name":"x","vendor_version":"0.3.0"}
I0323 21:20:55.918413       1 connection.go:184] GRPC error: <nil>
I0323 21:20:55.918421       1 main.go:137] CSI driver name: "x"
I0323 21:20:55.918477       1 node_register.go:58] Starting Registration Server at: /registration/x-reg.sock
I0323 21:20:55.918603       1 node_register.go:67] Registration Server started at: /registration/x-reg.sock
I0323 21:20:57.567609       1 main.go:77] Received GetInfo call: &InfoRequest{}
I0323 21:20:57.671681       1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
E0323 21:21:15.690492       1 connection.go:129] Lost connection to unix:///csi/csi.sock.

After restart container, it works.

k8s-ci-robot pushed a commit that referenced this issue Mar 31, 2021
a1e11275 Merge pull request #139 from pohly/kind-for-kubernetes-latest
1c0fb096 prow.sh: use KinD main for latest Kubernetes
1d77cfcb Merge pull request #138 from pohly/kind-update-0.10
bff2fb7e prow.sh: KinD 0.10.0

git-subtree-dir: release-tools
git-subtree-split: a1e11275b5a4febd6ad21beeac730e22c579825b
@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@aramase
Copy link
Contributor

aramase commented Jul 12, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2021
@toshipp
Copy link

toshipp commented Oct 7, 2021

To restart the container automatically, #152 would help you.

@yatanasov-hs
Copy link

I am experiencing the same behavior. Given all 3 containers (node-driver-registrar, secrets-store and liveness-probe) are running, if secrets-store gets restarted, node-driver-registrar loses connection to unix:///csi/csi.sock.

Node-driver-registrar logs:

I1012 11:51:22.299487 1 main.go:164] Version: v2.3.0
I1012 11:51:22.299515 1 main.go:165] Running node-driver-registrar in mode=registration
I1012 11:51:22.299913 1 main.go:189] Attempting to open a gRPC connection with: "/csi/csi.sock"
I1012 11:51:22.299936 1 connection.go:154] Connecting to unix:///csi/csi.sock
I1012 11:51:22.300436 1 main.go:196] Calling CSI driver to discover driver name
I1012 11:51:22.300451 1 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
I1012 11:51:22.300456 1 connection.go:184] GRPC request: {}
I1012 11:51:22.303030 1 connection.go:186] GRPC response: {"name":"secrets-store.csi.k8s.io","vendor_version":"v0.3.0"}
I1012 11:51:22.303071 1 connection.go:187] GRPC error:
I1012 11:51:22.303078 1 main.go:206] CSI driver name: "secrets-store.csi.k8s.io"
I1012 11:51:22.303115 1 node_register.go:52] Starting Registration Server at: /registration/secrets-store.csi.k8s.io-reg.sock
I1012 11:51:22.303301 1 node_register.go:61] Registration Server started at: /registration/secrets-store.csi.k8s.io-reg.sock
I1012 11:51:22.303447 1 node_register.go:91] Skipping healthz server because HTTP endpoint is set to: ""
I1012 11:51:23.540046 1 main.go:100] Received GetInfo call: &InfoRequest{}
I1012 11:51:23.540214 1 main.go:107] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/csi-secrets-store/registration"
I1012 11:51:23.577520 1 main.go:118] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
E1012 11:51:25.218944 1 connection.go:132] Lost connection to unix:///csi/csi.sock.

Secret store logs:

I1012 11:53:06.149183 1 exporter.go:33] metrics backend: prometheus
I1012 11:53:06.205017 1 main.go:169] "provider health check enabled" interval="1m0s"
I1012 11:53:06.205636 1 secrets-store.go:40] "Initializing Secrets Store CSI Driver" driver="secrets-store.csi.k8s.io" version="v0.3.0" buildTime="2021-08-30-18:38"
I1012 11:53:06.205852 1 server.go:117] "Listening for connections" address="//csi/csi.sock"
I1012 11:53:06.205923 1 main.go:174] starting manager
I1012 11:53:06.206014 1 reconciler.go:131] starting rotation reconciler with poll interval: 10s

Liveness probe logs:

I1012 11:51:22.555987 1 main.go:149] calling CSI driver to discover driver name
I1012 11:51:22.558382 1 main.go:155] CSI driver name: "secrets-store.csi.k8s.io"
I1012 11:51:22.558401 1 main.go:183] ServeMux listening at "0.0.0.0:9808"
E1012 11:52:05.885988 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
W1012 11:52:12.886013 1 connection.go:173] Still connecting to unix:///csi/csi.sock
E1012 11:52:20.885852 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
W1012 11:52:22.885321 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:27.886239 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:32.885371 1 connection.go:173] Still connecting to unix:///csi/csi.sock
E1012 11:52:35.885665 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
W1012 11:52:37.886239 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:42.885652 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:42.885679 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:47.885302 1 connection.go:173] Still connecting to unix:///csi/csi.sock
E1012 11:52:50.885540 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
W1012 11:52:52.885269 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:52.886357 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:57.885542 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:52:57.885565 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:53:02.885251 1 connection.go:173] Still connecting to unix:///csi/csi.sock
W1012 11:53:02.885407 1 connection.go:173] Still connecting to unix:///csi/csi.sock
E1012 11:53:05.885994 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded

To restart the container automatically, #152 would help you.

That does not help, already have this:

  • name: node-driver-registrar
    image: "..."
    args:
    - --v=5
    - --csi-address=/csi/csi.sock
    - --kubelet-registration-path=/var/lib/kubelet/plugins/csi-secrets-store/csi.sock
    livenessProbe:
    exec:
    command:
    - /csi-node-driver-registrar
    - --kubelet-registration-path=/var/lib/kubelet/plugins/csi-secrets-store/csi.sock
    - --mode=kubelet-registration-probe
    initialDelaySeconds: 3

@mauriciopoppe
Copy link
Member

To restart the container automatically, #152 would help you.

This helps during the startup of node-driver-registrar if the kubelet plugin registration fails but these logs show that the kubelet plugin registration succeeded so the workaround in #152 can't help here:

I0323 21:20:57.567609       1 main.go:77] Received GetInfo call: &InfoRequest{}
I0323 21:20:57.671681       1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}

@yatanasov-hs thanks for the logs, I see that liveness probe attempted to connect after node-driver-registrar logged lost connection, looking at the timestamps:

# node-driver-registrar
E1012 11:51:25.218944 1 connection.go:132] Lost connection to unix:///csi/csi.sock

# livenessprobe
E1012 11:52:05.885988 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
W1012 11:52:12.886013 1 connection.go:173] Still connecting to unix:///csi/csi.sock

The error Lost connection to {{address}} comes from https://github.com/kubernetes-csi/csi-lib-utils/blob/a2ccb594bb74b61a0655d3431c6365a6a567c7af/connection/connection.go?#L132, which by default will attempt to reconnect, I think I can add a log after a successful dial to rule out the dial timeout.

Both node-driver-registrar or livenessprobe should be able to reestablish the connection to csi.sock if they lost it so still not sure why they couldn't.

@chelobarreto
Copy link

I'm having same problem after upgrade okd 4.7 to 4.8.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 24, 2022
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 23, 2022
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fradeve
Copy link

fradeve commented May 31, 2023

Hi, I happened to see the same behaviour on k8s 1.24 and quay.io/k8scsi/csi-node-driver-registrar:v1.2.0, any idea of what might be causing this?

@mauriciopoppe
Copy link
Member

/reopen
/remove-lifecycle rotten
/lifecycle frozen
/kind bug

@k8s-ci-robot k8s-ci-robot reopened this Jun 1, 2023
@k8s-ci-robot
Copy link
Contributor

@mauriciopoppe: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten
/lifecycle frozen
/kind bug

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jun 1, 2023
@BertelBB
Copy link

I got the same issue on AKS cluster in version 1.27.3. The csi-azurefile-node pods started failing on one of the node pools (1 of 3 node pools). Still looking into what caused it but best guess atm is something failing to startup properly after node upgrade during automatic maintenance window.

image

@Rohlik
Copy link

Rohlik commented Nov 15, 2023

I am also observing this on our AWS EKS 1.27.4. Exactly the same output as the comment above.

@basaveswar-kureti
Copy link

@gnufied Does this PR #322 fixes the current(#139) issue ?

@gnufied
Copy link
Contributor

gnufied commented Dec 13, 2023

It should help with yes.

@basaveswar-kureti
Copy link

Thanks @gnufied for the confirmation.
We tested this internally on a bunch of our k8s clusters which is using csi-node-driver-registrar version of v2.9.2 (so anything >v2.9.0) has not seen this issue.

@mauriciopoppe
Copy link
Member

mauriciopoppe commented Dec 13, 2023

Thanks @gnufied, I wanted to expand on the effect of #322 (released in https://github.com/kubernetes-csi/node-driver-registrar/releases/tag/v2.9.0) and this issue.

The possible scenario where we might see errors like #139 (comment) or #139 (comment):

  • E1012 11:52:05.885988 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
  • W1012 11:52:12.886013 1 connection.go:173] Still connecting to unix:///csi/csi.sock
  • E1012 11:51:25.218944 1 connection.go:132] Lost connection to unix:///csi/csi.sock.

Is if the csi.sock owned by the CSI Driver node component suddenly dies which could happen because the CSI Driver container restarted. Let's assume that the CSI Driver node component was restarted, the usual behavior for node-driver-registrar is to show Lost Connection to ... and to attempt reconnection, however it looks like after the restart, the CSI Driver node component creates the socket and node-driver-registrar can see the socket file available but can't connect to it.

#322 is adding a throughout check on the /healthz endpoint so that in addition to checking the existence of the socket file node-driver-registrar makes a successful CSI call (NodeGetInfo) against the socket, if the call fails consistently then eventually the container will be restarted. Note that for this workaround the deployment manifests must enable a livenessProbe probing the /healthz endpoint, for more info about the /healthz setup please check this example in the README.

@gnufied
Copy link
Contributor

gnufied commented Jan 10, 2024

@mauriciopoppe sorry for late response, but the way I have seen this issue to happen is:

  1. CSI driver pod and registrar are running per-usual.
  2. Something causes another registrar container to be created (In my case, this happened because someone was trying to debug the driver-pod and created parallel container)
  3. When this happens, the new registrar creates a new registration socket file (which replaces old one, created in step#1) and since driver is still running, kubelet sees this new registration socket and everything works fine.
  4. Now when this debug container is terminated, it just leaves the registration socket file. But nobody is listening on it and since registrar doesn't have a mechanism of retrying, kubelet sees driver as removed.

So #322 mainly fixes issue with stale registration sockets, assuming health check is enabled on registration socket.

Now, I know that this issue has bugs linked where csi-driver socket is stale, but in current iteration of node-registrar, we only connect to csi-driver socket once on startup. We are in-fact closing the connection after retrieving driver-name and if for some reason node registrar can't connect to csi driver socket, it will simply exit on startup and restart itself. But as such - #322 does not fixes stale csi driver sockets issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.
Projects
None yet
Development

No branches or pull requests