
Liveness Probe Failure #336

Closed
reddy-s opened this issue Feb 11, 2021 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@reddy-s

reddy-s commented Feb 11, 2021

/kind bug

Issue
All pods that are part of the efs-csi-node daemon set log the following:

efs-csi-node-76fhh liveness-probe I0211 14:22:19.953274       1 connection.go:180] GRPC call: /csi.v1.Identity/Probe
efs-csi-node-76fhh liveness-probe I0211 14:22:19.953279       1 connection.go:181] GRPC request: {}
efs-csi-node-76fhh liveness-probe I0211 14:22:19.953980       1 connection.go:183] GRPC response: {}
efs-csi-node-76fhh liveness-probe I0211 14:22:19.954350       1 connection.go:184] GRPC error: <nil>
efs-csi-node-txznc liveness-probe I0211 14:22:20.084964       1 connection.go:180] GRPC call: /csi.v1.Identity/Probe
efs-csi-node-txznc liveness-probe I0211 14:22:20.084968       1 connection.go:181] GRPC request: {}
efs-csi-node-txznc liveness-probe I0211 14:22:20.085787       1 connection.go:183] GRPC response: {}
efs-csi-node-txznc liveness-probe I0211 14:22:20.086315       1 connection.go:184] GRPC error: <nil>

I tried ignoring this and went ahead and mounted an EFS volume to one of the pods, but the mount failed.

Reproducing the issue

  • Deploy the helm chart using
helm upgrade --install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver
  • Default values used for the helm chart

Environment

  • Kubernetes version: 1.18
  • Chart version: aws-efs-csi-driver-1.1.1
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 11, 2021
@anaclaudiar

anaclaudiar commented Feb 18, 2021

I have kind of the same problem: the mount doesn't fail, but it keeps triggering monitoring alerts because of "Liveness probe failed: HTTP probe failed with statuscode: 500".
Kubernetes version: 1.16
Chart version: aws-efs-csi-driver-1.1.1

@timrcoulson

I had a similar issue and increased the initial delay on the probe, and things have settled. It might be worth making some of that configurable through the Helm chart.
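For reference, a sketch of the kind of change being described. The field names below are standard Kubernetes probe settings, but the specific values are illustrative guesses at what "increase the initial delay" means here, and the chart may not have exposed them as values at the time:

```yaml
# Sketch only: a relaxed liveness probe on the efs-csi-node daemon set.
# Field names are standard Kubernetes; the values are illustrative, not
# the chart's defaults.
livenessProbe:
  httpGet:
    path: /healthz
    port: healthz
  initialDelaySeconds: 30   # give the driver longer to start serving /healthz
  timeoutSeconds: 10
  periodSeconds: 10
  failureThreshold: 5
```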

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 10, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 9, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@joegraviton

I had a similar issue and increased the initial delay on the probe, and things have settled. It might be worth making some of that configurable through the Helm chart.

I get the same issue; the probe only fails sometimes.
However, all my containers have been running for a week and have never restarted.
I don't understand why the probe is failing with a 500.
Also, increasing the delay won't help in my case since my containers don't restart, right?

@manoelhc

manoelhc commented Mar 17, 2023

Just got the same issue. Any ideas? k8s 1.25

@micolun

micolun commented Sep 13, 2023

Same issue with the AWS EFS CSI Driver add-on.
Kubernetes version 1.26
Addon version v1.5.8-eksbuild.1

restarted 6 times

W0913 15:04:08.076212       1 connection.go:173] Still connecting to unix:///csi/csi.sock
E0913 15:06:35.272240       1 main.go:74] health check failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0913 15:08:17.266731       1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
E0913 15:08:21.565558       1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded

I got the above error messages in the liveness-probe container of both the controller and daemon set pods.

Not sure if it's important, but this is the liveness probe setting. I don't know if it's backwards compatible, but it looks like the image version is for 1.27, not 1.26:

    - name: liveness-probe
      image: >-
        602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/livenessprobe:v2.10.0-eks-1-27-3
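For context, the liveness-probe sidecar probes the driver over the CSI socket, so "Still connecting to unix:///csi/csi.sock" means the efs-plugin container is not answering on that socket yet. A rough sketch of the sidecar wiring, assuming the upstream livenessprobe flags (`--csi-address`, `--health-port`); the volume name here is hypothetical:

```yaml
- name: liveness-probe
  image: >-
    602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/livenessprobe:v2.10.0-eks-1-27-3
  args:
    - --csi-address=/csi/csi.sock   # socket shared with the efs-plugin container
    - --health-port=9809            # port the kubelet's HTTP probe hits
  volumeMounts:
    - name: plugin-dir              # hypothetical volume name, for illustration
      mountPath: /csi
```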

@aleonsan

aleonsan commented Sep 21, 2023

I had the same issue.
Kubernetes version: 1.24 (yes, I know ...)
Chart version: aws-efs-csi-driver-2.4.9 (appVersion 1.6.0)

Doing some reverse engineering, I managed to see that there are images for each EKS version family for livenessprobe, node-driver-registrar, and external-provisioner. Just go to the public ECR gallery https://gallery.ecr.aws/ and you will find them.


They also provide -latest tags for each EKS family (e.g. eks-distro/kubernetes-csi/livenessprobe:v2.10.0-eks-1-24-latest), which is quite handy to avoid having to update the image each time there is a patch.

In my case, the original problem persists:

Failed to establish connection to CSI driver: context deadline exceeded

Hope this helps.
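If you want to try one of those -latest family tags, the image line would look something like this (registry path as shown in the public gallery; an example, not a recommendation):

```yaml
- name: liveness-probe
  image: public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe:v2.10.0-eks-1-24-latest
```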

@quantori-pokidovea

Same issue here.
Kubernetes version 1.28
aws-efs-csi-driver v2.0.3-eksbuild.1

E0608 15:59:57.937486       1 main.go:67] "Failed to establish connection to CSI driver" err="context deadline exceeded"
E0608 16:00:01.881657       1 main.go:77] "Health check failed" err="rpc error: code = Canceled desc = context canceled"
E0608 16:00:05.061272       1 main.go:77] "Health check failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"

@Affan-7

Affan-7 commented Jun 28, 2024

/reopen

Facing the same issue.
Kubernetes version 1.29
aws-efs-csi-driver v2.0.3-eksbuild.1

E0628 06:55:45.994174       1 main.go:77] "Health check failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
E0628 06:56:01.404779       1 main.go:67] "Failed to establish connection to CSI driver" err="context deadline exceeded"
E0628 06:56:08.620869       1 main.go:77] "Health check failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
E0628 06:56:11.561200       1 main.go:77] "Health check failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
E0628 06:56:14.657515       1 main.go:77] "Health check failed" err="rpc error: code = Canceled desc = context canceled"
E0628 06:57:13.948817       1 main.go:77] "Health check failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
E0628 06:58:11.133036       1 main.go:67] "Failed to establish connection to CSI driver" err="context deadline exceeded"
E0628 06:58:17.548904       1 main.go:77] "Health check failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"

@k8s-ci-robot
Contributor

@Affan-7: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Facing the same issue.
Kubernetes version 1.29
aws-efs-csi-driver v2.0.3-eksbuild.1

E0628 06:55:45.994174       1 main.go:77] "Health check failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
E0628 06:56:01.404779       1 main.go:67] "Failed to establish connection to CSI driver" err="context deadline exceeded"
E0628 06:56:08.620869       1 main.go:77] "Health check failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
E0628 06:56:11.561200       1 main.go:77] "Health check failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
E0628 06:56:14.657515       1 main.go:77] "Health check failed" err="rpc error: code = Canceled desc = context canceled"
E0628 06:57:13.948817       1 main.go:77] "Health check failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
E0628 06:58:11.133036       1 main.go:67] "Failed to establish connection to CSI driver" err="context deadline exceeded"
E0628 06:58:17.548904       1 main.go:77] "Health check failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
