
Skaffold Deploy Fails on Readiness Probe Failures of Terminating Pods #6758

Closed
ynouri opened this issue Oct 22, 2021 · 7 comments
Labels
area/deploy area/docs area/status-check kind/bug Something isn't working kind/usability needs-reproduction needs reproduction from the maintainers to validate the issue is truly a skaffold bug priority/p2 May take a couple of releases

Comments

@ynouri

ynouri commented Oct 22, 2021

Expected behavior

When running skaffold deploy, status checks / stabilization shouldn't fail because the readiness probe of a terminating pod is failing. As far as I understand, the following discussion establishes that readiness probe failures for terminating pods are expected.

kubernetes/kubernetes#98571 (comment)
kubernetes/kubernetes#52817

Actual behavior

Skaffold fails to stabilize when a terminating pod has a readiness probe failure.

Information

Steps to reproduce the behavior

  1. Create a Kubernetes Deployment with a readiness probe
  2. Deploy once so the Deployment already has some pods running
  3. Run skaffold deploy
  4. During the rollout, the previous pods are terminated and their readiness probes fail
  5. Skaffold fails to stabilize because of the terminating pods' readiness probe failures, even though the new pods are fine
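For reference, step 1 could look like the following minimal manifest. This is a sketch, not the reporter's actual configuration; the name, image, probe endpoint, and ports are all illustrative:

```yaml
# Hypothetical Deployment with a readiness probe (step 1 above).
# All names and probe settings are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: my-service:latest
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
      # After SIGTERM, pods get this long to exit before being SIGKILLed.
      terminationGracePeriodSeconds: 30
```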

Logs

/opt/skaffold deploy \
    --filename .my-service/skaffold.yaml \
    --images=... \
    --force=false

...

Starting deploy...

...

 - deployment.apps/my-service configured

...

Waiting for deployments to stabilize...
 - pave-api:deployment/my-service: Readiness probe failed: 
    - pave-api:pod/my-service-8c5d4f4f4-t8c7j: Readiness probe failed: 
    - pave-api:pod/my-service-bbcc9d978-ksh2q: creating container my-service
 
...

 - pave-api:deployment/my-service: Readiness probe failed: 
    - pave-api:pod/my-service-8c5d4f4f4-t8c7j: Readiness probe failed: 

...

 - pave-api:deployment/my-service: Readiness probe failed: 
    - pave-api:pod/my-service-8c5d4f4f4-t8c7j: Readiness probe failed: 

    - pave-api:pod/my-service-8c5d4f4f4-t8c7j: Readiness probe failed: 
 - pave-api:deployment/my-service: container my-service terminated with exit code 137
    - pave-api:pod/my-service-8c5d4f4f4-859sd: container my-service terminated with exit code 137
      > [my-service-8c5d4f4f4-859sd my-service] unable to retrieve container logs for docker://6951e8c119efef055b8678a30e9c30ebee0c7d5848d841164248a65967900870
    - pave-api:pod/my-service-8c5d4f4f4-t8c7j: Readiness probe failed: 
 - pave-api:deployment/my-service failed. Error: container my-service terminated with exit code 137.

In the logs above:

  • pod/my-service-8c5d4f4f4-t8c7j is terminating. Pod logs indicate that the pod receives a TERM signal and terminates gracefully, with some expected readiness probe failures along the way
  • pod/my-service-bbcc9d978-ksh2q is starting
@ynouri
Author

ynouri commented Oct 22, 2021

After writing this issue I realized that our pod is actually exiting with code 137, meaning the process was killed by SIGKILL. This indicates the pod may not be terminating gracefully. We will keep investigating on our end.
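As background on the 137: shells and container runtimes report death-by-signal as 128 + the signal number, so 137 = 128 + 9 (SIGKILL), which Kubernetes sends when a pod has not exited within its termination grace period after the initial SIGTERM. A small sketch (not from the original report) demonstrating the convention:

```python
import signal
import subprocess

# Start a long-running child process and kill it with SIGKILL (signal 9).
proc = subprocess.Popen(["sleep", "60"])
proc.send_signal(signal.SIGKILL)
proc.wait()

# Python reports death-by-signal as a negative return code; shells and
# container runtimes report it as 128 + signal number = 137 for SIGKILL.
print(proc.returncode)        # -9
print(128 + signal.SIGKILL)   # 137
```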

@ynouri
Author

ynouri commented Oct 23, 2021

After more investigation, part of the issue was related to the Python watchtower module hanging (see kislyuk/watchtower#141), which prevented our Gunicorn pods from terminating gracefully on the TERM signal. Disabling watchtower in the app fixed the problem: once the pods terminate gracefully, Skaffold status checks pass.
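For readers hitting similar symptoms: the general pattern for cooperating with Kubernetes' SIGTERM-then-SIGKILL sequence is to install a SIGTERM handler that tells background work to wind down, so the process exits within the grace period. This is a generic sketch, not the reporter's fix (their hang was inside watchtower itself, so removing the library was the actual remedy):

```python
import os
import signal
import threading

# Event that long-running background work (e.g. a log-shipping thread)
# should poll so it can stop promptly when the pod is asked to terminate.
shutdown = threading.Event()

def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM first; a process still alive after
    # terminationGracePeriodSeconds is SIGKILLed (exit code 137).
    shutdown.set()

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate Kubernetes sending SIGTERM to this process.
os.kill(os.getpid(), signal.SIGTERM)
assert shutdown.is_set()  # background work now knows to wind down
```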

I am still curious, though: should Skaffold status checks fail because of issues in terminating pods? Even though there was an underlying problem in our application, it seems it should not matter whether the old pods exit gracefully. Only the health of the new pods should matter; in our case the new pods were fine, and it would have seemed reasonable for Skaffold to deem the deployment stable.

@aaron-prindle aaron-prindle added needs-reproduction needs reproduction from the maintainers to validate the issue is truly a skaffold bug area/status-check area/deploy kind/bug Something isn't working kind/usability area/docs triage/discuss Items for discussion labels Oct 25, 2021
@aaron-prindle
Contributor

aaron-prindle commented Oct 25, 2021

I was not able to reproduce this issue with what I believe was a similar setup. Can you try the latest version of skaffold, v1.33.1 (https://github.com/GoogleContainerTools/skaffold/releases), and see if you can still reproduce it?

Also, are you explicitly setting a skaffold.dev/run-id when doing the skaffold deploy? Currently a skaffold deploy should not have any reference to a previous deploy: skaffold creates a new run-id on each run, so no deploy state should be carried over.

@ynouri
Author

ynouri commented Oct 25, 2021

Also, are you explicitly setting a skaffold.dev/run-id when doing the skaffold deploy? Currently it should not be the case that a skaffold deploy would have any reference to a previous deploy as skaffold creates a new run-id on each run, no deploy state should be carried over.

Yes, you are right, we are! We are using the following environment variable during skaffold deploy:

SKAFFOLD_LABEL: skaffold.dev/run-id=PREVENT_REDEPLOY

We recently started doing this as a workaround to #2745. Without this flag, pods were always restarted on deploy (because of a changing pod template hash), even though no changes happened in the original deployment definition.

@aaron-prindle
Contributor

aaron-prindle commented Oct 25, 2021

Ah, I see: the issue you are encountering is related to both deploys having the same value for skaffold.dev/run-id. This run-id tells skaffold which resources to monitor (skaffold adds the label to resources it deploys and looks for it when status-checking, etc.), which is why the subsequent deploy monitors the previous deploy's resources. @briandealwis, are you familiar with any other workarounds for #2745 that might work here and not have this issue? Is there a possible workaround for the issue @ynouri mentioned:

Without this flag, pods were always restarted on deploy (because of a changing pod template hash), even though no changes happened in the original deployment definition.
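To make the mechanism concrete: status checking matches resources on the skaffold.dev/run-id label, so pinning it to a fixed value makes a new deploy monitor the previous deploy's pods as well. A sketch of the label on a deployed resource's metadata; the run-id value shown is illustrative (normally a fresh UUID per run):

```yaml
# Sketch of the label skaffold adds to resources it deploys.
# With SKAFFOLD_LABEL pinning it to PREVENT_REDEPLOY, old and new
# resources share the same run-id and are status-checked together.
metadata:
  labels:
    app: my-service
    skaffold.dev/run-id: PREVENT_REDEPLOY   # normally a fresh UUID per run
```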

@briandealwis
Member

@ynouri there have been a lot of changes to Skaffold's status checking since 1.22.0, so I'd urge you to update. The fix in #6370 (in v1.30.0) seems particularly pertinent here.

@nkubala nkubala added priority/p2 May take a couple of releases and removed triage/discuss Items for discussion labels Oct 29, 2021
@ynouri
Author

ynouri commented Nov 11, 2021

@briandealwis @aaron-prindle,

We just encountered the same issue again with a different deployment (not the Python deployment, which we had worked around by removing the watchtower library that was preventing quick shutdown of the pods).

This time we tried @briandealwis's suggestion to update to the latest version and it fixed our skaffold deploy.

Thank you for your help!
