
Skaffold Deploy Fails on Readiness Probe Failures of Terminating Pods #6758

Closed
ynouri opened this issue Oct 22, 2021 · 7 comments
Labels
area/deploy area/docs area/status-check kind/bug Something isn't working kind/usability needs-reproduction needs reproduction from the maintainers to validate the issue is truly a skaffold bug priority/p2 May take a couple of releases

Comments

@ynouri

ynouri commented Oct 22, 2021

Expected behavior

When running skaffold deploy, status checks / stabilization shouldn't fail because the readiness probe of a terminating pod is failing. As far as I understand, the following discussion establishes that readiness probe failures for terminating pods are expected.

kubernetes/kubernetes#98571 (comment)
kubernetes/kubernetes#52817

Actual behavior

Skaffold fails to stabilize when a terminating pod has a readiness probe failure.

Information

Steps to reproduce the behavior

  1. Create a Kubernetes Deployment with a readiness probe
  2. Deploy once so the Deployment already has some pods running
  3. Run skaffold deploy
  4. During the rollout, the previous pods are terminated and their readiness probes fail
  5. Skaffold fails to stabilize because of the terminating pods' readiness probe failures, even though the new pods are fine
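For reference, step 1 could look like the following minimal manifest. This is a sketch, not the reporter's actual configuration; the name, image, probe endpoint, and ports are all illustrative:

```yaml
# Hypothetical Deployment with a readiness probe (step 1 above).
# All names and probe settings are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: my-service:latest
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
      # After SIGTERM, pods get this long to exit before being SIGKILLed.
      terminationGracePeriodSeconds: 30
```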

Logs

/opt/skaffold deploy \
    --filename .my-service/skaffold.yaml \
    --images=... \
    --force=false

...

Starting deploy...

...

 - deployment.apps/my-service configured

...

Waiting for deployments to stabilize...
 - pave-api:deployment/my-service: Readiness probe failed: 
    - pave-api:pod/my-service-8c5d4f4f4-t8c7j: Readiness probe failed: 
    - pave-api:pod/my-service-bbcc9d978-ksh2q: creating container my-service
 
...

 - pave-api:deployment/my-service: Readiness probe failed: 
    - pave-api:pod/my-service-8c5d4f4f4-t8c7j: Readiness probe failed: 

...

 - pave-api:deployment/my-service: Readiness probe failed: 
    - pave-api:pod/my-service-8c5d4f4f4-t8c7j: Readiness probe failed: 

    - pave-api:pod/my-service-8c5d4f4f4-t8c7j: Readiness probe failed: 
 - pave-api:deployment/my-service: container my-service terminated with exit code 137
    - pave-api:pod/my-service-8c5d4f4f4-859sd: container my-service terminated with exit code 137
      > [my-service-8c5d4f4f4-859sd my-service] unable to retrieve container logs for docker://6951e8c119efef055b8678a30e9c30ebee0c7d5848d841164248a65967900870
    - pave-api:pod/my-service-8c5d4f4f4-t8c7j: Readiness probe failed: 
 - pave-api:deployment/my-service failed. Error: container my-service terminated with exit code 137.

In the logs above:

  • pod/my-service-8c5d4f4f4-t8c7j is terminating. Pod logs indicate that the pod receives a TERM signal and terminates gracefully, with some expected readiness probe failures along the way
  • pod/my-service-bbcc9d978-ksh2q is starting
@ynouri
Author

ynouri commented Oct 22, 2021

After writing this issue I realized that our pod is actually exiting with code 137, meaning the process was killed by SIGKILL. This indicates the pod may not be terminating gracefully. We will keep investigating on our end.
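As background on the 137: shells and container runtimes report death-by-signal as 128 + the signal number, so 137 = 128 + 9 (SIGKILL), which Kubernetes sends when a pod has not exited within its termination grace period after the initial SIGTERM. A small sketch (not from the original report) demonstrating the convention:

```python
import signal
import subprocess

# Start a long-running child process and kill it with SIGKILL (signal 9).
proc = subprocess.Popen(["sleep", "60"])
proc.send_signal(signal.SIGKILL)
proc.wait()

# Python reports death-by-signal as a negative return code; shells and
# container runtimes report it as 128 + signal number = 137 for SIGKILL.
print(proc.returncode)        # -9
print(128 + signal.SIGKILL)   # 137
```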

@ynouri
Author

ynouri commented Oct 23, 2021

After more investigation, part of the issue was related to the Python watchtower module hanging (see kislyuk/watchtower#141), which prevented our Gunicorn pods from terminating gracefully on the TERM signal. Disabling watchtower in the app fixed the problem: once the pods terminate gracefully, Skaffold status checks pass.
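For readers hitting similar symptoms: the general pattern for cooperating with Kubernetes' SIGTERM-then-SIGKILL sequence is to install a SIGTERM handler that tells background work to wind down, so the process exits within the grace period. This is a generic sketch, not the reporter's fix (their hang was inside watchtower itself, so removing the library was the actual remedy):

```python
import os
import signal
import threading

# Event that long-running background work (e.g. a log-shipping thread)
# should poll so it can stop promptly when the pod is asked to terminate.
shutdown = threading.Event()

def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM first; a process still alive after
    # terminationGracePeriodSeconds is SIGKILLed (exit code 137).
    shutdown.set()

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate Kubernetes sending SIGTERM to this process.
os.kill(os.getpid(), signal.SIGTERM)
assert shutdown.is_set()  # background work now knows to wind down
```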

I am still curious, though: should Skaffold status checks fail because of issues in terminating pods? Even though there was an underlying problem in our application, it seems it should not matter whether the old pods exit gracefully. Only the health of the new pods should matter; in our case the new pods were fine, and it would have seemed reasonable for Skaffold to deem the deployment stable.

@aaron-prindle aaron-prindle added needs-reproduction needs reproduction from the maintainers to validate the issue is truly a skaffold bug area/status-check area/deploy kind/bug Something isn't working kind/usability area/docs triage/discuss Items for discussion labels Oct 25, 2021
@aaron-prindle
Contributor

aaron-prindle commented Oct 25, 2021

I was not able to reproduce this issue with what I believe was a similar setup. Can you try the latest version of skaffold, v1.33.1 (https://github.com/GoogleContainerTools/skaffold/releases), and see if you can still reproduce it?

Also, are you explicitly setting a skaffold.dev/run-id when doing the skaffold deploy? Currently a skaffold deploy should not have any reference to a previous deploy: skaffold creates a new run-id on each run, so no deploy state should be carried over.

@ynouri
Author

ynouri commented Oct 25, 2021

Also, are you explicitly setting a skaffold.dev/run-id when doing the skaffold deploy? Currently it should not be the case that a skaffold deploy would have any reference to a previous deploy as skaffold creates a new run-id on each run, no deploy state should be carried over.

Yes, you are right, we are! We are using the following environment variable during skaffold deploy:

SKAFFOLD_LABEL: skaffold.dev/run-id=PREVENT_REDEPLOY

We recently started doing this as a workaround to #2745. Without this flag, pods were always restarted on deploy (because of a changing pod template hash), even though no changes happened in the original deployment definition.

@aaron-prindle
Contributor

aaron-prindle commented Oct 25, 2021

Ah, I see: the issue you are encountering is related to both deploys having the same value for skaffold.dev/run-id. This run-id tells skaffold which resources to monitor (skaffold adds the label to resources it deploys and looks for it when status-checking, etc.), which is why the subsequent deploy monitors the previous deploy's resources. @briandealwis, are you familiar with any other workarounds for #2745 that might work here and not have this issue? Is there a possible workaround for the issue @ynouri mentioned:

Without this flag, pods were always restarted on deploy (because of a changing pod template hash), even though no changes happened in the original deployment definition.
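To make the mechanism concrete: status checking matches resources on the skaffold.dev/run-id label, so pinning it to a fixed value makes a new deploy monitor the previous deploy's pods as well. A sketch of the label on a deployed resource's metadata; the run-id value shown is illustrative (normally a fresh UUID per run):

```yaml
# Sketch of the label skaffold adds to resources it deploys.
# With SKAFFOLD_LABEL pinning it to PREVENT_REDEPLOY, old and new
# resources share the same run-id and are status-checked together.
metadata:
  labels:
    app: my-service
    skaffold.dev/run-id: PREVENT_REDEPLOY   # normally a fresh UUID per run
```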

@briandealwis
Member

@ynouri there have been a lot of changes to Skaffold's status checking since 1.22.0, so I'd urge you to update. The fix in #6370 (in v1.30.0) seems particularly pertinent here.

@nkubala nkubala added priority/p2 May take a couple of releases and removed triage/discuss Items for discussion labels Oct 29, 2021
@ynouri
Author

ynouri commented Nov 11, 2021

@briandealwis @aaron-prindle,

We just encountered the same issue again with a different deployment (not the Python deployment, which we had worked around by removing the watchtower library that was preventing quick shutdown of the pods).

This time we tried @briandealwis's suggestion to update to the latest version and it fixed our skaffold deploy.

Thank you for your help!
