Configurable probe termination grace periods #2241

ehashman · 2021-01-07T23:28:12Z

See #2238. Straw proposal to be discussed at the next SIG Node meeting.

/sig node
/cc @smarterclayton @derekwaynecarr

SergeyKanzhelev · 2021-01-08T00:18:22Z

/cc: @matthyx

matthyx · 2021-01-08T06:17:00Z

Thanks for the cc, @SergeyKanzhelev!
/lgtm
Maybe Tim would like to comment on the API change too?
/cc @thockin

ehashman · 2021-01-08T18:23:25Z

/retest

dims · 2021-01-11T11:51:42Z

/hold (plenty of feedback needs to be rolled in)

ehashman · 2021-01-13T01:56:22Z

I think I've addressed all comments; please take another look, @matthyx @dims @thockin @logicalhan.

I will remove the hold since the LGTM has been removed :)

/hold cancel

derekwaynecarr · 2021-01-13T18:21:11Z

Will take the review for SIG Node.
/assign

SergeyKanzhelev · 2021-01-21T21:13:36Z

keps/sig-node/2238-liveness-probe-grace-period/README.md

+> open connections), but when the controller wedges it takes an hour to
+> resolve, which is the worst possible outcome (the process wedges immediately,
+> but kube has to wait the full hour to kill it, defeating the goal of
+> liveness). [comment](https://github.com/kubernetes/kubernetes/issues/64715#issuecomment-756201502)


Is a readiness probe helpful here? In theory, a failing readiness probe changes the status of the pod so that it is taken out of the load balancer. In the non-ready state it can then drain its connections while another ready pod is scheduled.

If this is not the case, maybe the text can explain why. This has always been my understanding of the difference between readiness and liveness probes.

cc @smarterclayton, since this example came from his comment. I think we specifically want a liveness probe, not a readiness probe, for this use case.

@SergeyKanzhelev use of a readiness probe is orthogonal: the primary issue is that a failed liveness or startup probe should not have to wait out the normal graceful termination period.

ehashman · 2021-01-27T22:30:18Z

/label tide/merge-method-squash

SergeyKanzhelev · 2021-01-27T22:49:48Z

keps/sig-node/2238-liveness-probe-grace-period/README.md

+`terminationGracePeriodSeconds` value.
+
+This change would apply to [liveness and startup probes][probes1], but **not**
+readiness probes, which [use a different code path][probes2].


I believe @matthyx is changing this as we speak.

The readiness probe cannot trigger a container stop.
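
The scoping discussed in this thread can be illustrated with a small Go sketch. It is only an illustration of the idea under assumptions, not the KEP's implementation: `ProbeKind`, `probeSpec`, and `killGracePeriod` are hypothetical names invented here, and the probe-level field is modelled as an optional value that falls back to the pod-level `terminationGracePeriodSeconds` when unset, so existing behaviour is unchanged.

```go
package main

import "fmt"

// ProbeKind names the three probe types discussed in the thread.
type ProbeKind int

const (
	Liveness ProbeKind = iota
	Startup
	Readiness
)

// probeSpec is a hypothetical stand-in for a probe definition. Only the
// optional probe-level grace period matters for this sketch.
type probeSpec struct {
	Kind                          ProbeKind
	TerminationGracePeriodSeconds *int64 // nil means "not set"
}

// killGracePeriod returns the grace period (in seconds) to use when the given
// probe fails. The boolean is false when a failure of that probe kind does
// not stop the container at all, which is the case for readiness probes.
func killGracePeriod(podGrace int64, p probeSpec) (int64, bool) {
	if p.Kind == Readiness {
		return 0, false
	}
	if p.TerminationGracePeriodSeconds != nil {
		return *p.TerminationGracePeriodSeconds, true
	}
	// No probe-level value: keep the pod-level grace period.
	return podGrace, true
}

func main() {
	short := int64(10)
	probes := []probeSpec{
		{Kind: Liveness, TerminationGracePeriodSeconds: &short},
		{Kind: Startup},
		{Kind: Readiness, TerminationGracePeriodSeconds: &short},
	}
	for _, p := range probes {
		grace, kills := killGracePeriod(3600, p)
		fmt.Println(grace, kills)
	}
}
```

With a pod-level grace period of one hour, as in the quoted use case, only the liveness probe with an explicit value gets the shorter period; the startup probe without one keeps the hour, and the readiness probe never triggers a kill.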

SergeyKanzhelev · 2021-01-27T22:52:30Z

keps/sig-node/2238-liveness-probe-grace-period/README.md

+### Goals
+
+- Liveness probes can have a configurable timeout on failure.
+- Existing behaviour is not changed.


Is there any chance of a race between termination on node shutdown and a liveness probe failure? Say the liveness probe is set up to be extremely sensitive to the termination signal that the pod receives. The pod starts terminating because it received a kill signal, the liveness probe fails immediately, and the pod takes on the liveness probe's grace period override instead of the regular one. The grace period would then differ depending on whether the liveness probe happened to run between the start and the end of termination.

The termination paths for pod deletion/node drain and liveness probe failures are sort of separate (they use the same underlying function calls to kill the containers, but pod deletion always includes a hard override for the termination grace period). I don't think this would introduce any greater chance of a race condition being encountered.

Is it possible right now for a liveness probe to trigger termination of a specific container while the pod is already in the process of terminating? That's the only scenario I can think of that might pose an issue.

@matthyx is fixing this exact scenario, so it will no longer be possible: kubernetes/kubernetes#98571

Agreed, we should not probe for startup/liveness while a pod is undergoing graceful termination. FWIW, if you send multiple stop calls to a container today, both primary runtimes hold a lock that prevents the subsequent call from running or reducing the termination period. This is being fixed as well: https://github.com/kubernetes/kubernetes/pull/98507/files
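
The stop-call behaviour described in the last reply can be pictured with a toy Go model. This is a sketch under assumptions, not runtime code: the `container` type and its `Stop` method are hypothetical, and real runtimes hold their lock for the duration of the stop rather than returning immediately. The model only captures the observable effect that a later stop call cannot shorten a termination period already in progress.

```go
package main

import (
	"fmt"
	"sync"
)

// container is a toy model of a runtime's per-container stop state.
type container struct {
	mu       sync.Mutex
	stopping bool
	grace    int64 // grace period (seconds) of the stop call that is in effect
}

// Stop records a stop request and returns the grace period actually in
// effect. Once a stop is in progress, later calls cannot change it.
func (c *container) Stop(graceSeconds int64) int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.stopping {
		// A stop is already under way: the earlier grace period wins.
		return c.grace
	}
	c.stopping = true
	c.grace = graceSeconds
	return c.grace
}

func main() {
	c := &container{}
	fmt.Println(c.Stop(3600)) // first stop call, e.g. graceful pod termination
	fmt.Println(c.Stop(10))   // later, shorter stop call has no effect
}
```

In this model a shorter probe-triggered stop arriving after a long graceful stop has begun would still wait out the original period, which is why the thread treats the lock behaviour as something to fix alongside this proposal.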

johnbelamaric · 2021-02-02T21:56:31Z

PRR looks good, but I will wait for SIG Node approval so that I don't accidentally approve everything.

ehashman · 2021-02-08T16:48:21Z

@derekwaynecarr can you PTAL for final approval?

derekwaynecarr

/approve
/lgtm

derekwaynecarr · 2021-02-08T19:51:59Z

keps/sig-node/2238-liveness-probe-grace-period/README.md

+> open connections), but when the controller wedges it takes an hour to
+> resolve, which is the worst possible outcome (the process wedges immediately,
+> but kube has to wait the full hour to kill it, defeating the goal of
+> liveness). [comment](https://github.com/kubernetes/kubernetes/issues/64715#issuecomment-756201502)


@SergeyKanzhelev use of a readiness probe is orthogonal: the primary issue is that a failed liveness or startup probe should not have to wait out the normal graceful termination period.

derekwaynecarr · 2021-02-08T19:54:20Z

keps/sig-node/2238-liveness-probe-grace-period/README.md

+### Goals
+
+- Liveness probes can have a configurable timeout on failure.
+- Existing behaviour is not changed.


Agreed, we should not probe for startup/liveness while a pod is undergoing graceful termination. FWIW, if you send multiple stop calls to a container today, both primary runtimes hold a lock that prevents the subsequent call from running or reducing the termination period. This is being fixed as well: https://github.com/kubernetes/kubernetes/pull/98507/files

derekwaynecarr · 2021-02-08T19:55:04Z

keps/sig-node/2238-liveness-probe-grace-period/README.md

+`terminationGracePeriodSeconds` value.
+
+This change would apply to [liveness and startup probes][probes1], but **not**
+readiness probes, which [use a different code path][probes2].


The readiness probe cannot trigger a container stop.

ehashman · 2021-02-08T21:25:53Z

@johnbelamaric can you do the final approval now?

johnbelamaric · 2021-02-08T21:50:56Z

/approve

k8s-ci-robot · 2021-02-08T21:51:11Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, ehashman, johnbelamaric

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~keps/prod-readiness/OWNERS~~ [johnbelamaric]
~~keps/sig-node/OWNERS~~ [derekwaynecarr,johnbelamaric]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Draft provisional KEP for probe grace period

f44d43f

k8s-ci-robot requested review from derekwaynecarr and smarterclayton January 7, 2021 23:28

k8s-ci-robot requested a review from thockin January 8, 2021 06:17

k8s-ci-robot assigned matthyx Jan 8, 2021

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 8, 2021

dims reviewed Jan 8, 2021

dims reviewed Jan 8, 2021

dims reviewed Jan 8, 2021

dims reviewed Jan 8, 2021

logicalhan reviewed Jan 8, 2021

ehashman mentioned this pull request Jan 9, 2021

Add configurable grace period to probes #2238

Closed

12 tasks

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 11, 2021

Address community feedback

50c7ee6

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 13, 2021

ehashman changed the title ~~Draft provisional KEP for probe grace period~~ Provisional KEP for probe grace period Jan 13, 2021

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 13, 2021

k8s-ci-robot assigned derekwaynecarr Jan 13, 2021

ehashman mentioned this pull request Jan 13, 2021

add KEP: pull image priority #2217

Closed

smarterclayton reviewed Jan 13, 2021

Fill out the rest of the required sections for targeting alpha

2944ab9

ehashman commented Jan 13, 2021

ehashman assigned johnbelamaric Jan 13, 2021

johnbelamaric reviewed Jan 21, 2021

SergeyKanzhelev reviewed Jan 21, 2021

ehashman force-pushed the kep-2238 branch from ba8d7ee to d8496c0 Compare January 27, 2021 22:30

k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jan 27, 2021

SergeyKanzhelev reviewed Jan 27, 2021

johnbelamaric requested changes Jan 29, 2021

Update PRR requirements, respond to more feedback

6370827

ehashman force-pushed the kep-2238 branch from d8496c0 to 6370827 Compare January 29, 2021 21:42

ehashman changed the title ~~Provisional KEP for probe grace period~~ Configurable probe termination grace periods Feb 2, 2021

Add graduation criteria

74c393e

ehashman force-pushed the kep-2238 branch from ac51a2c to 74c393e Compare February 8, 2021 17:52

derekwaynecarr approved these changes Feb 8, 2021

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2021

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 8, 2021

k8s-ci-robot merged commit e9b84b1 into kubernetes:master Feb 8, 2021

k8s-ci-robot added this to the v1.21 milestone Feb 8, 2021

palnabarun mentioned this pull request Jun 12, 2022

canonical kep number json field palnabarun/enhancements#1

Closed

palnabarun mentioned this pull request Jul 26, 2022

KEP Manifest Generator palnabarun/enhancements#2

Closed
