Configurable probe termination grace periods #2241
Conversation
/cc: @matthyx
Thanks for the cc @SergeyKanzhelev !
/retest
/hold (plenty of feedback needs to be rolled in)
I think I've addressed all comments, please take another look @matthyx @dims @thockin @logicalhan. I will remove the hold since the LGTM has been removed :)
/hold cancel
Will take review for sig-node.
> open connections), but when the controller wedges it takes an hour to
> resolve, which is the worst possible outcome (the process wedges immediately,
> but kube has to wait the full hour to kill it, defeating the goal of
> liveness). [comment](https://github.com/kubernetes/kubernetes/issues/64715#issuecomment-756201502)
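To make the failure mode quoted above concrete, here is a hypothetical manifest (names and image are illustrative, not from the KEP) where a long pod-level grace period interacts badly with a liveness probe:

```yaml
# Illustrative only: a controller pod that needs a long drain window.
apiVersion: v1
kind: Pod
metadata:
  name: wedgy-controller          # hypothetical name
spec:
  terminationGracePeriodSeconds: 3600   # one hour to drain open connections
  containers:
  - name: controller
    image: example.com/controller:latest   # hypothetical image
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
# Without this KEP, a liveness failure triggers a restart that also honors
# the 3600s grace period, so a wedged process lingers for the full hour.
```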
Is a readiness probe helpful here? Theoretically, a readiness probe can change the status of the pod so it is taken out of the LB. In the non-ready state it can drain its connections while another, ready pod is scheduled.
If this is not the case, maybe the "why" can be explained in the text. This was always my understanding of the difference between readiness and liveness probes.
cc @smarterclayton, this example was from his comment; I think that we specifically want a liveness probe and not a readiness probe for this use case.
@SergeyKanzhelev usage of a readiness probe is orthogonal, since the primary issue is that a failed liveness or startup probe should not require the normal graceful termination period.
/label tide/merge-method-squash
> `terminationGracePeriodSeconds` value.
>
> This change would apply to [liveness and startup probes][probes1], but **not**
> readiness probes, which [use a different code path][probes2].
This is being changed by @matthyx, I believe, as we speak.
The readiness probe cannot trigger a stop of the container.
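As a sketch of the proposal under discussion: a probe-level `terminationGracePeriodSeconds` on a liveness (or startup) probe would override the pod-level value when the probe fails, while readiness probes are unaffected. Field placement follows the KEP; the concrete values are illustrative:

```yaml
# Sketch of the proposed probe-level override (values illustrative).
spec:
  terminationGracePeriodSeconds: 3600   # still honored on normal pod deletion
  containers:
  - name: controller
    image: example.com/controller:latest   # hypothetical image
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
      terminationGracePeriodSeconds: 10   # used instead of 3600s on liveness failure
```

The key design point is that normal deletion and drain keep the long grace period, while a failed liveness probe, which already indicates a wedged process, kills the container quickly.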
> ### Goals
>
> - Liveness probes can have a configurable timeout on failure.
> - Existing behaviour is not changed.
Is there any chance of a race between termination on node shutdown and a liveness probe failure? Say the liveness probe is extremely sensitive to the termination signal the pod receives: the pod starts terminating because it received a kill signal, the liveness probe immediately fails, and instead of the regular grace period the pod assumes the liveness probe's override. So depending on whether the liveness probe fired in the window between termination starting and finishing, the grace period will differ.
The termination paths for pod deletion/node drain and liveness probe failures are somewhat separate (they use the same underlying function calls to kill the containers, but pod deletion always includes a hard override for the termination grace period). I don't think this would introduce any greater chance of hitting a race condition.
Is it currently possible for a liveness probe to trigger termination of a specific container while the pod is already terminating? That's the only scenario I can think of that might pose an issue.
@matthyx is fixing this exact scenario, so it will no longer be possible: kubernetes/kubernetes#98571
Agreed, we should not probe for startup/liveness when a pod is undergoing graceful termination. FWIW, if you send multiple stop calls to a container today, both primary runtimes hold a lock that prevents the subsequent call from running or from reducing the termination period. This is being fixed as well: https://github.com/kubernetes/kubernetes/pull/98507/files
PRR looks good, but I will wait on sig-node approval so as not to accidentally approve everything.
@derekwaynecarr can you PTAL for final approval?
/approve
/lgtm
@johnbelamaric can you do the final approval now?
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: derekwaynecarr, ehashman, johnbelamaric. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
See #2238. Straw proposal to be discussed at the next SIG Node meeting.
/sig node
/cc @smarterclayton @derekwaynecarr