Set Termination Grace Period to write_timeout for functions to allow them to complete during a scale down event. #869
Conversation
Configures the Pod's termination grace period in seconds from the write_timeout environment variable, when available. Otherwise it falls back to the Kubernetes default of 30 seconds. This change is important for allowing function containers to drain active connections without losing work in progress. Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>
Allow container registry to be overridden easily for local builds Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>
To test this patch see: How I test OpenFaaS changes with Kubernetes
Just one question about how we structure the code, but it otherwise looks good. If you don't want to split it into a helper, just let me know and I will approve the PR
@LucasRoesler I think you also suggested adding +/- 500ms to the default write timeout for jitter.
The additional 2s should prevent an issue where the grace period is exactly the same as the timeout, and kills the Pod before the remaining requests have completed. Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>
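The mapping described above can be sketched as a small Go function. This is an illustrative sketch, not the actual faas-netes code: the function name and fallback behavior are assumptions, but the logic follows the PR description (grace period derived from write_timeout, plus a 2s buffer, defaulting to the Kubernetes 30s when unset or unparsable).

```go
// Sketch of deriving terminationGracePeriodSeconds from write_timeout.
package main

import (
	"fmt"
	"time"
)

// Kubernetes' default termination grace period, in seconds.
const defaultGracePeriod = int64(30)

// gracePeriodSeconds is a hypothetical helper: writeTimeout is the raw
// value of the write_timeout environment variable, e.g. "1m30s".
func gracePeriodSeconds(writeTimeout string) int64 {
	if writeTimeout == "" {
		return defaultGracePeriod
	}
	d, err := time.ParseDuration(writeTimeout)
	if err != nil {
		return defaultGracePeriod
	}
	// Add 2s so the grace period is never exactly equal to the timeout,
	// which could kill the Pod before in-flight requests have completed.
	return int64((d + 2*time.Second).Seconds())
}

func main() {
	fmt.Println(gracePeriodSeconds("1m30s")) // 92
	fmt.Println(gracePeriodSeconds(""))      // 30
}
```

The returned value would then be assigned to the Pod spec's TerminationGracePeriodSeconds field when the function is deployed.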
@kevin-lindsay-1 would you like to take this for a spin? You may need to delete your functions to try this.
@alexellis works well for me. My testing procedure used Tilt to avoid needing to push the image to a registry. For the sake of anyone interested, here's how I did it:
# copied a `gateway-dep.yaml` from an existing deployment
k8s_yaml('./gateway-dep.yaml')
# build the faas-netes image
docker_build('faas-netes', '.')
# hack to copy over the .git folder
# https://github.com/tilt-dev/tilt/issues/2169
local_resource(
'hack-copy_git',
cmd='cp -R .git .newgit'
)
# deploy the gateway in kubernetes, tilt detects the newly built image
k8s_resource(
'gateway',
resource_deps=[
'hack-copy_git'
]
)
For anyone who might want to use this as an example, please note that Tilt recently added a
💪 Thank you Kevin
This was missed from #869 and fixes the controller so that it updates the termination grace period for functions. The problem would be that if a user updated the write_timeout variable used to compute the grace period, it would be ignored. The operator uses common code for deploy/update, so does not need a separate change. Thanks to @kevin-lindsay-1 for doing exploratory testing to find this. Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>
Description
Set Termination Grace Period to write_timeout for functions to allow them to complete during a scale-down event.
Motivation and Context
Prior to this change, a function could only drain for 30 seconds before Kubernetes would forcibly kill it. After this change, in-flight function invocations can drain according to the write_timeout environment variable.
For example, a function whose requests ran for 1m30s would previously have failed to complete its in-flight requests.
Fixes #853 #637
How Has This Been Tested?
I deployed OpenFaaS with global timeouts of 2m as per the "Extended timeout" tutorial in the docs.
Then I deployed the go-long function with a 1m wait and a 2m write_timeout/exec_timeout.
Then I used hey with an extended -t (timeout flag) to schedule 5 requests.
At this point I scaled down the function with kubectl scale and monitored the logs.
Normally, the default of 30 seconds would have caused the functions to exit, but instead they stayed running and the Pod remained in a Terminating status.
Types of changes
Checklist:
Commits are signed off with git commit -s
Also related to this change is a watchdog update to allow the Pod to exit early, if all in-flight requests have completed:
openfaas/of-watchdog@31e1d32