Set Termination Grace Period to write_timeout for functions to allow them to complete during a scale down event. #869

alexellis · 2021-11-04T11:17:19Z

Description

Set Termination Grace Period to write_timeout for functions to allow them to complete during a scale down event.

Motivation and Context

Prior to this change, a function could only drain for 30 seconds, after which Kubernetes would forcibly kill it. After this change, in-flight function invocations can drain for as long as the write_timeout environment variable allows.

For example, this function would have failed to complete in-flight requests that ran for 1m30s.

functions:
  go-long:
    lang: golang-middleware
    handler: ./go-long
    image: alexellis2/graceful:0.2.1-newest
    environment:
      write_timeout: 2m
      read_timeout: 2m
      exec_timeout: 2m
      handler_wait_duration: 1m30s
      healthcheck_interval: 5s
    annotations:
      topic: "pipeline.subscription"
    labels:
      com.openfaas.scale.min: 1
      com.openfaas.scale.max: 1

Fixes #853 #637

How Has This Been Tested?

I deployed OpenFaaS with global timeouts of 2m, as per the "Extended timeout" tutorial in the docs.

Then I deployed the go-long function with a 1m wait and 2m max write_timeout/exec_timeout.

Then I used hey and an extended -t (timeout flag) to schedule 5 requests.

At this point I scaled down the function with kubectl scale and monitored the logs.

Normally, the default of 30 seconds would have caused the functions to exit, but instead they stayed running and the pod remained in a terminating status.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I've read the CONTRIBUTION guide
I have signed-off my commits with git commit -s
I have added tests to cover my changes.
All new and existing tests passed.

Also related to this change is a watchdog update to allow the Pod to exit early, if all in-flight requests have completed:

openfaas/of-watchdog@31e1d32
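For readers unfamiliar with the "exit early" behaviour, here is a minimal sketch of the general idea using only Go's standard library. It is not the of-watchdog implementation: the names drain and drainDemo are hypothetical, and the sketch assumes a plain net/http server. http.Server.Shutdown stops accepting new connections and returns as soon as in-flight requests have finished, or when the context deadline (the grace period) expires, whichever comes first.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// drain waits for in-flight requests to finish, but never longer than grace.
// It returns early once every active connection has become idle.
func drain(srv *http.Server, grace time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()
	return srv.Shutdown(ctx)
}

// drainDemo (hypothetical helper) starts a local server whose handler takes
// `work` to respond, sends one request, then drains with the given grace
// period. It reports how long draining took and whether the request succeeded.
func drainDemo(work, grace time.Duration) (time.Duration, bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, false, err
	}

	started := make(chan struct{})
	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		close(started) // signal that the request is now in flight
		time.Sleep(work)
		fmt.Fprint(w, "done")
	})}
	go srv.Serve(ln)

	result := make(chan bool, 1)
	go func() {
		resp, err := http.Get("http://" + ln.Addr().String())
		if err != nil {
			result <- false
			return
		}
		resp.Body.Close()
		result <- resp.StatusCode == http.StatusOK
	}()

	<-started
	begin := time.Now()
	err = drain(srv, grace)
	return time.Since(begin), <-result, err
}

func main() {
	elapsed, ok, err := drainDemo(200*time.Millisecond, 5*time.Second)
	fmt.Println("drained in", elapsed.Round(100*time.Millisecond), "request ok:", ok, "err:", err)
}
```

In this sketch the process is free to exit well before the 5s grace period, because the only in-flight request finishes after roughly 200ms.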

Configures the Pod's termination grace period in seconds to the write_timeout environment-variable, when available. Otherwise, it's set to the default for Kubernetes, which is 30 seconds. This change is important for allowing function containers to drain active connections, without losing work in progress. Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>

Allow container registry to be overridden easily for local builds Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>

alexellis · 2021-11-04T11:18:05Z

To test this patch see: How I test OpenFaaS changes with Kubernetes

pkg/controller/deployment.go

LucasRoesler

Just one question about how we structure the code, but it otherwise looks good. If you don't want to split it into a helper, just let me know and I will approve the PR.

alexellis · 2021-11-04T14:04:56Z

@LucasRoesler I think you also suggested adding 500ms +/- to the default write timeout for jitter.

The additional 2s should prevent an issue where the grace period is exactly the same as the timeout, and kills the Pod before the remaining requests have completed. Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>
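To make the calculation concrete, here is a minimal sketch of how a grace period could be derived from write_timeout. It is an illustration only, not the code in pkg/controller/deployment.go: the function name gracePeriodSeconds is hypothetical, and it assumes that the extra 2s is applied only when write_timeout is present and valid, with the plain Kubernetes default of 30s used otherwise.

```go
package main

import (
	"fmt"
	"time"
)

// gracePeriodSeconds (hypothetical name) derives a termination grace period
// from a function's environment variables.
func gracePeriodSeconds(env map[string]string) int64 {
	const defaultGrace = 30 * time.Second // Kubernetes default
	const buffer = 2 * time.Second        // assumption: only added to a valid write_timeout

	v, ok := env["write_timeout"]
	if !ok || v == "" {
		return int64(defaultGrace.Seconds())
	}

	d, err := time.ParseDuration(v)
	if err != nil || d <= 0 {
		// Fall back to the default when the value cannot be parsed.
		return int64(defaultGrace.Seconds())
	}

	// The buffer keeps the grace period from being exactly equal to the timeout.
	return int64((d + buffer).Seconds())
}

func main() {
	fmt.Println(gracePeriodSeconds(map[string]string{"write_timeout": "2m"})) // 122
	fmt.Println(gracePeriodSeconds(nil))                                      // 30
}
```

With the go-long example above, a write_timeout of 2m would give a 122s grace period under these assumptions, which comfortably covers a 1m30s invocation.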

alexellis · 2021-11-04T14:57:21Z

@kevin-lindsay-1 would you like to take this for a spin?

You may need to delete your functions to try this.

kevin-lindsay-1 · 2021-11-04T16:39:23Z

@alexellis works good for me.

My testing procedure used Tilt to avoid needing to push the image to a registry.

For the sake of anyone interested, here's how I did it:

1. Copy your existing deployment's YAML from k8s and save it to a file named gateway-dep.yaml.
2. Change the image for the faas-netes container to simply be faas-netes; Tilt will figure it out.
3. Put the file in the faas-netes project root.
4. Create a Tiltfile with the following contents:

# copied a `gateway-dep.yaml` from an existing deployment
k8s_yaml('./gateway-dep.yaml')

# build the faas-netes image
docker_build('faas-netes', '.')

# hack to copy over the .git folder
# https://github.com/tilt-dev/tilt/issues/2169
local_resource(
    'hack-copy_git',
    cmd='cp -R .git .newgit'
)

# deploy the gateway in kubernetes, tilt detects the newly built image
k8s_resource(
    'gateway',
    resource_deps=[
        'hack-copy_git'
    ]
)

5. Update the Dockerfile to RUN mv .newgit .git after the COPY . . line.
6. Run tilt up.
7. Delete the function(s) and redeploy them to see the changes.

For anyone who might want to use this as an example, please note that tilt recently added a kubectl_build command which uses buildkit inside of k8s, so you don't have to duplicate images on your machine. Haven't used it yet, but probably will.

alexellis · 2021-11-04T16:44:01Z

> @alexellis works good for me.

💪

Thank you Kevin

@kevin-lindsay-1

This was missed from #869 and fixes the controller so that it updates the termination grace period for functions. The problem would be that if a user updated the write_timeout variable used to compute the grace period, it would be ignored. The operator uses common code for deploy/update, so does not need a separate change. Thanks to @kevin-lindsay-1 for doing exploratory testing to find this. Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>
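The update-path problem described in this commit message can be pictured with a small sketch. It does not reproduce the controller code: podSpec is a hypothetical stand-in that mirrors only the TerminationGracePeriodSeconds field of a Kubernetes PodSpec, and syncGracePeriod is an illustrative name. The point is simply that an update must copy the freshly computed value onto the existing spec rather than leave the old one in place.

```go
package main

import "fmt"

// podSpec is a hypothetical stand-in for the one PodSpec field that matters here.
type podSpec struct {
	TerminationGracePeriodSeconds *int64
}

// syncGracePeriod copies the desired grace period onto an existing spec and
// reports whether anything changed.
func syncGracePeriod(existing *podSpec, desired int64) bool {
	current := existing.TerminationGracePeriodSeconds
	if current != nil && *current == desired {
		return false // already up to date
	}
	existing.TerminationGracePeriodSeconds = &desired
	return true
}

func main() {
	old := int64(32) // e.g. derived from an earlier write_timeout of 30s
	spec := &podSpec{TerminationGracePeriodSeconds: &old}

	// The user raises write_timeout to 2m, so the desired value becomes 122.
	changed := syncGracePeriod(spec, 122)
	fmt.Println(changed, *spec.TerminationGracePeriodSeconds)
}
```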

alexellis added 2 commits November 3, 2021 10:18

Customise makefile

eb6ac46

LucasRoesler reviewed Nov 4, 2021

LucasRoesler requested changes Nov 4, 2021

Add 2s jitter to termination grace period

4e3efd4

alexellis self-assigned this Nov 4, 2021

alexellis requested a review from LucasRoesler November 4, 2021 16:44

LucasRoesler approved these changes Nov 4, 2021

alexellis merged commit f0a81e6 into master Nov 4, 2021

alexellis deleted the openfaasltd/grace-period-from-write-timeout branch November 4, 2021 17:10

alexellis mentioned this pull request Nov 5, 2021

Set termination grace period upon function updates in the controller #870

Merged

3 tasks

alexellis mentioned this pull request Nov 5, 2021

Add post on long-running jobs and functions openfaas/openfaas.github.io#257

Merged

4 tasks
