To allow functions to complete a potentially long-running function call during scale-down, faas-netes should modify terminationGracePeriodSeconds, which defaults to 30 seconds. This is the time between the SIGTERM signal and the final SIGKILL.
Expected Behaviour
If a function call has started and a pod is marked as Terminating, I expect the call to complete before the pod is terminated. This is already handled in the watchdogs, but as far as I know, Kubernetes will kill the pod after terminationGracePeriodSeconds (default 30 seconds), regardless of what is happening inside the pod.
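For reference, the draining behaviour I mean looks roughly like the Go sketch below (not the actual of-watchdog source, just an illustration of the pattern; the 600-second drain timeout is an assumption matching the function's write_timeout). The point is that however long this drain takes, Kubernetes still sends SIGKILL once terminationGracePeriodSeconds has elapsed.

package main

// Sketch only: drain in-flight requests on SIGTERM. Kubernetes still sends
// SIGKILL once terminationGracePeriodSeconds (default 30s) has passed.
import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

	go func() {
		sig := make(chan os.Signal, 1)
		signal.Notify(sig, syscall.SIGTERM)
		<-sig

		// Stop accepting new connections and wait for in-flight requests.
		// The 600s here is an assumption matching write_timeout=600.
		ctx, cancel := context.WithTimeout(context.Background(), 600*time.Second)
		defer cancel()
		if err := srv.Shutdown(ctx); err != nil {
			log.Printf("graceful shutdown interrupted: %v", err)
		}
	}()

	if err := srv.ListenAndServe(); err != http.ErrServerClosed {
		log.Fatal(err)
	}
}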
Current Behaviour
The function call is terminated before it has finished.
Possible Solution
Add a new option to faas-netes to set terminationGracePeriodSeconds, or set it to the value of faasnetes.writeTimeout. I suggest the former. My team would be happy to contribute a fix.
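As a rough illustration of what such an option would set, here is a Go sketch against the k8s.io/api types (not actual faas-netes code; the helper name and the idea of deriving the value from the write timeout are assumptions):

package main

import (
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
)

// setTerminationGracePeriod is a hypothetical helper: it sets the pod-level
// grace period on a function Deployment's pod template. A nil value means
// the Kubernetes default of 30 seconds.
func setTerminationGracePeriod(d *appsv1.Deployment, grace time.Duration) {
	seconds := int64(grace.Seconds())
	d.Spec.Template.Spec.TerminationGracePeriodSeconds = &seconds
}

func main() {
	d := &appsv1.Deployment{}
	// Assumption: reuse the function's write timeout as the grace period.
	setTerminationGracePeriod(d, 600*time.Second)
	fmt.Println(*d.Spec.Template.Spec.TerminationGracePeriodSeconds) // prints 600
}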
Steps to Reproduce (for bugs)
Set up OpenFaaS on k8s. We want to create a function based on e.g. the python3 template, with read_timeout=600, write_timeout=600 and max_inflight=1, that contains the code
import time

def handler(req):
    time.sleep(300)  # 5 minutes
If we then perform 3 async function calls, only one will start. We can then scale the function up to 3 replicas and all 3 calls will run. If we now scale it back down, two of the function pods will be terminated before their calls have finished.
Context
We at Cognite are trying to handle auto-scaling based on metrics other than just the number of function calls. We have seen memory-heavy function calls kill the pods, so the tip from @alexellis was to control max_inflight on each function. If we have max_inflight=5, the 6th function call returns a 429 to the queue worker. We have added a metric that triggers an upscale if there are any 429s over the past 30 seconds or so, which works very well. However, when it scales down after those 30 seconds (no more 429s once it has scaled up), we end up with a lot of errors because the pods are terminated.
Your Environment
GKE and latest helm template as of today. Will provide more if necessary.