Scale controller and long running activity functions issue #353

Closed
alifintcev opened this issue Jun 20, 2018 · 4 comments

@alifintcev

Hi,

Please check the following scenario:

  1. Func_A is triggered via a QueueTrigger. Func_A starts the orchestration function Func_orch_A.
  2. Func_orch_A fans out to about 20 sub-orchestrations, Func_sub_orch_A.
  3. Every Func_sub_orch_A fans out to up to 1000 activity functions, Func_act_A.

So far it looks good: the environment gradually scales out to 40-50 instances and processes all Func_act_A invocations as expected. It takes about 1 hour to complete.
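
Roughly, the shape of this workflow looks like the following. This is only a minimal sketch using the Python Durable Functions programming model; the Func_* names come from the scenario above, everything else (inputs, batching) is illustrative:

```python
import azure.durable_functions as df

# Sketch of Func_orch_A: split the input into ~20 batches and fan out
# to sub-orchestrations, then wait for all of them to finish.
def func_orch_a(context: df.DurableOrchestrationContext):
    batches = context.get_input()  # assumed: a list of ~20 lists of work items
    sub_tasks = [
        context.call_sub_orchestrator("Func_sub_orch_A", batch)
        for batch in batches
    ]
    yield context.task_all(sub_tasks)

# Sketch of Func_sub_orch_A: fan out to up to 1000 Func_act_A activity calls.
def func_sub_orch_a(context: df.DurableOrchestrationContext):
    items = context.get_input()
    activity_tasks = [context.call_activity("Func_act_A", item) for item in items]
    yield context.task_all(activity_tasks)

# In a real app each orchestrator lives in its own function folder and is
# exported as `main = df.Orchestrator.create(<generator>)`.
main = df.Orchestrator.create(func_orch_a)
```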

While the scenario above is running, another function, Func_B, is triggered via a queue:

  1. Func_B starts the orchestration Func_orch_B.
  2. Func_orch_B starts the activity function Func_act_B. Func_act_B usually takes 2-3 hours to complete.
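
The second workflow is much simpler in shape; under the same assumptions it is roughly:

```python
import azure.durable_functions as df

# Sketch of Func_orch_B: a single activity call that runs for 2-3 hours.
def func_orch_b(context: df.DurableOrchestrationContext):
    result = yield context.call_activity("Func_act_B", context.get_input())
    return result

main = df.Orchestrator.create(func_orch_b)
```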

At this point the scale controller runs Func_act_B on one of the instances (let's say INST_25) that was created earlier for Func_act_A. That's OK.

But when Func_A completes, the scale controller starts scaling down.
It turns off running instances:

Stopping host INST_50
Stopping host INST_49
Stopping host INST_48
...
Stopping host INST_25 

At this point it does not matter that the instance has a stable CPU load of 90%. The scale controller just stops the host in the middle of a running Func_act_B execution.

Then the environment detects that Func_act_B needs to be restarted and starts it on INST_15, then on INST_7, etc.
So Func_act_B keeps restarting until it lands on a stateful instance (probably always the same instance per function app).

As a result it consumes a lot more resources and time to complete.

So I think there is an issue with the scale controller: it stops instances without taking into account the functions running on them.

Thanks,
Alex

@SimonLuckenuik

Some comments:

@alifintcev
Author

Hi Simon,

My understanding was that the 10-minute limitation does not apply to the Durable scenario. In one of my apps I have ActivityTrigger functions that run for several hours, and it works fine.

I'm OK if it's a strict limitation, but I can see that it is not enforced. So I think this case (activity function restarts) should be documented.

Thanks,
Alex

@SimonLuckenuik

  • Each Activity from your Orchestration is still an Azure Function and has the same execution restrictions. The complete workflow can "last longer" because the whole Orchestration is persisted between the different triggers (and is not executing while waiting for triggers), so the lifetime is longer, but the actual execution time of each part of the orchestration still follows the Azure Functions rules.

  • I know there were bugs related to the execution timeout at some point, which might explain why the > 10 minute execution time limit was not enforced.
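
For reference, that execution limit is the standard functionTimeout host setting; on the Consumption plan it defaults to 5 minutes and cannot be raised past 10 minutes. A host.json sketch (v2 schema shown, adjust for your host version):

```json
{
  "version": "2.0",
  "functionTimeout": "00:10:00"
}
```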

@cgillum
Collaborator

cgillum commented Jun 21, 2018

Simon is correct. The scale controller was not designed to support long-running stateless function execution of any kind (whether Durable Functions, timer triggers, queue triggers, etc.). The fact that you're getting 2-3 hour executions is unfortunately a bug in the function host. I don't know the current status of that bug, but I believe it applies to precompiled functions (and fixing it now would likely cause too much grief for people who've started depending on it).

In the possibly near future, we're considering supporting long-running executions for Azure Functions. That would allow your functions to execute for long periods of time and not be killed so aggressively by the scale controller. However, there would still be a chance that the VM you are running on gets reclaimed by Azure for maintenance purposes, so it would be good to write your code defensively to handle those cases regardless of timeout constraints. Until these changes happen to the Azure Functions consumption plan infrastructure, however, what you're observing is unfortunately the expected behavior.
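
One way to write that kind of defensive code (a sketch only; Func_act_B_chunk and the cursor shape are hypothetical, not an official pattern) is to let the orchestrator drive the long job as a sequence of short activity calls, so each individual execution stays well under the timeout:

```python
import azure.durable_functions as df

# Sketch: replace one 2-3 hour Func_act_B execution with resumable chunks.
def func_orch_b_chunked(context: df.DurableOrchestrationContext):
    cursor = context.get_input()  # e.g. {"offset": 0} or a continuation token
    while cursor is not None:
        # Each activity does a bounded slice of work (minutes, not hours) and
        # returns the next cursor, or None when everything has been processed.
        cursor = yield context.call_activity("Func_act_B_chunk", cursor)
        # For very long jobs, context.continue_as_new(cursor) would keep the
        # orchestration history from growing without bound.

main = df.Orchestrator.create(func_orch_b_chunked)
```

Each chunk then fits inside the normal timeout, and a host shutdown only repeats the chunk that was in flight rather than the whole 2-3 hours of work.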

If you have more thoughts or concerns about long-running executions in general, I recommend raising them in the Azure Functions GitHub repo so you can get better feedback (you may even find some existing issues on this topic).

cgillum closed this as completed on Jun 21, 2018