Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Orchestration ID reuse policies #42

Open
cgillum opened this issue Oct 28, 2023 · 22 comments
Open

Orchestration ID reuse policies #42

cgillum opened this issue Oct 28, 2023 · 22 comments
Assignees

Comments

@cgillum
Copy link
Member

cgillum commented Oct 28, 2023

This idea comes from dapr/dapr#7101, as it relates to reusing existing workflow IDs.

When scheduling the creation of a new orchestration, the following options should be available:

  • TerminateIfExists - Terminates any existing orchestrations with the same instance ID and then schedules a new instance as one atomic action, similar to on-demand ContinueAsNew.
  • SkipIfExists - If there is an existing orchestration already scheduled, then the scheduler does nothing.
  • ThrowIfExists - If there is an existing orchestration then the scheduler throws an exception (this is the current behavior).

As an optional extension of this, we can consider exposing an "Execution ID" or "Generation ID" property to the orchestration which can help distinguish different instantiations of the given instance ID.

@clintsinger
Copy link

My vote for having the extra ID to be able to identify different runs of the workflow in logs.

@olitomlinson
Copy link

olitomlinson commented Nov 23, 2023

@cgillum

Would it be reasonable to add a hint of : SKIP_IF_RUNNING_OR_COMPLETE_OR_PAUSED

An example, scenarios where workflows are being started via queue with at-least-once guarantees (which is quite common) it would be useful to have this hint because :

  • It would prevent a workflow being prematurely terminated & rescheduled if it was currently in a running state.
    • Restarting a running workflow may be wasteful and incur unwanted / unanticipated side-effects when repeated from the start.
  • It would prevent a workflow being rescheduled if it was in a complete state
    • Supporting idempotent design - Some business processes should never be repeated once they've ran to completion
  • It would prevent a workflow being rescheduled if it was in a paused state
    • Pausing a workflow would be a deliberate human action. Therefor automatically restarting it may cause undesirable side-effects.
  • However, it would allow a workflow in a failed state to be rescheduled, which could be desirable.

Unless I'm mistaken, It's my understanding that none of the 3 new reuse policies would accommodate the above scenario?

@clintsinger
Copy link

clintsinger commented Nov 23, 2023

The running state is exactly the one I want to interrupt for the scenario that I originally proposed. I agree that once it is complete then rescheduling is just a matter of starting a new instance.

The original definition of TerminateIfExists didn't care if it was just scheduled or already running. The workflow would be terminated. SkipIfExists wouldn't care either way either.

@olitomlinson
Copy link

olitomlinson commented Nov 23, 2023

Yes this is why I'm proposing an additional policy, so that it doesn't interfere with the needs of your use-case (which I think is TERMINATE_IF_EXISTS)

@clintsinger
Copy link

clintsinger commented Nov 23, 2023

So you are suggesting that one combines, for example, TERMINATE_IF_EXISTS and SKIP_IF_RUNNING_OR_COMPLETE?

As a way to differentiate what to do if it is just scheduled vs already running?

@clintsinger
Copy link

I am assuming this doesn't have much meaning for SKIP_IF_EXISTS and SKIP_IF_RUNNING_OR_COMPLETE since it would already be skipped from the first policy.

@olitomlinson
Copy link

olitomlinson commented Nov 24, 2023

This is how I interpret the current proposal :

Start = start/restart the orchestration
No-op = leave the orchestration in its current state (do not start/restart)

Doesn't Exist Running Completed Failed Paused
TERMINATE_IF_EXISTS Start Start Start Start Start
THROW_IF_EXISTS Start No-op No-op No-op No-op
SKIP_IF_EXISTS Start No-op No-op No-op No-op

Assuming my interpretation above is correct, I would like to see the following policy added SKIP_IF_RUNNING_OR_COMPLETE_OR_PAUSED

Doesn't Exist Running Completed Failed Paused
SKIP_IF_RUNNING_OR_COMPLETE_OR_PAUSED Start No-op No-op Start No-op

@kaibocai
Copy link
Member

Thanks @olitomlinson for the explanation on SKIP_IF_RUNNING_OR_COMPLETE_OR_PAUSED I think it makes sense to support idempotent design. @cgillum what do you think about this OPTION?

@cgillum
Copy link
Member Author

cgillum commented Nov 28, 2023

Catching up on this discussion. If I understand the proposal from @olitomlinson, the problem you're trying to solve is that you want to only reuse the orchestration ID if an existing instance is in a failed state. I think that makes sense. Would it just be "failed", or could it also include "Terminated" (and perhaps some future "Canceled" state)?

@clintsinger
Copy link

Wouldn't it be easier just to say RESTART_IF_FAILED (or RESCHEDULE_IF_FAILED)?

It appears to only be a rule for when the workflow is in a failed state.

@olitomlinson
Copy link

@cgillum I agree, it should be SKIP_IF_RUNNING_OR_COMPLETE_OR_PAUSED_OR_TERMINATED - I would bucket paused and terminate together as they are both deliberate human actions, which shouldn't be automatically restarted.

Although I think going down this path is going to lead to a trying to support a huge set of highly specific, Policies which isn't sustainable.

Suggestion

Divorce the ACTION (Terminate, Skip, Throw) from the applicable STATE/S, such that a user can dial in the exact requirement.

TERMINATE_IF_EXISTS would be expressed as...

WorkflowOptions { 
    ReusePolicy = new ReusePolicy(action: Action.Terminate, states: new[]{ State.All })
}

SKIP_IF_RUNNING_OR_COMPLETE_OR_PAUSED_OR_TERMINATED would be expressed as...

WorkflowOptions { 
    ReusePolicy  = new ReusePolicy(action: Action.Skip, states: new[]{ State.Running, State.Complete, State.Paused, State.Terminated})
}

@clintsinger
Copy link

I'd recommend using enum flags for the states field.

@cgillum
Copy link
Member Author

cgillum commented Nov 28, 2023

I'm in agreement with the more fine-grained policy. It introduces more conceptual complexity, IMO, but it's more flexible, precise, and better aligned with existing APIs in the .NET version of the Durable Task Framework.

@olitomlinson
Copy link

I'd recommend using enum flags for the states field.

I've never used Enum Flags so no first hand experience, but they look great for this use-case!

@kaibocai
Copy link
Member

kaibocai commented Nov 30, 2023

When rethinking this proposal, I found there are some cases that are not quite clear. For ex:

ReusePolicy  = new ReusePolicy(action: Action.Skip, states: new[]{ State.Pending})

For this option, we are creating a new instance with action SKIP and target status Pending, which means it will try to create a new instance, but if it finds there is an instance in Pending status then it will skip creating the new instance. What if we have an instance in Running status, should we skip creating the new instance or should we terminate the running instance and create the new one?

Summarize the confusion in below pic
image

@clintsinger @cgillum @olitomlinson

@olitomlinson
Copy link

First thoughts, the first two seem straight forward :

Instance exists -> skip -> status not in target status set = terminate existing instance and create new one
Instance exists -> throw -> status not in target status set = terminate existing instance and create new one

However I'm stuck on this one...

Instance exists -> terminate -> status not in target status set = ?????

@clintsinger
Copy link

If we think of the second condition as a "when/where" clause then perhaps we can simplify things a bit and just eliminate the "skip" condition since that could be the default for the other two if the predicate isn't matched.

Or some variation on that.

@clintsinger
Copy link

The "variation on that" could be that we keep the "throw" as the default path as it is today.

@cgillum
Copy link
Member Author

cgillum commented Nov 30, 2023

Does this pseudocode seem like the right behavior?

ok = not exists

if exists:
  if skip and status_matches:
    # log a warning
    ok = True
  if terminate and status_matches:
    # terminate the instance
    ok = True

if not ok:
   # throw

if not skip:
  # create the instance

@kaibocai
Copy link
Member

I think it should be

if not exists:
  # create instance
else:
  ok = false
  if skip and status_matches:
    # log a warning
    ok = True
  if terminate and status_matches:
    # terminate the instance
    ok = True
  
  if not ok:
    # throw
  if not skip:
    # create instance

The reason is that when customer create a instance using
ctx.createOrhcestration(instanceID, action.SKIP, [status.Pending, status.Runnning])
I interpret as they still want to create an instance if there is no instance exists.
So that is to say, if no instance exists, the behavior of createOrhcestration should always be creating a new instance no matter what options used here.

Another thing to point out for the above pseudocode (Chris's as well) is that if the customer chooses the action THROW then no matter what statuses they set, it will always Throw if an instance already exists. I just want to make sure this is what we want.

@cgillum

@cgillum
Copy link
Member Author

cgillum commented Nov 30, 2023

Yeah, it looks like my skip logic wasn't quite right, but I suppose that's what unit tests are for. :)

And I see your point about throw. Since it's the default behavior anyways, using it with status filters is meaningless. Does that suggest we could get by without an explicit throw action, since that's the default behavior anyways? It seems we only need to opt-into skip and terminate.

@joebowbeer
Copy link

joebowbeer commented Jun 19, 2024

Is this related to supporting a start-with-signal pattern, or is that a separate concern?

start-with-signal will send an event to a specific workflow instance and make sure that instance is running. In that order, to avoid a race.

If the reuse policy can be attached to a specific start request, then I imagine this pattern could be implemented in two separate requests (signal + start) as I described.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants