-
Notifications
You must be signed in to change notification settings - Fork 300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add orchestrator scheduled actions check when suspended #1024
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nits, but mostly looks good!
I know there's a "stress tests" project here that I believe is meant to catch race conditions like the bug you're fixing. I wonder if you can re-use that infrastructure to set up a test for suspend/resume. Do you might giving that a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nytian: Do you have ideas for any telemetry that would have helped identify this bug faster? If so, I wonder if we could add that to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this issue!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I sync'ed w/ Naiyuan and, after the fact, we determined that we could have identified this bug by paying attention to the TaskScheduledId
column of our telemetry. So I don't strictly think we need new logs anymore, just better team-wide awareness to pay attention to that source of information. Still, please get @bachuv's approval as well before merging.
Finally, @nytian, you wrote the following:
During the processingTaskCompleted, the result is sent back to the extension level, where SuspendAsync call is impossible to prevent the triggering of the next line. Thus, the extension will schedule a new activity and adds it to the action map which will cause the issue
Assuming I'm reading this correctly, I'm not certain that this statement is 100% accurate. In particular, DTFx.Core and the Extension run in the same thread so it's not quite right to say that "it's impossible [for DTFx] to prevent the triggering of the next line [of DF Extension user-code]." This is definitely possible so I think the explanation needs to be tweaked slightly :) . Let's discuss this offline later, it's just a nitpick
@davidmrdavid . Thanks for pointing out! I will update this part. |
This PR fixes issue #2519
When calling SuspendAsync URL, condition can happen that the orchestrator is suspended when there is already an action scheduled and added to the
orchestratorActionsMap
. However, our design will add any later event into a temporary queue during suspended time. Thus, this scheduled action will never be removed or completed and will be executed endless times until orchestrator is resumed. And this execution history will also be added to the History Table which will cause the non-deterministic error when orchestrator is resumed.The race condition occurs when an
ExecutionSuspended
message is processed afterTaskCompleted
. Additionally, there is a subsequent activity task pending scheduling after the completion of this task. During the processingTaskCompleted
, the result is sent back to the extension level, and thus, durable extension will go to the next line and then schedule a new task, adds it to the action map which will cause the issue.For debug, we can verify if a task has been executed multiple times by checking if column
TaskEventId
in Kusto orTaskScheduledId
in History Table. Each task should have a unique identifier in these fields.The fix of this PR is to guarantee
orchestratorActionsMap
is empty when orchestrator is suspended. Basically, we will check theorchestratorActionsMap
when suspending an orchestrator. IforchestratorActionsMap
is not null, the action will be saved to a temporary dictionary and will be added back to theorchestratorActionsMap
when the orchestrator is resumed.