
Add orchestrator scheduled actions check when suspended #1024

Merged
merged 3 commits into main on Jan 12, 2024

Conversation

@nytian (Collaborator) commented on Jan 5, 2024

This PR fixes issue #2519

When the SuspendAsync URL is called, a condition can arise where the orchestrator is suspended while an action has already been scheduled and added to the orchestratorActionsMap. However, by design, any later event is placed into a temporary queue while the orchestrator is suspended. As a result, this scheduled action is never removed or completed and is executed over and over until the orchestrator is resumed. Each of these executions is also written to the History Table, which causes a non-deterministic error when the orchestrator is resumed.

The race condition occurs when an ExecutionSuspended message is processed right after a TaskCompleted message, and a subsequent activity task is due to be scheduled after that completed task. While processing TaskCompleted, the result is sent back to the extension level, so the Durable extension advances to the next line of orchestrator code and schedules a new task, adding it to the action map, which causes the issue.
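
For illustration, here is a minimal sketch (not code from this PR) of an orchestration shape in which this race can occur; the orchestration class and task names are hypothetical:

```csharp
using System.Threading.Tasks;
using DurableTask.Core;

// Hypothetical two-step orchestration, for illustration only. If an ExecutionSuspended
// message is processed right after StepOne's TaskCompleted message, the orchestrator code
// still advances past the first await and calls ScheduleTask for StepTwo, which adds an
// action to orchestratorActionsMap while the instance is suspended.
public class TwoStepOrchestration : TaskOrchestration<string, string>
{
    public override async Task<string> RunTask(OrchestrationContext context, string input)
    {
        string first = await context.ScheduleTask<string>("StepOne", "1.0", input);

        // Suspension can land here: the next line still schedules StepTwo, and without the
        // fix that pending action would be executed repeatedly until the instance resumes.
        string second = await context.ScheduleTask<string>("StepTwo", "1.0", first);
        return second;
    }
}
```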

For debugging, we can verify whether a task has been executed multiple times by checking the TaskEventId column in Kusto or the TaskScheduledId column in the History Table. Each task should have a unique identifier in these fields.

The fix in this PR is to guarantee that orchestratorActionsMap is empty while the orchestrator is suspended. Basically, we check orchestratorActionsMap when suspending an orchestrator. If it is not empty, its actions are saved to a temporary dictionary and are added back to orchestratorActionsMap when the orchestrator is resumed.
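
A minimal sketch of that idea, assuming illustrative member names (the actual change is in src/DurableTask.Core/TaskOrchestrationContext.cs):

```csharp
using System.Collections.Generic;
using DurableTask.Core.Command; // OrchestratorAction lives here in DTFx.Core (assumed namespace)

// Illustrative sketch only; class, field, and method names are assumptions, not the PR's exact diff.
// Idea: park already-scheduled actions when ExecutionSuspended is handled, and put them back
// when ExecutionResumed is handled, so nothing is executed repeatedly while suspended.
public class SuspendedActionsSketch
{
    private readonly IDictionary<int, OrchestratorAction> orchestratorActionsMap =
        new Dictionary<int, OrchestratorAction>();

    // Temporary holding area for actions scheduled before suspension (assumed name).
    private readonly IDictionary<int, OrchestratorAction> suspendedActionsMap =
        new Dictionary<int, OrchestratorAction>();

    public void HandleExecutionSuspendedEvent()
    {
        foreach (KeyValuePair<int, OrchestratorAction> pair in this.orchestratorActionsMap)
        {
            this.suspendedActionsMap[pair.Key] = pair.Value;
        }

        // Guarantee orchestratorActionsMap is empty while the orchestrator is suspended.
        this.orchestratorActionsMap.Clear();
    }

    public void HandleExecutionResumedEvent()
    {
        foreach (KeyValuePair<int, OrchestratorAction> pair in this.suspendedActionsMap)
        {
            this.orchestratorActionsMap[pair.Key] = pair.Value;
        }

        this.suspendedActionsMap.Clear();
    }
}
```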

@davidmrdavid (Collaborator) left a comment

Some nits, but mostly looks good!

Inline review comment on src/DurableTask.Core/TaskOrchestrationContext.cs (outdated, resolved)
@davidmrdavid (Collaborator) commented on Jan 8, 2024

I know there's a "stress tests" project here that I believe is meant to catch race conditions like the bug you're fixing. I wonder if you can re-use that infrastructure to set up a test for suspend/resume. Would you mind giving that a look?

@davidmrdavid (Collaborator) left a comment

@nytian: Do you have ideas for any telemetry that would have helped identify this bug faster? If so, I wonder if we could add that to this PR.

Inline review comments on src/DurableTask.Core/TaskOrchestrationContext.cs (outdated, resolved)
@bachuv (Collaborator) left a comment

Thanks for fixing this issue!

@nytian requested review from davidmrdavid and bachuv on January 10, 2024 at 00:39
@davidmrdavid (Collaborator) left a comment

I sync'ed w/ Naiyuan and, after the fact, we determined that we could have identified this bug by paying attention to the TaskScheduledId column of our telemetry. So I don't strictly think we need new logs anymore, just better team-wide awareness to pay attention to that source of information. Still, please get @bachuv's approval as well before merging.

Finally, @nytian, you wrote the following:

During the processingTaskCompleted, the result is sent back to the extension level, where SuspendAsync call is impossible to prevent the triggering of the next line. Thus, the extension will schedule a new activity and adds it to the action map which will cause the issue

Assuming I'm reading this correctly, I'm not certain that this statement is 100% accurate. In particular, DTFx.Core and the Extension run in the same thread so it's not quite right to say that "it's impossible [for DTFx] to prevent the triggering of the next line [of DF Extension user-code]." This is definitely possible so I think the explanation needs to be tweaked slightly :) . Let's discuss this offline later, it's just a nitpick

@nytian (Collaborator, Author) commented on Jan 10, 2024

@davidmrdavid Thanks for pointing that out! I will update this part.

@davidmrdavid merged commit 1585a61 into main on Jan 12, 2024
2 checks passed
@nytian deleted the nytian/suspend_bug branch on January 18, 2024 at 01:55