Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix command ordering issue on child workflow cancel #588

Merged
merged 4 commits into from
Oct 20, 2021

Conversation

cretz
Copy link
Member

@cretz cretz commented Oct 14, 2021

What was changed

Child workflow cancel command was sent in the order of when in started instead of when it was canceled. This moves the command to the end of the order once it's canceled.

Why?

Child workflow cancellation was being recorded based on when the child started instead of when the child was actually canceled.

For discussion

Two primary questions:

  1. Is it acceptable that this fix is tailored to child workflow cancellation? It would be a large undertaking to revisit the entire state machine to make sure that "states" that issue new commands are moved to the end of the command list every time.
  2. Can a StartChildWorkflowExecution command and a RequestCancelExternalWorkflowExecution for that same workflow ever occur in the same RespondWorkflowTaskCompleted server call? If so this fix of moving the cancel to the end would need to be changed to instead create an entirely new command/state on cancel leaving the create one alone. I do not believe this is the case since the state can only represent one "state" at a time.

@cretz cretz force-pushed the child-cancel-race branch from e3583c5 to 5f7e295 Compare October 14, 2021 15:35
Comment on lines +1284 to +1286
// child then signals its cancellation from a simulated goroutine and
// immediately starts an activity. Previously, the SDK would put the cancel
// command before the execute command since the child workflow started first.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on how/why this was different during replay than the first time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure (assuming you mean here). Here are the steps that actually occur on normal run:

  1. Start child workflow
  2. Execute activity
  3. Cancel child workflow

But what was happening in the command queue was:

  1. Create child workflow state in command slot 1
  2. Create execute activity in command slot 2
  3. Update existing child workflow state as canceled without reorder

That means that even though the run did a "execute then cancel" (and the event/activity IDs are incremented as such) the commands sent to the server were "cancel then execute" because of command queue order. So on replay it comes back with "cancel then execute" which does not match the expected IDs of "execute then cancel" that really happened.

So this fix makes the command order sent to the server is actually in the order of it happening. Namely that "execute then cancel" is sent as "execute then cancel". Specifically that a "cancel" is considered a new command at the end of the queue and doesn't just update the existing slot where it was started. I do wonder, and is one of the questions posed in the initial description, how many of these bugs exist where we are updating an existing state slot with a new command.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK thanks, that largely makes sense to me except for one thing - starting the child workflow happened in a separate, previous, workflow task. So the command buffer shouldn't have anything about the child at all in the problematic task until it hits the cancel, which should happen after the activity.

Restated: If, for this workflow task, user code said "schedule activity" and then "cancel child wf", the commands should go out in that order because they should be the only commands in the list. There shouldn't already be some command about the child workflow at the front of the queue that the cancel "mutates"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my reading, command state is only "removed" if it is completed or as part of the special-case change you made with #504. In this case, ExecuteChildWorkflow creates and pushes the *childWorkflowCommandStateMachine onto the queue and then continually mutates that until the state is complete (not to be confused with canceled).

So the state is pushed and set to "created" and that command is sent but the state remains. Then an execute activity command state is pushed and the child workflow state is changed to cancel, in that order. Then the list of commands is walked for their state and that's how it sees cancel first.

There shouldn't already be some command about the child workflow at the front of the queue that the cancel "mutates"

This seems to be the case w/ all the commandStateMachine impls. Every type of thing that instantiates a command state machine seems to add it to the list when created, mutate it during its lifecycle, and only remove it on completion. And if a mutation causes a new command to be sent, it will be sent in the order the command state machine was created, not when the state was changed to make the new command.

Knowing what I know now, I could probably cause other out of order situations due to this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Knowing what I know now, I could probably cause other out of order situations due to this.

Ok, I confirmed that when it comes to lazily creating commands based on state after create, only *childWorkflowCommandStateMachine creates a command that is not related to the state machine creation itself (on cancel, the source of this bug).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, all the states will remain. What should not remain is the command to send in the queue, because we're in a new workflow task from the one where ExecuteChildWorkflow happened.

Ignoring the specifics of go (and proper names for things, lol), what should happen looks something like this:

=== First wft ===
Workflow code: Start a child workflow, block waiting for it to start
State machine states: {child wf: send command}
Command queue: [Start child wf]
=== Second wft ===
History: Child wf has started
Workflow code: Done waiting child wf to start, schedule activity 1, block on (not sure)
State machine states: {child wf: child started, act 1: send command}
Command queue: [schedule act 1]
=== Third wft ===
History: act 1 completed
Workflow code: Done waiting on act1, schedule activity 2, cancel child wf, 
State machine states: {child wf: sending cancel command, act 1: resolved, act 2: send command}
Command queue: [schedule act 2, cancel child wf]

The important thing I get at here is the command queue is drained after every wft. The states of the command machines persist (or are removed when they are completed -- matters for memory but not semantically), but commands to send are always per-workflow task.

If I understand you correctly you're saying that the cancel command ends up first because the child wf state machine existed first, or is somehow persistent in the command queue, and if so that's the real problem here. What should happen, if the code says schedule act 2, then cancel the child wf, is that we create a new state machine for act 2, queue the command, then look up the state machine for the child wf, run it's cancel handler, which should produce the cancel command queued behind schedule act 2.

I think part of what makes this very confusing is Go is linking commands directly to state machines but the real relationship is more like "state machines produce commands".

Copy link
Member Author

@cretz cretz Oct 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should not remain is the command to send in the queue

The command is not in the queue. I apologize for using terms interchangeably that may be confusing. There is no command queue, there is only a command state queue. The command is lazily created based on the state's position in the queue (most of the time...see getCommand impls in internal_decision_state_machine.go, sometimes it is precreated). To build the gRPC request, the state queue is walked asking for commands. Only in the child workflow state's case does it create a new command for later states.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK wow I just went and reviewed the code and I had just not realized that the two things are just completely conflated in the Go sdk. Java and Core both drain their (rough) equivalent to orderedCommands after the end of each WFT.

I had forgotten that just... doesn't happen here. This conflation of pending commands with the state machines is kinda the origin of the problem here, but fixing that is a quite large change.

So this is what you mean by

There is no command queue, there is only a command state queue.

Eek. In that case, what you've done here is I think probably the most reasonable thing we can do for now... Thanks for walking through it with me

// entire child workflow is recorded based on when it started not when it
// was canceled, we have to move it to the end once canceled to keep the
// expected commands in order of when they actually occurred.
d.helper.moveCommandToBack(d)
Copy link
Member

@Sushisource Sushisource Oct 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that this could actually break compat with existing workflow histories. In the history the user reported they had

13 WorkflowTaskScheduled                           1s                  Taskqueue localhost:9fff2b4a-e875-4d99-bfa1-4203d6271561 Timeout 10s
14 WorkflowTaskStarted                             1s                  requestId 890e099a-518e-44e4-aecd-679a90d14f1a
15 WorkflowTaskCompleted                           1s                  identity 2132641@localhost@
16 RequestCancelExternalWorkflowExecutionInitiated 1s                  workflowTaskCompletedEventId 15 namespace default workflowExecution.workflowId 061d626c-5782-49df-ba43-cf071a2469a2_5 workflowExecution.runId control childWorkflowOnly true
17 ActivityTaskScheduled                           1s                  Close Timeout Id 16 input [ 1000000000 ] Name Wait

But this change would swap 16 and 17, correct?

This also addresses your question 1 -- which is that we can't make changes that shift around command ordering w/o breaking compat. Of course if that's the only way we can fix the bug, then we have to, but ideally the fix only impacts workflows that would've broken anyway. Ideally, we'd generally fix any command type this can happen to, but ensuring compat / not breaking other things takes priority. Hopefully that helps

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this change would swap 16 and 17, correct?

Correct, because that history is wrong compared to when it actually ran. and if 16 and 17 remain how they are in this situation, in all cases I can think of it'd have broken on replay.

but ideally the fix only impacts workflows that would've broken anyway

I think this only applies to cases where child workflows were started but a command happens in another coroutine before they are canceled. And I think they would be broken anyways on replay. I am struggling to think how this could affect a successful workflow replay but would definitely appreciate a thorough review by others.

I would love to take a safer approach if one can be devised.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is all good. If such histories never could have worked in the first place, then it's an appropriate fix

@Sushisource
Copy link
Member

To answer 2 -- I think it could, but it shouldn't. The SDK should simply not send any commands in that circumstance.

Copy link
Member

@Sushisource Sushisource left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me at this point. Worth getting another set of eyes.

cretz added 2 commits October 20, 2021 09:32
…l-race

# Conflicts:
#	test/integration_test.go
#	test/workflow_test.go
@cretz cretz merged commit 3b59754 into temporalio:master Oct 20, 2021
@cretz cretz deleted the child-cancel-race branch October 20, 2021 14:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants