-
Notifications
You must be signed in to change notification settings - Fork 220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix command ordering issue on child workflow cancel #588
Conversation
e3583c5
to
5f7e295
Compare
// child then signals its cancellation from a simulated goroutine and | ||
// immediately starts an activity. Previously, the SDK would put the cancel | ||
// command before the execute command since the child workflow started first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate on how/why this was different during replay than the first time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure (assuming you mean here). Here are the steps that actually occur on normal run:
- Start child workflow
- Execute activity
- Cancel child workflow
But what was happening in the command queue was:
- Create child workflow state in command slot 1
- Create execute activity in command slot 2
- Update existing child workflow state as canceled without reorder
That means that even though the run did a "execute then cancel" (and the event/activity IDs are incremented as such) the commands sent to the server were "cancel then execute" because of command queue order. So on replay it comes back with "cancel then execute" which does not match the expected IDs of "execute then cancel" that really happened.
So this fix makes the command order sent to the server is actually in the order of it happening. Namely that "execute then cancel" is sent as "execute then cancel". Specifically that a "cancel" is considered a new command at the end of the queue and doesn't just update the existing slot where it was started. I do wonder, and is one of the questions posed in the initial description, how many of these bugs exist where we are updating an existing state slot with a new command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK thanks, that largely makes sense to me except for one thing - starting the child workflow happened in a separate, previous, workflow task. So the command buffer shouldn't have anything about the child at all in the problematic task until it hits the cancel, which should happen after the activity.
Restated: If, for this workflow task, user code said "schedule activity" and then "cancel child wf", the commands should go out in that order because they should be the only commands in the list. There shouldn't already be some command about the child workflow at the front of the queue that the cancel "mutates"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my reading, command state is only "removed" if it is completed or as part of the special-case change you made with #504. In this case, ExecuteChildWorkflow
creates and pushes the *childWorkflowCommandStateMachine
onto the queue and then continually mutates that until the state is complete (not to be confused with canceled).
So the state is pushed and set to "created" and that command is sent but the state remains. Then an execute activity command state is pushed and the child workflow state is changed to cancel, in that order. Then the list of commands is walked for their state and that's how it sees cancel first.
There shouldn't already be some command about the child workflow at the front of the queue that the cancel "mutates"
This seems to be the case w/ all the commandStateMachine
impls. Every type of thing that instantiates a command state machine seems to add it to the list when created, mutate it during its lifecycle, and only remove it on completion. And if a mutation causes a new command to be sent, it will be sent in the order the command state machine was created, not when the state was changed to make the new command.
Knowing what I know now, I could probably cause other out of order situations due to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Knowing what I know now, I could probably cause other out of order situations due to this.
Ok, I confirmed that when it comes to lazily creating commands based on state after create, only *childWorkflowCommandStateMachine
creates a command that is not related to the state machine creation itself (on cancel, the source of this bug).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, all the states will remain. What should not remain is the command to send in the queue, because we're in a new workflow task from the one where ExecuteChildWorkflow
happened.
Ignoring the specifics of go (and proper names for things, lol), what should happen looks something like this:
=== First wft ===
Workflow code: Start a child workflow, block waiting for it to start
State machine states: {child wf: send command}
Command queue: [Start child wf]
=== Second wft ===
History: Child wf has started
Workflow code: Done waiting child wf to start, schedule activity 1, block on (not sure)
State machine states: {child wf: child started, act 1: send command}
Command queue: [schedule act 1]
=== Third wft ===
History: act 1 completed
Workflow code: Done waiting on act1, schedule activity 2, cancel child wf,
State machine states: {child wf: sending cancel command, act 1: resolved, act 2: send command}
Command queue: [schedule act 2, cancel child wf]
The important thing I get at here is the command queue is drained after every wft. The states of the command machines persist (or are removed when they are completed -- matters for memory but not semantically), but commands to send are always per-workflow task.
If I understand you correctly you're saying that the cancel command ends up first because the child wf state machine existed first, or is somehow persistent in the command queue, and if so that's the real problem here. What should happen, if the code says schedule act 2, then cancel the child wf, is that we create a new state machine for act 2, queue the command, then look up the state machine for the child wf, run it's cancel handler, which should produce the cancel command queued behind schedule act 2.
I think part of what makes this very confusing is Go is linking commands directly to state machines but the real relationship is more like "state machines produce commands".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should not remain is the command to send in the queue
The command is not in the queue. I apologize for using terms interchangeably that may be confusing. There is no command queue, there is only a command state queue. The command is lazily created based on the state's position in the queue (most of the time...see getCommand
impls in internal_decision_state_machine.go
, sometimes it is precreated). To build the gRPC request, the state queue is walked asking for commands. Only in the child workflow state's case does it create a new command for later states.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK wow I just went and reviewed the code and I had just not realized that the two things are just completely conflated in the Go sdk. Java and Core both drain their (rough) equivalent to orderedCommands
after the end of each WFT.
I had forgotten that just... doesn't happen here. This conflation of pending commands with the state machines is kinda the origin of the problem here, but fixing that is a quite large change.
So this is what you mean by
There is no command queue, there is only a command state queue.
Eek. In that case, what you've done here is I think probably the most reasonable thing we can do for now... Thanks for walking through it with me
// entire child workflow is recorded based on when it started not when it | ||
// was canceled, we have to move it to the end once canceled to keep the | ||
// expected commands in order of when they actually occurred. | ||
d.helper.moveCommandToBack(d) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried that this could actually break compat with existing workflow histories. In the history the user reported they had
13 WorkflowTaskScheduled 1s Taskqueue localhost:9fff2b4a-e875-4d99-bfa1-4203d6271561 Timeout 10s
14 WorkflowTaskStarted 1s requestId 890e099a-518e-44e4-aecd-679a90d14f1a
15 WorkflowTaskCompleted 1s identity 2132641@localhost@
16 RequestCancelExternalWorkflowExecutionInitiated 1s workflowTaskCompletedEventId 15 namespace default workflowExecution.workflowId 061d626c-5782-49df-ba43-cf071a2469a2_5 workflowExecution.runId control childWorkflowOnly true
17 ActivityTaskScheduled 1s Close Timeout Id 16 input [ 1000000000 ] Name Wait
But this change would swap 16 and 17, correct?
This also addresses your question 1 -- which is that we can't make changes that shift around command ordering w/o breaking compat. Of course if that's the only way we can fix the bug, then we have to, but ideally the fix only impacts workflows that would've broken anyway. Ideally, we'd generally fix any command type this can happen to, but ensuring compat / not breaking other things takes priority. Hopefully that helps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this change would swap 16 and 17, correct?
Correct, because that history is wrong compared to when it actually ran. and if 16 and 17 remain how they are in this situation, in all cases I can think of it'd have broken on replay.
but ideally the fix only impacts workflows that would've broken anyway
I think this only applies to cases where child workflows were started but a command happens in another coroutine before they are canceled. And I think they would be broken anyways on replay. I am struggling to think how this could affect a successful workflow replay but would definitely appreciate a thorough review by others.
I would love to take a safer approach if one can be devised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is all good. If such histories never could have worked in the first place, then it's an appropriate fix
To answer 2 -- I think it could, but it shouldn't. The SDK should simply not send any commands in that circumstance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me at this point. Worth getting another set of eyes.
…l-race # Conflicts: # test/integration_test.go # test/workflow_test.go
What was changed
Child workflow cancel command was sent in the order of when in started instead of when it was canceled. This moves the command to the end of the order once it's canceled.
Why?
Child workflow cancellation was being recorded based on when the child started instead of when the child was actually canceled.
For discussion
Two primary questions:
StartChildWorkflowExecution
command and aRequestCancelExternalWorkflowExecution
for that same workflow ever occur in the sameRespondWorkflowTaskCompleted
server call? If so this fix of moving the cancel to the end would need to be changed to instead create an entirely new command/state on cancel leaving the create one alone. I do not believe this is the case since the state can only represent one "state" at a time.