[wip] Add reproduction case for workflows not completing from signal #3

jeffschoner · 2022-08-14T01:56:50Z

@dwillett I've added extra logging to your repro. The output can be seen at https://gist.github.com/jeffschoner/a4be45f7f74f76480f13cd4624d60b2e

This is consistent with what I had sketched out over Slack, where the condition is evaluated before the timer callback runs. These are sequenced in this way because the wait_until happens before the timer is started (which only happens once the signal comes in which is after the workflow is started).

A few observations/notes:

The signal is sent so quickly after the workflow is started that it gets handled in the first workflow task. This simplifies what I mentioned in a Slack (2 workflow tasks instead of 3), but does not affect the behavior.

In the outputs, the handle registered for * * is for the wait_until. These are vaguely identified because they're not associated with a specific event. They run every time something in the workflow changes after they're registered (hence, the * for wildcard).

The two replays help a bit to show why dispatches need to be called in a deterministic way upon replay. The second workflow task begins exactly as the first ran. If this didn't runthis way, there's potential for non-determinism because there are little blocks of workflow code running between many of the log lines. A different order could mean the workflow exiting at a different point in execution (missing the start of a timer or activity), or not running an activity or timer in a callback
83), this sort of workflow would be broken.

When I restore the behavior of calling the wildcard handlers (aka the handlers for wait_until) after all the other handlers, your test does pass. I'm still working on a getting a repro for the bug I was seeing before that caused me to make coinbase#183. Unfortunately, the workflow where we saw this before is complex and uses a bunch of internal code, so it can't be surfaced publicly very easily.

jeffschoner · 2022-08-15T00:18:11Z

BLUF

I'm mostly convinced that the bug I thought was fixed by reordering the wildcard handlers was not, and that the behavior to run wait_until handlers after the others is correct. While I've been able to reproduce the errors I saw before in one of our workflows, I have not been able to create a repro that works on the current dispatch ordering but fails on the old dispatch ordering.

The old bug I was trying to fix

The failures I'd previously observed were related to the error produced when a command is sent up after workflow completion. Drew improved the error message here, but then I continued to investigate its origin in one of our workflows:

          raise Temporal::WorkflowAlreadyCompletingError.new(
            "You cannot do anything in a Workflow after it completes. #{context_string}, "\
            "but then it sent a new command: #{command.class}.  This can happen, for example, if you've "\
            "not waited for all of your Activity futures before finishing the Workflow."
          )

I thought this was related to non-deterministic ordering, but after digging through this some more today, I'm more convinced this was related to one callback resuming a fiber that completed the workflow, followed by another fiber resuming that tried to produce a side effect, due to activity futures that were not being properly waited upon before completing the workflow. Ultimately, I think this was deterministic within a single workflow run but not across workflow runs, where slower than usual activity execution delayed when the callback ran.

Antstorm had raised some questions about ordering the wildcard handlers in one of the earlier PRs, or perhaps in discussions on Temporal Slack. I can't find this conversation but remember having it. I think I overcorrected to his questions in making coinbase#183, which also contained important performance fixes.

Suggested fix going forward

Dispatching order should be:

Any handlers tied to a specific event (activity, timer, or signal) are called first
Next, any wait_untils. All outstanding handlers are triggered on every event.
Within each of the above categories, if there are multiple handlers, they are invoked in the order they were used in workflow code. This guarantees determinism as the workflow progresses.

I'm considering replacing the two uses cases for TARGET_WILDCARD (wait_until and old-style signal handlers that don't specify a signal name) with dedicated targets. This should make the code clearer, and ensure the old-style signal handlers continue to run before any wait_until handlers.

jeffschoner force-pushed the dw-repro branch from dffd7ba to 67f6796 Compare August 14, 2022 01:59

jeffschoner changed the base branch from master to incomplete-workflow-repro August 14, 2022 01:59

jeffschoner added 2 commits August 13, 2022 19:00

Extra verbose logging for dispatch handler

2c9f5a3

Add comment on how to resequence

5cd2bd8

jeffschoner force-pushed the dw-repro branch from 67f6796 to 5cd2bd8 Compare August 14, 2022 02:01

jeffschoner closed this Aug 15, 2022

jeffschoner mentioned this pull request Aug 15, 2022

Fix dispatch ordering of wait_until handlers coinbase/temporal-ruby#189

Merged

jeffschoner deleted the dw-repro branch October 5, 2022 16:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[wip] Add reproduction case for workflows not completing from signal #3

[wip] Add reproduction case for workflows not completing from signal #3

Uh oh!

jeffschoner commented Aug 14, 2022

Uh oh!

jeffschoner commented Aug 15, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

[wip] Add reproduction case for workflows not completing from signal #3

[wip] Add reproduction case for workflows not completing from signal #3

Uh oh!

Conversation

jeffschoner commented Aug 14, 2022

A few observations/notes:

Uh oh!

jeffschoner commented Aug 15, 2022

BLUF

The old bug I was trying to fix

Suggested fix going forward

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant