Skip to content

Conversation

@jeffschoner
Copy link

@dwillett I've added extra logging to your repro. The output can be seen at https://gist.github.com/jeffschoner/a4be45f7f74f76480f13cd4624d60b2e

This is consistent with what I had sketched out over Slack, where the condition is evaluated before the timer callback runs. These are sequenced in this way because the wait_until happens before the timer is started (which only happens once the signal comes in which is after the workflow is started).

A few observations/notes:

The signal is sent so quickly after the workflow is started that it gets handled in the first workflow task. This simplifies what I mentioned in a Slack (2 workflow tasks instead of 3), but does not affect the behavior.

In the outputs, the handle registered for * * is for the wait_until. These are vaguely identified because they're not associated with a specific event. They run every time something in the workflow changes after they're registered (hence, the * for wildcard).

The two replays help a bit to show why dispatches need to be called in a deterministic way upon replay. The second workflow task begins exactly as the first ran. If this didn't runthis way, there's potential for non-determinism because there are little blocks of workflow code running between many of the log lines. A different order could mean the workflow exiting at a different point in execution (missing the start of a timer or activity), or not running an activity or timer in a callback
83), this sort of workflow would be broken.

When I restore the behavior of calling the wildcard handlers (aka the handlers for wait_until) after all the other handlers, your test does pass. I'm still working on a getting a repro for the bug I was seeing before that caused me to make coinbase#183. Unfortunately, the workflow where we saw this before is complex and uses a bunch of internal code, so it can't be surfaced publicly very easily.

@jeffschoner jeffschoner changed the base branch from master to incomplete-workflow-repro August 14, 2022 01:59
@jeffschoner
Copy link
Author

BLUF

I'm mostly convinced that the bug I thought was fixed by reordering the wildcard handlers was not, and that the behavior to run wait_until handlers after the others is correct. While I've been able to reproduce the errors I saw before in one of our workflows, I have not been able to create a repro that works on the current dispatch ordering but fails on the old dispatch ordering.

The old bug I was trying to fix

The failures I'd previously observed were related to the error produced when a command is sent up after workflow completion. Drew improved the error message here, but then I continued to investigate its origin in one of our workflows:

          raise Temporal::WorkflowAlreadyCompletingError.new(
            "You cannot do anything in a Workflow after it completes. #{context_string}, "\
            "but then it sent a new command: #{command.class}.  This can happen, for example, if you've "\
            "not waited for all of your Activity futures before finishing the Workflow."
          )

I thought this was related to non-deterministic ordering, but after digging through this some more today, I'm more convinced this was related to one callback resuming a fiber that completed the workflow, followed by another fiber resuming that tried to produce a side effect, due to activity futures that were not being properly waited upon before completing the workflow. Ultimately, I think this was deterministic within a single workflow run but not across workflow runs, where slower than usual activity execution delayed when the callback ran.

Antstorm had raised some questions about ordering the wildcard handlers in one of the earlier PRs, or perhaps in discussions on Temporal Slack. I can't find this conversation but remember having it. I think I overcorrected to his questions in making coinbase#183, which also contained important performance fixes.

Suggested fix going forward

Dispatching order should be:

  • Any handlers tied to a specific event (activity, timer, or signal) are called first
  • Next, any wait_untils. All outstanding handlers are triggered on every event.
  • Within each of the above categories, if there are multiple handlers, they are invoked in the order they were used in workflow code. This guarantees determinism as the workflow progresses.

I'm considering replacing the two uses cases for TARGET_WILDCARD (wait_until and old-style signal handlers that don't specify a signal name) with dedicated targets. This should make the code clearer, and ensure the old-style signal handlers continue to run before any wait_until handlers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant