[wip] Add reproduction case for workflows not completing from signal #3
@dwillett I've added extra logging to your repro. The output can be seen at https://gist.github.com/jeffschoner/a4be45f7f74f76480f13cd4624d60b2e
This is consistent with what I had sketched out over Slack, where the condition is evaluated before the timer callback runs. They are sequenced this way because the wait_until happens before the timer is started (the timer only starts once the signal comes in, which is after the workflow is started).
A few observations/notes:
The signal is sent so quickly after the workflow is started that it gets handled in the first workflow task. This simplifies what I mentioned in Slack (2 workflow tasks instead of 3), but does not affect the behavior.
In the outputs, the handler registered for * * is for the wait_until. These are vaguely identified because they're not associated with a specific event. They run every time something in the workflow changes after they're registered (hence the * for wildcard).
The two replays help a bit to show why dispatches need to be called in a deterministic way upon replay. The second workflow task begins exactly as the first ran. If it didn't run this way, there would be potential for non-determinism, because little blocks of workflow code run between many of the log lines. A different order could mean the workflow exiting at a different point in execution (missing the start of a timer or activity), or not running an activity or timer in a callback
83), this sort of workflow would be broken.
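To make the wildcard and replay points above concrete, here is a minimal toy sketch. ToyDispatcher, run_history, and the event names are all hypothetical stand-ins invented for illustration; this is not temporal-ruby's dispatcher, only a model of "handlers run in a fixed order, and a * * handler runs on every dispatch".

```ruby
# Toy model only: ToyDispatcher is a hypothetical stand-in,
# not the dispatcher used by temporal-ruby.
class ToyDispatcher
  WILDCARD = '*'

  def initialize
    @handlers = [] # kept in registration order
  end

  def register(target, event, &block)
    @handlers << [target, event, block]
  end

  # Runs every matching handler, in registration order.
  def dispatch(target, event)
    @handlers.each do |t, e, block|
      next unless [WILDCARD, target].include?(t) && [WILDCARD, event].include?(e)

      block.call
    end
  end
end

# Feeds a fixed history through a fresh dispatcher and returns the handler log.
def run_history(history)
  log = []
  dispatcher = ToyDispatcher.new
  # The wait_until-style condition is registered first, for any target and event.
  dispatcher.register('*', '*') { log << 'condition check (* *)' }
  dispatcher.register('timer-1', 'fired') { log << 'timer-1 callback' }
  history.each { |target, event| dispatcher.dispatch(target, event) }
  log
end

HISTORY = [%w[signal received], %w[timer-1 fired]].freeze

first_run = run_history(HISTORY)
replay    = run_history(HISTORY)
puts first_run
puts "replay identical: #{first_run == replay}"
```

In this toy, the same history always produces the same handler log, which is the property replay relies on. It also shows the wildcard handler firing before the timer callback when both match the same event.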
When I restore the behavior of calling the wildcard handlers (aka the handlers for wait_until) after all the other handlers, your test does pass. I'm still working on getting a repro for the bug I was seeing before that caused me to make coinbase#183. Unfortunately, the workflow where we saw this before is complex and uses a bunch of internal code, so it can't be surfaced publicly very easily.
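A rough sketch of why the handler order matters for this repro, under simplifying assumptions: a single "timer fired" event is dispatched, nothing arrives afterwards, and the condition is only re-checked when its handler is called. run_repro and the state keys are hypothetical names; the real library code is more involved.

```ruby
# Toy model only: names here are hypothetical, not temporal-ruby's API.
def run_repro(wildcard_last:)
  state = { timer_fired: false, completed: false }
  log = []

  # Handlers as [kind, callable], listed in registration order:
  # the wait_until condition is registered before the timer exists.
  handlers = [
    [:wildcard, lambda {
      log << "condition sees timer_fired=#{state[:timer_fired]}"
      state[:completed] = true if state[:timer_fired]
    }],
    [:specific, lambda {
      log << 'timer callback'
      state[:timer_fired] = true
    }]
  ]

  # Optionally move wildcard handlers behind the event-specific ones.
  # partition keeps the relative order within each group.
  if wildcard_last
    specific, wildcard = handlers.partition { |kind, _| kind == :specific }
    handlers = specific + wildcard
  end

  # Dispatch the single "timer fired" event to every handler.
  handlers.each { |_, handler| handler.call }
  [state[:completed], log]
end

p run_repro(wildcard_last: false)
p run_repro(wildcard_last: true)
```

With registration order, the condition is checked while timer_fired is still false and the toy workflow never completes. With wildcard handlers last, the condition sees the callback's effect and completes.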