Skip to content

Conversation

@jeffschoner
Copy link
Contributor

@dwillett discovered a bug in the ordering of the evaluation of wait_until handlers that was introduced in #183. The short version is that sometimes workflows end up stuck because a wait_until condition evaluates to false before another handler, such as a signal handler, is executed that would change the condition to true. The wait_until handler should always be called after the handlers for a specific event have been called.

More details of the bug and its behavior can be found in dwillett#3.

Summary of change

wait_until handlers are now called after event-specific handlers, undoing this part of the earlier PR. The behavior of signal handlers that do not specify a name, which also use dispatch target wildcards, has been preserved. Handlers for wait_untils have further separated from the other handlers to make this behavior difference clearer in code.

@chuckremes2 Some specs referencing dispatch_name and accompanying comments on the Dispatcher have been removed while I was in here. As far as I can tell, this is dead code. It appears to be left over from some changes you made at @antstorm's request on #157.

Testing

The dispatcher unit specs have been updated:

bundle exec rspec ./spec/unit/lib/temporal/workflow/dispatcher_spec.rb

The StartWithSignalWorkflow sample has been modified to cover @dwillett's repro of the issue. The sleeps in this workflow have also been replaced with wait_until which minorly improves execution performance. I've confirmed this modified test does indeed fail without the ordering fix in this PR.

This workflow runs as part of this existing integration spec:

cd examples
bundle exec rspec spec/integration/signal_with_start_spec.rb

Copy link
Contributor

@antstorm antstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find, thank you both @jeffschoner and @dwillett!

The solution makes sense, however I'm a bit worried about exposing the Dispatcher to some of these problems. I wonder if this is telling us that we got the boundaries of the Dispatcher a bit wrong… I would think of it as something that just dispatches the events to handlers without caring about the types of the events or how it is used for the workflow processing

RegistrationHandle.new(event_handlers[target], @next_id)
end

def register_wait_until_handler(&handler)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should rename this method into something that isn't coupled to the wait_until name since that might change and also there might be other uses of this method in the future. Maybe something like a register_unscoped_handler or register_wildcard_handler?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two dimensions to each handler here: the order they run in and which events (ID/target and type) cause them to run. The wildcard aspect applies to both old-style signal handlers and wait_until handlers: they both match to all event targets. When they run is changing with this PR though. Before, the order was prescribed entirely by the order registered. Handlers reacting to a specific event must run first, then wait_until handlers need to run next.

I ended up naming this after the wait_until handlers, because they need this very specific behavior of running after the handlers that target a specific event. I suppose I could introduce a more generic approach here where the caller specifies a "priority" for the handler. When two handlers have the same priority, we would still break ties by using the order they were registered in.

Personally, I find the event-specific and wait_until-specific naming clearer because the ordering behavior is specific to these use cases. Another approach would be to lean into this approach, and move the wait_until handlers out of the dispatcher entirely. The code would be similar to the dispatcher, but much simpler since these do not need to match specific event IDs or types. I think I'd prefer this approach over specifying a priority, because I suspect the code would be easier to follow. I'll add another commit that gives this a try, assuming that it looks good to me otherwise.

@jeffschoner
Copy link
Contributor Author

@antstorm I tried a few different approaches to improving this based on your feedback. I think this one works the best. The dispatcher no longer needs to know about wait_until, but is still able to support running wait_until handlers after the others. It also eliminates the assumption that target-wildcard handlers must always be called at the end. These dimensions can be set independently when a handler is registered.

I used a default method parameter to minimize the number of places I'd need to update the code, but happy to go back and change all the call sites if you'd prefer the AT_BEGINNING and AT_END to be explicit.

@jeffschoner jeffschoner requested a review from antstorm September 5, 2022 21:04
Copy link
Contributor

@antstorm antstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeffschoner this works great, thank you, really appreciate the extra effort! 🙌

@DeRauk DeRauk merged commit bb3f330 into coinbase:master Sep 6, 2022
dwillett pushed a commit to dwillett/temporal-ruby that referenced this pull request Sep 6, 2022
* Remove dead code from previous messy merge

* Separate wait_until handlers, execute at end

* Modify signal_with_start_workflow to cover dwillett's repro

* Decouple register_handler from wait_until
@jeffschoner jeffschoner deleted the dispatch-ordering branch October 5, 2022 16:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants