RFC: Parallelize sync_end to remove async hang mechanism #39007

Open
wants to merge 2 commits into
base: master

Conversation


@tro3 tro3 commented Dec 26, 2020

Targets #32677, following up from #38916. This takes the parallel Experimental.sync_end and adds error handling (with CompositeException) to match the existing Base.sync_end. It also adds lockup tests to both the threads_exec and Distributed_exec test suites and points Experimental.@sync back at the new sync_end.

@StefanKarpinski, @JeffBezanson, @vtjnash - this just submits the results of those previous conversations. If you guys want to toss this, no issue; I just wanted to give you the option. Feedback welcome.

@tro3 tro3 changed the title Parallelize sync_end to remove async hang mechanism RFC: Parallelize sync_end to remove async hang mechanism Dec 26, 2020
@tro3
Author

tro3 commented Jan 8, 2021

@StefanKarpinski, @JeffBezanson - bumping. I know you guys are busy, so if you want to push this one out or just kill it, I'm okay with that.

t = take!(c)
if t isa Exception                    # Exception from monitor. Collect
    c_ex = CompositeException([t])    # any other exceptions and throw
    while isready(c)
Member

I am not sure I understand this logic.

Don't you run the risk that the channel is empty, despite there being other tasks running which may fail at a later time?

Author

@tro3 tro3 Jan 8, 2021

It is a very real possibility that Tasks still running will fail at a later time: once we throw the first Exception(s) in the CompositeException, we stop looking at the rest. But the only alternative is to wait for all Tasks to complete, which can cause the same hang we are trying to avoid in the first place (e.g., if one of the running Tasks is waiting for Channel input from a dead one). This PR doesn't get us true structured concurrency; it just guarantees an Exception is thrown to help with debugging the system.
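
To make the trade-off concrete, here is a minimal sketch of the general idea, not the code in this PR. The name `sync_end_sketch` and its structure are assumptions for illustration: one monitor per Task reports into a buffered Channel, and the waiter throws as soon as the first failure arrives, collecting only the failures that have already been reported.

```julia
# Hypothetical illustration only; `sync_end_sketch` is not part of Base or of this PR.
function sync_end_sketch(tasks::Vector{Task})
    n = length(tasks)
    c = Channel{Any}(n)              # buffered, so a monitor never blocks on put!
    for t in tasks
        @async begin                 # one monitor per task
            try
                wait(t)
                put!(c, nothing)     # task finished normally
            catch e
                put!(c, e)           # wait(t) throws if the task failed
            end
        end
    end
    for _ in 1:n
        r = take!(c)
        if r isa Exception
            c_ex = CompositeException([r])
            while isready(c)         # collect only failures already reported
                r2 = take!(c)
                r2 isa Exception && push!(c_ex, r2)
            end
            throw(c_ex)              # do not wait for tasks that are still running
        end
    end
    return nothing
end

# Usage: the consumer waits on a Channel that its dead producer never fills.
ch = Channel{Int}(0)
consumer = @async take!(ch)
producer = @async error("producer died")
try
    sync_end_sketch([consumer, producer])
catch e
    @show typeof(e)                  # the failure surfaces instead of hanging
end
```

Waiting on the two tasks one after another would block on `consumer` forever. The sketch returns as soon as the producer's failure is reported, at the cost of leaving `consumer` running and of missing any failure that happens afterwards.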

@vchuravy vchuravy requested a review from JeffBezanson January 8, 2021 17:23
@tro3
Author

tro3 commented Jan 17, 2021

@JeffBezanson - bumping. If you are too busy, no worries, but I have a window of about 11 days to respond to major feedback quickly. Just letting you know.

@StefanKarpinski StefanKarpinski added the triage This should be discussed on a triage call label Apr 7, 2021
@vtjnash vtjnash added forget me not PRs that one wants to make sure aren't forgotten and removed triage This should be discussed on a triage call labels Jun 3, 2021
@tro3
Author

tro3 commented Jun 19, 2021

Gents - I have some time the next few weeks to turn in this direction, if there was something you all were still looking for.

Labels
forget me not PRs that one wants to make sure aren't forgotten
6 participants