
[WIP] Update Worker._executing handling #5500

Closed
wants to merge 3 commits

Conversation

jrbourbeau
Member

This is a possible fix for #5497. With this change, the example in #5497 no longer deadlocks. I'm pushing changes up early to test against the full CI suite.

cc @gjoseph92 @fjetter

@jakirkham
Member

Should we try to get this into the release if it does pass CI?

@gjoseph92
Collaborator

gjoseph92 commented Nov 5, 2021

I'm curious if we could consolidate management of self._executing into Worker.execute. Would it work for execute to add the task right before it starts, and discard it right when it finishes (as it does with self.active_keys)?

One issue is that when going to executing, there’s often a loop.add_callback(self.execute, ...) in the transition function (transition_ready_executing, transition_constrained_executing). By adding the task to self._executing immediately, the transition function is essentially “reserving a spot” in the executing count before the task actually starts running. If we get rid of that, maybe there would be race conditions oversubscribing the threadpool.
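To make the "reserving a spot" concern concrete, here's a rough sketch of the two orderings (heavily simplified, not the actual Worker code; run_in_executor and the transition bodies here are stand-ins):

```python
# Heavily simplified sketch; not the real Worker implementation.

# Today: the transition reserves the executing slot before execute() runs.
def transition_ready_executing(self, ts):
    self._executing.add(ts)                # slot claimed immediately
    self.loop.add_callback(self.execute, ts.key)

async def execute(self, key):
    ts = self.tasks[key]
    try:
        await self.run_in_executor(ts)     # stand-in for running the task in the threadpool
    finally:
        self._executing.discard(ts)        # released once the function finishes

# Proposed: execute() owns the entry for its whole lifetime, like active_keys.
async def execute_alternative(self, key):
    ts = self.tasks[key]
    self._executing.add(ts)                # only claimed once execute actually starts
    try:
        await self.run_in_executor(ts)
    finally:
        self._executing.discard(ts)
```

The difference is the window between the transition and the callback actually running: anything that looks at len(self._executing) during that window would undercount with the second version, which is the race I'm worried about.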

But as far as removing it, why would you ever want to leave the task in self._executing after the function itself has completed?

(Note that trying variations of these changes locally, many tests seem to deadlock, so I'm sure there's a reason we don't do this—I'd just like to understand why.)

@jrbourbeau
Member Author

Should we try to get this into the release if it does pass CI?

I'd feel more comfortable getting @fjetter's eyes on this before releasing it. I think this is okay, but my knowledge of the worker state machine is not as up to date as it was before the refactor.

gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Nov 5, 2021
@jakirkham
Member

Ok. Happy to leave this out if you prefer.

Just figured that if it helps solve a regression and is relatively simple, it might be worth putting in. If we don't think it is that simple (or there are subtle points that need further exploration), I understand leaving it out.

@fjetter
Member

fjetter commented Nov 8, 2021

The change looks OK. I typically try to avoid generic catch-all situations, e.g. I prefer a remove at the correct place over a discard in a catch-all. However, I'm not sure that introducing another transition_cancelled_error helps readability. Thoughts?
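To illustrate the distinction with a toy example (not the actual transition code; the method names here are just for illustration):

```python
# Toy example only; not the actual distributed transition functions.

# Catch-all style: cleans up no matter how we got here, but hides which
# transition path was supposed to be responsible.
def _release_key(self, ts):
    self._executing.discard(ts)   # silent if ts was never in the set

# Explicit style: the one transition responsible removes the entry, and a
# KeyError immediately surfaces a path we did not anticipate.
def _transition_executing_error(self, ts):
    self._executing.remove(ts)    # raises if the bookkeeping was wrong
```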

I don't want this to be merged without a test. This situation looks as if it can easily be provoked in a test; I'll look into it.

But as far as removing it, why would you ever want to leave the task in self._executing after the function itself has completed?

You don't. It should always be removed, but I forgot that transition path. That's basically a philosophical question, as outlined above. In my refactoring I tended to be very strict about these matters since I stumbled over so much dead code and undefined behaviour because a task was None, an attribute might or might not be set, etc. I prefer spelling this all out explicitly, e.g. there should now be no situation possible where key not in self.tasks. We don't need to be that strict for everything though, see the comment above.

Would it work for execute to add the task right before it starts, and discard it right when it finishes (as it does with self.active_keys)?

One issue is that when going to executing, there’s often a loop.add_callback(self.execute, ...) in the transition function (transition_ready_executing, transition_constrained_executing). By adding the task to self._executing immediately, the transition function is essentially “reserving a spot” in the executing count before the task actually starts running. If we get rid of that, maybe there would be race conditions oversubscribing the threadpool.

I believe you answered your own question there. I've been around that block a few times. To be honest, I dislike our entire interaction with the threadpool since it is very entangled with the state machine and not properly encapsulated. We have the same problem with the data-fetching coroutines, by the way: we end up setting some state before we dispatch a coroutine and then need to put a lot of engineering around ensuring a proper state before/after. That's one of the primary drivers for the resumed/cancelled state mechanics (not the only one) and a very frequent source of past deadlocks.

I'm very open to refactoring this part of the code as well. However, I would suggest starting with removing ensure_communicating and ensure_computing, since they can now be replaced with the proper transition system. That would make everything a bit more traceable and may eventually even allow us to write unit tests on certain transition paths instead of relying on the complex, artificial test constructs I've been writing for these problems.

Anyhow, I'm drifting off. Bottom line: for now we need _executing set the way it is, and we should only refactor further once we have reached stable ground.
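For reference, the oversubscription scenario looks roughly like this (schematic only, not the real ensure_computing implementation):

```python
# Schematic of the race; not the real ensure_computing implementation.
def ensure_computing(self):
    # Decide how many more tasks may start based on the current executing count.
    while self.ready and len(self._executing) < self.nthreads:
        ts = self.ready.pop()
        # Claiming the slot here, before the callback runs, keeps the count
        # accurate. If the slot were only claimed inside execute(), every task
        # picked in this loop would still see the old, smaller count, because
        # len(self._executing) only grows once the callbacks actually run,
        # so we could dispatch more tasks than there are threads.
        self._executing.add(ts)
        self.loop.add_callback(self.execute, ts.key)
```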

@jrbourbeau
Member Author

Closing in favor of #5503 -- thanks @fjetter!

jrbourbeau closed this Nov 18, 2021