
[WIP] Update Worker._executing handling #5500

Closed
wants to merge 3 commits

Conversation

jrbourbeau
Member

This is a possible fix for #5497. With this change, the example in #5497 no longer deadlocks. I'm pushing changes up early to test against the full CI suite.

cc @gjoseph92 @fjetter

@jakirkham
Member

Should we try to get this into the release if it does pass CI?

@gjoseph92
Collaborator

gjoseph92 commented Nov 5, 2021

I'm curious if we could consolidate management of self._executing into Worker.execute. Would it work for execute to add the task right before it starts, and discard it right when it finishes (as it does with self.active_keys)?

One issue is that when going to executing, there’s often a loop.add_callback(self.execute, ...) in the transition function (transition_ready_executing, transition_constrained_executing). By adding the task to self._executing immediately, the transition function is essentially “reserving a spot” in the executing count before the task actually starts running. If we get rid of that, maybe there would be race conditions oversubscribing the threadpool.
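To make the "reserving a spot" concern concrete, here's a rough sketch of the two orderings (heavily simplified, not the actual Worker code; run_in_executor and the transition bodies here are stand-ins):

```python
# Heavily simplified sketch; not the real Worker implementation.

# Today: the transition reserves the executing slot before execute() runs.
def transition_ready_executing(self, ts):
    self._executing.add(ts)                # slot claimed immediately
    self.loop.add_callback(self.execute, ts.key)

async def execute(self, key):
    ts = self.tasks[key]
    try:
        await self.run_in_executor(ts)     # stand-in for running the task in the threadpool
    finally:
        self._executing.discard(ts)        # released once the function finishes

# Proposed: execute() owns the entry for its whole lifetime, like active_keys.
async def execute_alternative(self, key):
    ts = self.tasks[key]
    self._executing.add(ts)                # only claimed once execute actually starts
    try:
        await self.run_in_executor(ts)
    finally:
        self._executing.discard(ts)
```

The difference is the window between the transition and the callback actually running: anything that looks at len(self._executing) during that window would undercount with the second version, which is the race I'm worried about.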

But as far as removing it, why would you ever want to leave the task in self._executing after the function itself has completed?

(Note that trying variations of these changes locally, many tests seem to deadlock, so I'm sure there's a reason we don't do this—I'd just like to understand why.)

@jrbourbeau
Member Author

Should we try to get this into the release if it does pass CI?

I'd feel more comfortable getting @fjetter's eyes on this before releasing it. I think this is okay, but my knowledge of the worker state machine is not as up to date as it was before the refactor.

gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Nov 5, 2021
@jakirkham
Member

Ok. Happy to leave this out if you prefer.

Just figured that if it helps solve a regression and is relatively simple, it might be worth putting in. If we don't think it is that simple (or there are subtle points that need further exploration), I understand leaving it out.

@fjetter
Member

fjetter commented Nov 8, 2021

The change looks OK. I typically try to avoid generic catch-all situations, e.g. I prefer a remove at the correct place over a discard in a catch-all. However, I'm not sure that introducing another transition_cancelled_error helps readability. Thoughts?
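To illustrate the distinction with a toy example (not the actual transition code; the method names here are just for illustration):

```python
# Toy example only; not the actual distributed transition functions.

# Catch-all style: cleans up no matter how we got here, but hides which
# transition path was supposed to be responsible.
def _release_key(self, ts):
    self._executing.discard(ts)   # silent if ts was never in the set

# Explicit style: the one transition responsible removes the entry, and a
# KeyError immediately surfaces a path we did not anticipate.
def _transition_executing_error(self, ts):
    self._executing.remove(ts)    # raises if the bookkeeping was wrong
```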

I don't want this to be merged without a test. This situation looks as if it can easily be provoked in a test; I'll look into it.

But as far as removing it, why would you ever want to leave the task in self._executing after the function itself has completed?

You don't. It should always be removed, but I forgot that transition path. That's basically a philosophical question, as outlined above. In my refactoring I tended to be very strict about these matters since I stumbled over so much dead code and undefined behaviour because a task was None, an attribute might or might not be set, etc. I prefer spelling this all out explicitly, e.g. there should now be no situation possible where key not in self.tasks. We don't need to be that strict for everything though, see the comment above.

Would it work for execute to add the task right before it starts, and discard it right when it finishes (as it does with self.active_keys)?

One issue is that when going to executing, there’s often a loop.add_callback(self.execute, ...) in the transition function (transition_ready_executing, transition_constrained_executing). By adding the task to self._executing immediately, the transition function is essentially “reserving a spot” in the executing count before the task actually starts running. If we get rid of that, maybe there would be race conditions oversubscribing the threadpool.

I believe you answered your own question there. I've been around that block a few times. To be honest, I dislike our entire interaction with the threadpool since it is very entangled with the state machine and not properly encapsulated. We have the same problem with the data-fetching coroutines, by the way: we end up setting some state before we dispatch a coroutine and then need to put a lot of engineering around ensuring a proper state before/after. That's one of the primary drivers for the resumed/cancelled state mechanics (not the only one) and a very frequent source of past deadlocks.

I'm very open to refactoring this part of the code as well. However, I would suggest starting with removing ensure_communicating and ensure_computing, since they can now be replaced with the proper transition system. That would make everything a bit more traceable and may eventually even allow us to write unit tests on certain transition paths instead of relying on the complex, artificial test constructs I've been writing for these problems.

Anyhow, I'm drifting off. Bottom line: for now we need _executing set the way it is, and we should only refactor further once we have reached stable ground.
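For reference, the oversubscription scenario looks roughly like this (schematic only, not the real ensure_computing implementation):

```python
# Schematic of the race; not the real ensure_computing implementation.
def ensure_computing(self):
    # Decide how many more tasks may start based on the current executing count.
    while self.ready and len(self._executing) < self.nthreads:
        ts = self.ready.pop()
        # Claiming the slot here, before the callback runs, keeps the count
        # accurate. If the slot were only claimed inside execute(), every task
        # picked in this loop would still see the old, smaller count, because
        # len(self._executing) only grows once the callbacks actually run,
        # so we could dispatch more tasks than there are threads.
        self._executing.add(ts)
        self.loop.add_callback(self.execute, ts.key)
```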

@jrbourbeau
Member Author

Closing in favor of #5503 -- thanks @fjetter!

jrbourbeau closed this Nov 18, 2021