
Add explicit fetch state to worker TaskState #4470

Merged
jrbourbeau merged 29 commits into dask:main from gforsyth:fetch_and_wait on Mar 18, 2021

Conversation

gforsyth
Contributor

Taking a pass at the new control flow that @fjetter laid out in #4413. Many thanks to him for the very detailed write-up and diagnosis of the issue.

To disambiguate a task that is to be executed vs. a task which is a data container (a dependency to be fetched from another worker), I've added a new state called `fetch`. The existing state `waiting` now only refers to tasks which will be executed on the worker.

I think the add_task function needs to be cleaned up, and I also agree with @fjetter that we should have a set of chained recommendations similar to what the scheduler has, but hopefully this is a step in that direction.

test_failed_workers.py::test_broken_worker_during_computation fails once every 30 runs or so on my machine; I'd like to fix that before this gets merged in, but the more explicit states make hunting down inconsistent state a fair bit easier.

This also needs documentation updates for the new state and the new worker flow.
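As a rough illustration of that split (a hypothetical sketch, not the PR's actual diff -- only the state names `fetch` and `waiting` and the notion of a `runspec` come from the worker):

def classify_incoming_task(ts, runspec=None):
    # Illustrative only: pick the initial state for a task arriving at a worker.
    if runspec is not None:
        ts.runspec = runspec
        return "waiting"  # the worker will execute this task once its deps are in memory
    return "fetch"        # pure data dependency, to be gathered from a peer worker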

@fjetter
Member

fjetter commented Feb 1, 2021

First pass looks good. Will give it a deeper review once you are more confident with the changes. If there are still failing tests, this may trigger a bunch more changes.
I highly recommend running the flaky tests in an IPython shell to speed up development. At least the gen_cluster-decorated tests can simply be imported and executed as ordinary functions.
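For example, something along these lines works in an IPython session (a sketch of that workflow, assuming a source checkout where the test module is importable; the loop count is arbitrary):

# Call a gen_cluster-decorated test like an ordinary function; the decorator
# spins up its own cluster and event loop on every call.
from distributed.tests.test_failed_workers import (
    test_broken_worker_during_computation,
)

for i in range(30):
    print(f"run {i}")
    test_broken_worker_during_computation()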

@gforsyth gforsyth marked this pull request as ready for review February 1, 2021 18:35
@gforsyth
Contributor Author

gforsyth commented Feb 1, 2021

Thanks for the initial pass, @fjetter -- I ran test_broken_worker_during_computation and test_who_has_clears_after_failed_connection 100x each without any issue, so I think this is stable and ready for a review.

I'd like to make the state validation methods a bit more detailed, but I think that will have to wait until we have the chained recommendation system in place, and I'm trying not to do too much in this PR.

Thanks again for all the work in diagnosing these issues!

@mrocklin
Member

mrocklin commented Feb 1, 2021 via email

@gforsyth
Contributor Author

gforsyth commented Feb 1, 2021

Any interest in rewriting this doc section? https://distributed.dask.org/en/latest/worker.html#internal-scheduling

yes, on my list

Comment on lines 1537 to 1538
if dep_ts.state in ("fetch", "flight"):
    ts.waiting_for_data.add(dep_ts.key)
Member

IIRC, `waiting_for_data` being empty is one of the conditions for the transition waiting -> executing, and I'm wondering if this condition shouldn't also include the waiting and executing states.

Contributor Author

I think that if I include waiting and executing here, it might cause weird transitions -- if the dependency is executing and then we add it to waiting_for_data while it is moving to memory, it will cause us to try to re-run this and we'll end up with a waiting -> memory transition.

Member

Sorry, I think my comment was wrong. This is an issue of semantics. I perceived waiting_for_data to be the same as "dependencies which are not yet in memory", but it is rather "dependencies which are not yet in memory and must be fetched from another worker". If this is true, your code changes make sense and I think there should be no other state in here.

"""
ts.runspec = None

def transition_constrained_waiting(self, ts):
Member

I'm not sure if this is possible. Constrained is more or less the twin state of a waiting task with the distinction that the constrained task has resource constraints to fulfil. If that's the case, this transition would mean we're changing the definition of the task by stripping the resource constraints

Contributor Author

This should only be called if a constrained task is stolen -- the current logic there is that stolen tasks revert to a state of waiting and are then released.
I think in this case we're only stripping the resource constraints because they're being enforced on a different worker.

Member

Even if it is stolen, the resource constraint of a task is defined at task initialisation, effectively by the user. Resource counting is done by the worker, but the task itself should not be modified.

it is managed here

for resource, quantity in ts.resource_restrictions.items():
    self.available_resources[resource] -= quantity

and here
for resource, quantity in ts.resource_restrictions.items():
    self.available_resources[resource] += quantity

but I would consider the field resource_restrictions to be immutable (if only that were easily possible in Python :) )

Therefore, if my reasoning is correct, this might be overshadowing another wrong transition (maybe the key is forgotten and improperly resubmitted?)

Member

the current logic there is that stolen tasks revert to a state of waiting and are then released.

Do you mean this line?

self.transition(ts, "waiting")

I think it is wrong... :( This should show up somewhere in a validate-task method. I think there should be something like

def validate_task_waiting(self, ts):
    ...
    assert ts.resource_restrictions is None  # or an empty dict, don't know
    ...

Contributor Author

That's a very good point re: resource_restrictions -- I'll look again at the stealing code.
I don't know why the transition to waiting was in there -- it seems like after we have confirmation that the task has been stolen, we should release the key on the victim side (along with whatever available_resources might have been allocated for use)

Contributor Author

OK, I'll push up changes in a sec -- I don't think we need the transition function called from steal_request.

Regarding the check for resource_restrictions, the current flow for a constrained task is

new -> waiting (gather any remote dependencies) -> constrained (because ts.resource_restrictions is not None) -> executing

so we don't want to check that it's None in validate_task_waiting.

I DO think we could improve available_resources tracking by keeping a set of all the tasks that are currently occupying available resources and periodically checking (in ensure_computing, maybe?) that all of the keys in the set are in the executing state, discarding them if not.
That might be best left for a separate PR.
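A rough sketch of that idea (entirely hypothetical -- the `resource_holders` set and the helper below are not part of this PR or of the worker's API):

def reconcile_resource_holders(worker, resource_holders):
    # Meant to run periodically, e.g. from ensure_computing: any key in the
    # tracking set that is no longer executing gives its resources back.
    for key in list(resource_holders):
        ts = worker.tasks.get(key)
        if ts is not None and ts.state == "executing":
            continue  # still legitimately occupying its resources
        if ts is not None:
            for resource, quantity in ts.resource_restrictions.items():
                worker.available_resources[resource] += quantity
        resource_holders.discard(key)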

Contributor Author

Also, it looks like steal_request doesn't need to worry about releasing available_resources because they aren't "taken" until the task transitions from ready -> executing, at which point it's no longer eligible to be stolen

@fjetter
Member

fjetter commented Feb 3, 2021

The constrained->waiting thing might connect to #4446

everything else LGTM

@gforsyth
Contributor Author

gforsyth commented Feb 3, 2021

Damn, moving to the non-transition steal_request flow led to a deadlock and I can't track it down locally.

@fjetter you added this in 1fc1b46 as a part of #4432 -- do you remember why that transition was there? Was it solely to remove the runspec? Or to handle dependents a little more gracefully?

@gforsyth
Contributor Author

gforsyth commented Feb 3, 2021

Hmm, conversely, it might have been a fluke? It looks like perhaps pytest test_client_executor.py::test_cancellation hung, which seems possibly unrelated

@fjetter
Member

fjetter commented Feb 4, 2021

@fjetter you added this in 1fc1b46 as a part of #4432 -- do you remember why that transition was there? Was it solely to remove the runspec? Or to handle dependents a little more gracefully?

Well, the runspec is reset a few lines above the release, and yes, I tried to map as much as possible via transitions instead of in-place mutations to ensure all connected state information is properly modified (dependents, dependencies, waiting for, etc.). I am still convinced that the way we're releasing tasks is also a bit buggy, or at the very least not very transparent; I added this commit trying to handle it more transparently. It is currently otherwise very difficult to know what the release actually does. For example, is the key still in self.tasks or not? That is determined by ts.dependents, but why do I need to check this attribute myself before calling that method? Sometimes you want only some of release_key and not all of it, but it's not very clear to me when we need what. Eventually we might want to implement this more clearly by having another virtual state like forgotten, similar to how the scheduler deals with this, but that's yet another change. At the very least we should fix this kind of leaky abstraction and make it "easier" to call release_key such that the caller knows for a fact what is going to happen. I'm drifting off...

Ultimately, I can't tell you which direction is the right one -- that's why my first PR escalated so badly :(


Damn, moving to the non-transition steal_request flow led to a deadlock and I can't track it down locally.

Do you have an idea which test deadlocked? I can try to run it on my machine and see if I can reproduce

It looks like perhaps pytest test_client_executor.py::test_cancellation hung, which seems possibly unrelated

What my last efforts debugging this taught me is that there is no such thing as an unrelated deadlock. It's all just a big ball of spaghetti ;)
I'll try to reproduce (if I can, I'll try the same on master and we'll see where we are)

@gforsyth
Contributor Author

gforsyth commented Feb 4, 2021

there is no such thing as an unrelated deadlock

oof, too true. I was thinking it might've been a known flaky test, but GitHub Actions doesn't hang on to logs for very long; I went looking for a previous similar timeout but didn't find one.

@gforsyth
Contributor Author

Hey @fjetter -- just rebased against master and should be up to date now. If you have the opportunity to run this in one of your bigger workloads and see how it performs, that would be very helpful. I've run the test-suite locally a few hundred times at this point (focusing on work-stealing, worker failure, and regular worker tests) and things seem to be stable.

Member

@fjetter fjetter left a comment

I'll test this on our workloads and will report back.

@fjetter
Member

fjetter commented Feb 23, 2021

Unfortunately, I hit #4538 when trying this so no results, yet

@fjetter
Member

fjetter commented Feb 23, 2021

I reverted the commit which likely caused #4538 in order to run some tests with this PR. Unfortunately, the very first try already hit a deadlock. I'm in an upscaling scenario where I start with one worker and add more and more, i.e. heavy work stealing. After some upscaling, the cluster distributes some of the tasks and works on them, but only on about half. The view from the scheduler (after it became stale) shows that the worker is supposedly still processing ~70k tasks and doesn't have any in memory. However, when asking the worker about its tasks, I can see that it has two tasks in memory. The state between scheduler and worker seems to have drifted significantly.

Asking the scheduler/worker for the story of one of the processing tasks, I get only a single entry, [("<key>", 'release-key')]. I couldn't find any stealing logs for the key I was looking at either.
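For reference, this kind of story query can be made from a client session (a sketch assuming an existing `client`; the key is a placeholder):

key = "<key>"  # one of the stuck "processing" keys

# Ask every worker for its log entries touching the key.
worker_stories = client.run(lambda dask_worker: dask_worker.story(key))

# Ask the scheduler for its transition-log entries for the same key.
scheduler_story = client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.story(key))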

@gforsyth
Contributor Author

Thanks for the report @fjetter -- and for the description of the failing test. I'll also try to work up a test case that follows that pattern and see if I can reproduce locally.

@fjetter
Member

fjetter commented Feb 23, 2021

I had another look and couldn't find any more useful logs, but I noticed that our scheduler was overloaded the entire time, for whatever reason. That at least explains why the work stealing was working so poorly, but not why the initial worker was lazy and didn't process any results, nor why everything was entirely out of sync. Regardless, what I've seen didn't look healthy. I'll try to perform a few more test runs later today/tomorrow.

@fjetter
Member

fjetter commented Feb 25, 2021

Had another test run which ran through but I received a lot of logs like

('fetch', 'memory')
Traceback (most recent call last):
  File "/.../lib/python3.6/site-packages/distributed/utils.py", line 655, in log_errors
    yield
  File "/.../lib/python3.6/site-packages/distributed/worker.py", line 2256, in gather_dep
    self.transition(ts, "memory", value=data[d])
  File "/.../lib/python3.6/site-packages/distributed/worker.py", line 1576, in transition
    func = self._transitions[start, finish]
KeyError: ('fetch', 'memory')

@gforsyth
Contributor Author

gforsyth commented Mar 2, 2021

I hit the KeyError reliably when running tests/test_steal.py::test_steal_twice -- trying to figure out the root cause now, but it definitely occurs as a knock-on effect from a dependency being stolen, so that's encouraging.

@gforsyth
Contributor Author

gforsyth commented Mar 3, 2021

Changes:

  • If there's a knock-on release_key call but the task is in flight, just leave it alone; the worst case is an extra entry in the self.tasks dictionary (see the sketch after this list).
  • Reserve self.data_needed for tasks which will be executed locally (that is, only list the tasks for which we need to fetch data, don't include the tasks to-be-fetched)
  • Don't call release_key in a transition function, ever.
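A hedged sketch of the first bullet's guard (simplified signature, not the PR's actual diff):

def release_key(self, key, reason=None):
    ts = self.tasks.get(key)
    if ts is None:
        return
    if ts.state == "flight":
        # A transfer is already underway for this key; tearing it down now would
        # corrupt the in-flight bookkeeping. Worst case: a stale self.tasks entry.
        return
    ...  # normal release path continues here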

@gforsyth
Contributor Author

gforsyth commented Mar 3, 2021

@fjetter -- if you have an opportunity later this week to run your heavy stealing workload against this branch that would be a big help. Thanks!

@fjetter
Member

fjetter commented Mar 4, 2021

I am repeating myself but I believe we need to eventually rework the release mechanism. I have the feeling this is mostly educated guessing on when we must release a key, when we should and when we must not release it. Not sure how you feel about this. Either way, I'm glad if it works and reworking this is definitely out of scope for this PR

if you have an opportunity later this week to run your heavy stealing workload against this branch that would be a big help. Thanks!

Yes, I think I can find time for this later today

@gforsyth
Contributor Author

gforsyth commented Mar 4, 2021

I am repeating myself but I believe we need to eventually rework the release mechanism.

Oh, definitely. I think reworking the release mechanism and adding in a chained-transition system are the next steps once this is stable.

Member

@fjetter fjetter left a comment

I gave this another spin and couldn't find anything wrong with it. That doesn't necessarily mean a lot but it's a very good start.

I'd be willing to merge this now but there is a release scheduled for tomorrow and I'm wondering if we want to let this sit a while before releasing. Opinions? cc @gforsyth @jrbourbeau

@gforsyth
Contributor Author

gforsyth commented Mar 4, 2021

I gave this another spin and couldn't find anything wrong with it. That doesn't necessarily mean a lot but it's a very good start.

🎉 wooooooo!

I'll defer to @jrbourbeau on whether we want to try to sneak this in to the next release or let it sit on master for a little bit

@jrbourbeau
Member

I'd prefer to wait until after releasing to merge this in if that's alright

gforsyth and others added 18 commits March 4, 2021 12:33
  • Should only be set for `fetch` or `flight` -- it's possible, following failed workers, that a task will end up looking remote when it isn't, so we discard the key from `has_what` when a task is transitioned to `ready`.
  • Tasks can transition from `fetch` -> `waiting` or `flight` -> `waiting` if the worker that has the data dies before the transfer is completed. If that task is reassigned to a worker that was expecting just the data, add the `runspec` and clear out any references to where the data _was_ coming from, since it will now be executed locally (see the sketch after this list).
  • Trying this out as it might not be needed anymore (but can revert if tests complain).
  • There's a very real chance here for a task to be reassigned, and the extra task doesn't hurt anything.
  • E.g. don't add keys which will be fetched from other workers -- that information belongs in `TaskState.waiting_for_data`.
  • We shouldn't call `release_key` in transition functions at all; it is very likely to cause mix-ups. There's an outside chance this leads to worker memory problems over very long jobs, but I think that is better handled by a cleanup callback.
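A hedged sketch of the reassignment case in the second bullet above (attribute names are illustrative and may not match the worker's actual fields):

def transition_fetch_waiting(self, ts, runspec):
    # The task was only expected here as data, but its previous owner died and
    # the scheduler reassigned it to this worker for execution.
    ts.runspec = runspec  # it will now be computed locally
    # Drop any record of which peer was supposed to supply the data.
    for worker_addr in list(ts.who_has):
        self.has_what[worker_addr].discard(ts.key)
    ts.who_has.clear()
    ts.state = "waiting"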
@gforsyth
Contributor Author

gforsyth commented Mar 8, 2021

Hey @jrbourbeau -- are we good to merge this now?

Base automatically changed from master to main March 8, 2021 19:04
@gforsyth
Contributor Author

Just a gentle ping on this to see if anyone has time to review further and/or merge

@fjetter
Member

fjetter commented Mar 16, 2021

I would go ahead and merge this. I'm just a bit hesitant since I've not been very involved lately and don't know if this interferes with any other change. @jrbourbeau, can you give a quick "go ahead" / merge unless you are aware of any other big change which could interfere?

Member

@jrbourbeau jrbourbeau left a comment

Thanks @gforsyth and @fjetter for your work here -- apologies for the delayed reply. After a quick look this seems good to merge (there are a couple of TODOs where it'd be good to confirm if they are talking about future work or things we should include here)

Comment on lines +1486 to +1488
# TODO: move transition of `ts` to end of `add_task`
# This will require a chained recommendation transition system like
# the scheduler
Member

Does this need to be done in this PR or is this follow-up work?

ts.waiting_for_data.add(dep_ts.key)
self.waiting_for_data_count += 1
# check to ensure task wasn't already executed and partially released
# # TODO: make this less bad
Member

I'm not sure if this is bad aesthetically or from a logic perspective. Is there anything else we should do here in this PR?

@gforsyth
Contributor Author

Hey @jrbourbeau! Yeah, both of those TODOs are for the next PR, which is going to add a chained recommendation system for transitions.
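For context, a minimal sketch of what such a chained-recommendation loop could look like (modeled on the scheduler's transition machinery; names are illustrative, not the eventual worker implementation):

def transitions(self, recommendations):
    # Drain a dict of {key: finish_state}; each transition may recommend more.
    while recommendations:
        key, finish = recommendations.popitem()
        ts = self.tasks[key]
        new_recommendations = self.transition(ts, finish) or {}
        recommendations.update(new_recommendations)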

Member

@jrbourbeau jrbourbeau left a comment

@jrbourbeau jrbourbeau merged commit 98148cb into dask:main Mar 18, 2021
@gforsyth gforsyth deleted the fetch_and_wait branch March 18, 2021 15:09