Add explicit `fetch` state to worker TaskState #4470

Conversation
First pass looks good. Will give it a deeper review once you are more confident with the changes. If there are still failing tests, this may trigger a bunch more changes.
Thanks for the initial pass, @fjetter -- I ran test_broken_worker_after_computation and test_who_has_clears_after_failed_connection 100x each without any issue, so I think this is stable and ready for a review. I'd like to make the state validation methods a bit more detailed, but I think for now that will have to wait until we have the chained recommendation system in place and I'm trying not to do too much in this PR. Thanks again for all the work in diagnosing these issues!
Any interest in rewriting this doc section?
https://distributed.dask.org/en/latest/worker.html#internal-scheduling
Yes, on my list.
distributed/worker.py (Outdated)

    if dep_ts.state in ("fetch", "flight"):
        ts.waiting_for_data.add(dep_ts.key)
IIRC, `waiting_for_data` being empty is one of the conditions for the transition `waiting -> executing`, and I'm wondering if this condition shouldn't also include `waiting` and `executing`.
I think that if I include `waiting` and `executing` in here, it might cause weird transitions -- if the dependency is `executing` and we add it to `waiting_for_data` while it is moving to memory, it will cause us to try to re-run this and we'll end up with a `waiting -> memory` transition.
Sorry, I think my comment was wrong. This is an issue about semantics. I perceived `waiting_for_data` to be the same as "dependencies which are not yet in memory", but it is rather "dependencies which are not yet in memory and must be fetched from another worker". If this is true, your code changes make sense and I think there should be no other state in here.
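A minimal, self-contained sketch of the semantics settled on above -- the `TaskState` and helper here are simplified stand-ins for illustration, not the actual Worker implementation:

```python
from dataclasses import dataclass, field


@dataclass
class TaskState:
    key: str
    state: str = "new"  # e.g. new, waiting, fetch, flight, executing, memory
    waiting_for_data: set = field(default_factory=set)


def update_waiting_for_data(ts, dependencies):
    """Track only dependencies that still have to arrive from another worker.

    Dependencies that are waiting/executing locally will reach memory through
    their own transitions; adding them here is what would produce the spurious
    waiting -> memory transition discussed above.
    """
    for dep_ts in dependencies:
        if dep_ts.state in ("fetch", "flight"):
            ts.waiting_for_data.add(dep_ts.key)


# Only the remote dependency ends up tracked:
deps = [TaskState("a", state="memory"), TaskState("b", state="fetch"), TaskState("c", state="executing")]
t = TaskState("t", state="waiting")
update_waiting_for_data(t, deps)
assert t.waiting_for_data == {"b"}
```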
distributed/worker.py (Outdated)

        """
        ts.runspec = None

    def transition_constrained_waiting(self, ts):
I'm not sure if this is possible. `constrained` is more or less the twin state of a `waiting` task, with the distinction that the constrained task has resource constraints to fulfil. If that's the case, this transition would mean we're changing the definition of the task by stripping the resource constraints.
This should only be called if a constrained task is stolen -- the current logic there is that stolen tasks revert to a state of `waiting` and are then released.

I think in this case we're only stripping the resource constraints because they're being enacted on a different worker.
Even if it is stolen, the resource constraint of a task is defined at task initialisation, effectively by the user. Resource counting is done by the worker, but the task should not be modified.

It is managed here

distributed/distributed/worker.py, lines 1725 to 1726 in 98570fb

    for resource, quantity in ts.resource_restrictions.items():
        self.available_resources[resource] -= quantity

and here

distributed/distributed/worker.py, lines 1740 to 1741 in 98570fb

    for resource, quantity in ts.resource_restrictions.items():
        self.available_resources[resource] += quantity

but I would consider the field `resource_restrictions` to be immutable (if only that were easily possible in Python :) ).

Therefore, if my reasoning is correct, this might overshadow another wrong transition (maybe a key is forgotten and improperly resubmitted?).
> the current logic there is that stolen tasks revert to a state of waiting and are then released.

Do you mean this line?

distributed/distributed/worker.py, line 2285 in 98570fb

    self.transition(ts, "waiting")

I think it is wrong... :( This should show up somewhere in a validate-task method. I think there should be something like

    def validate_task_waiting(self, ts):
        ...
        assert ts.resource_restrictions is None  # or empty dict, don't know
        ...
That's a very good point re: `resource_restrictions` -- I'll look again at the stealing code.

I don't know why the transition to `waiting` was in there -- it seems like after we have confirmation that the task has been stolen, we should release the key on the victim side (along with whatever `available_resources` might have been allocated for use).
Ok, I'll push up changes in a sec -- I think we don't need the transition function called from `steal_request`.

Regarding the check for `resource_restrictions`, the current flow for a `constrained` task is

`new` -> `waiting` (gather any remote dependencies) -> `constrained` (because `ts.resource_restrictions is not None`) -> `executing`

so we don't want to check that it's `None` in `validate_task_waiting`.

I DO think we could improve `available_resources` tracking with a set of all the tasks that have currently occupied available resources, then doing a periodic check (in `ensure_computing`, maybe?) that all of the keys in the set are in the `executing` state, discarding them if not (sketched below). That might be best left for a separate PR.
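A rough sketch of that bookkeeping idea. The names (`ResourceTracker`, `claim`, `release`, `reconcile`) are made up for illustration and are not part of the Worker API:

```python
class ResourceTracker:
    """Illustrative only: track which keys currently hold worker resources."""

    def __init__(self, available_resources):
        self.available_resources = dict(available_resources)
        self._holders = {}  # key -> the resource_restrictions it claimed

    def claim(self, key, resource_restrictions):
        # Called when a task transitions ready -> executing.
        for resource, quantity in resource_restrictions.items():
            self.available_resources[resource] -= quantity
        self._holders[key] = dict(resource_restrictions)

    def release(self, key):
        # Called when the task leaves the executing state.
        for resource, quantity in self._holders.pop(key, {}).items():
            self.available_resources[resource] += quantity

    def reconcile(self, tasks):
        # Periodic check (e.g. from ensure_computing): give back anything
        # held by a key that is no longer executing.
        for key in list(self._holders):
            ts = tasks.get(key)
            if ts is None or ts.state != "executing":
                self.release(key)
```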
Also, it looks like `steal_request` doesn't need to worry about releasing `available_resources`, because they aren't "taken" until the task transitions from `ready -> executing`, at which point it's no longer eligible to be stolen.
The constrained -> waiting thing might connect to #4446. Everything else LGTM.
Hmm, conversely, it might have been a fluke? It looks like perhaps …
Well, the runspec is reset a few lines above the release, and yes, I tried to map as much as possible via transitions instead of in-place mutations to ensure all connected state information is properly modified (dependents, dependencies, waiting for, etc.). I am still convinced that the way we're releasing tasks is also a bit buggy, or at the very least not very transparent. I added this commit trying to handle this more transparently. It is currently otherwise very difficult to know what the release actually does. For example, is the key still in self.tasks or not? That is determined by … Ultimately, I can't tell you which direction is the right one; that's why my first PR escalated so badly :(

Do you have an idea which test deadlocked? I can try to run it on my machine and see if I can reproduce it.

What my last efforts debugging this taught me is that there is no such thing as an unrelated deadlock. It's all just a big ball of spaghetti ;)
Oof, too true. I was thinking it might've been a known flaky test, but GitHub Actions doesn't hang on to logs for super long; I went looking for a previous similar timeout but didn't find one.
Force-pushed from dbb2b33 to 501188d
Force-pushed from 501188d to 4602117
Hey @fjetter -- just rebased against master and should be up to date now. If you have the opportunity to run this in one of your bigger workloads and see how it performs, that would be very helpful. I've run the test suite locally a few hundred times at this point (focusing on work-stealing, worker failure, and regular worker tests) and things seem to be stable.
I'll test this on our workloads and will report back.

Unfortunately, I hit #4538 when trying this, so no results yet.
I reverted the commit which likely caused #4538 in order to run some tests with this PR. Unfortunately, the very first try already hit a deadlock. I'm in an upscaling scenario where I start with one worker and add more and more, i.e. heavy work stealing. After some upscaling, the cluster distributes some of the tasks and works on them, but only on about half. The view from the scheduler (after it became stale) shows that the worker is supposedly still processing ~70k tasks and doesn't have any in memory. However, when asking the worker about its tasks, I can see that it has two tasks in memory. The state between scheduler and worker seems to have drifted significantly. Asking the scheduler/worker for the story of one of the processing tasks, I get only a single entry.
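For reference, this is roughly the kind of cross-check being described -- a hedged sketch only: the address, key, and helper function are placeholders, and it assumes the `Worker.story`/`Scheduler.story` and `Client.run`/`Client.run_on_scheduler` utilities available at the time:

```python
from distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address
key = "stuck-task-key"                           # placeholder key of a suspect task

def worker_view(dask_worker, key=key):
    # What this worker itself believes about the task.
    ts = dask_worker.tasks.get(key)
    return None if ts is None else (ts.state, dask_worker.story(key))

print(client.run(worker_view))  # one entry per worker
print(client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.story(key)))
```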
Thanks for the report @fjetter -- and for the description of the failing test. I'll also try to work up a test case that follows that pattern and see if I can reproduce locally.
I had another look and couldn't find any more useful logs, but I noticed that our scheduler was overloaded the entire time, for whatever reason. That at least explains why the work stealing was working so poorly, but not why the initial worker was lazy and didn't process any results, nor why everything was entirely out of sync. Regardless, what I've seen didn't look healthy. I'll try to perform a few more test runs later today/tomorrow.
Had another test run which ran through, but I received a lot of logs like …

I hit the …

Changes: …
@fjetter -- if you have an opportunity later this week to run your heavy stealing workload against this branch, that would be a big help. Thanks!
I am repeating myself, but I believe we need to eventually rework the release mechanism. I have the feeling this is mostly educated guessing about when we must release a key, when we should, and when we must not release it. Not sure how you feel about this. Either way, I'm glad if it works, and reworking this is definitely out of scope for this PR.

Yes, I think I can find time for this later today.

Oh, definitely. I think reworking the release mechanism and adding in a chained-transition system are the next steps once this is stable.
I gave this another spin and couldn't find anything wrong with it. That doesn't necessarily mean a lot but it's a very good start.
I'd be willing to merge this now but there is a release scheduled for tomorrow and I'm wondering if we want to let this sit a while before releasing. Opinions? cc @gforsyth @jrbourbeau
🎉 wooooooo! I'll defer to @jrbourbeau on whether we want to try to sneak this into the next release or let it sit on master for a little bit.
I'd prefer to wait until after releasing to merge this in, if that's alright.
Should only be set for `fetch` or `flight` -- it's possible, following failed workers, that a task will end up looking remote when it isn't, so we discard the key from `has_what` when a task is transitioned to `ready`.
Tasks can transition from `fetch` -> `waiting` or `flight` -> `waiting` if the worker that has the data dies before the transfer is completed. If that task is reassigned to a worker that was expecting just the data, add the `runspec` and clear out any references to where the data _was_ coming from, since it will now be executed locally (see the sketch after these notes).
Trying this out, as it might not be needed anymore (but can revert if tests complain).
There's a very real chance for a task to be reassigned here, and the extra task doesn't hurt anything.
E.g. don't add keys which will be fetched from other workers -- that information belongs in `TaskState.waiting_for_data`.
We shouldn't call `release_key` in transition functions at all, it is very likely to cause mixups. There's an outside chance this leads to worker memory problems over very long jobs, but I think that is better handled by a cleanup callback.
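As a rough illustration of the `fetch`/`flight` -> `waiting` case described above -- a simplified, hypothetical transition function, not the exact Worker code; the attribute names mirror the worker `TaskState` but are assumptions here:

```python
def transition_flight_waiting(ts, runspec):
    # The worker holding the data died mid-transfer and the scheduler has
    # reassigned the task to us for execution rather than as a dependency.
    ts.state = "waiting"
    ts.runspec = runspec     # we will execute it locally now
    ts.who_has.clear()       # forget the (dead) workers we meant to fetch from
    ts.coming_from = None    # no transfer is in flight anymore
```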
Hey @jrbourbeau -- are we good to merge this now?

Just a gentle ping on this to see if anyone has time to review further and/or merge.

I would go ahead and merge this. I'm just a bit hesitant since lately I'm not very involved and don't know if this interferes with any other change. @jrbourbeau, can you give a quick "go ahead" / merge unless you are aware of any other big change which could interfere?
    # TODO: move transition of `ts` to end of `add_task`
    # This will require a chained recommendation transition system like
    # the scheduler
Does this need to be done in this PR or is this follow-up work?
    ts.waiting_for_data.add(dep_ts.key)
    self.waiting_for_data_count += 1
    # check to ensure task wasn't already executed and partially released
    # TODO: make this less bad
I'm not sure if this is bad aesthetically or from a logic perspective. Is there anything else we should do here in this PR?
Hey @jrbourbeau! Yeah, both of those TODOs are for the next PR, which is going to add a chained recommendation system for transitions.
Taking a pass at the new control flow that @fjetter laid out in #4413. Many thanks to him for the very detailed write-up and diagnosis of the issue.

To disambiguate a task that is to be executed vs. a task which is a data container (a dependency to be fetched from another worker), I've added a new state called `fetch`. The existing state `waiting` now only refers to tasks which will be executed on the worker.

I think the `add_task` function needs to be cleaned up, and I also agree with @fjetter that we should have a similar set of chained recommendations like the scheduler does, but hopefully this is a step in that direction.

test_failed_workers.py::test_broken_worker_during_computation fails once every 30 runs or so on my machine; I'd like to fix that before this gets merged in, but the more explicit states make hunting down inconsistent state a fair bit easier.

Also needs updates to docs for the new state and the new worker flow.
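To make the new distinction concrete, here is a minimal sketch of how a worker could classify an incoming task under the new state -- illustrative only, not the actual `add_task` logic; the helper name and arguments are assumptions:

```python
def initial_state(runspec, who_has):
    """Pick the starting state for a worker-side TaskState.

    - A task we are asked to execute starts in `waiting` (its dependencies may
      still need to be gathered before it can become ready/executing).
    - A pure data dependency, to be pulled from another worker, starts in
      `fetch` and moves to `flight` once the transfer is underway.
    """
    if runspec is not None:
        return "waiting"
    if who_has:
        return "fetch"
    return "new"


# Example:
assert initial_state(runspec=("mul", (1, 2), {}), who_has=[]) == "waiting"
assert initial_state(runspec=None, who_has=["tcp://other-worker:1234"]) == "fetch"
```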