
Avoid transitioning from memory/released to missing in worker #6123

Merged · 18 commits · Apr 15, 2022

Conversation

mrocklin (Member)

This is to be used for more strenuous stress testing
mrocklin and others added 2 commits April 13, 2022 19:37
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>

github-actions bot commented Apr 14, 2022

Unit Test Results

    16 files  ±0        16 suites  ±0    8h 2m 34s ⏱️ +17m 11s
 2 742 tests +1     2 660 ✔️ ±0       81 💤 +1    1 ❌ ±0
21 821 runs  +8    20 785 ✔️ +6    1 035 💤 +2    1 ❌ ±0

For more details on these failures, see this check.

Results for commit ee8065a. ± Comparison against base commit cdbb426.

♻️ This comment has been updated with latest results.

Is this right?  My guess is that things happened in the meantime that
mean that we don't actually want these tasks any more.  I'm not certain
though.
Comment on lines 3096 to 3097
    if ts.state != "released":
        recommendations[ts] = "fetch" if ts.who_has else "missing"
Member Author

@fjetter @gjoseph92 what do you think about these changes?

We were seeing released -> missing transitions in the rechunk/killworker test (also in this branch now). My guess is that we just want to keep things as released. Is that correct in your view?

Member

The key must not be in the released state at this point. Do you have a story of the events leading up to this?

Member

Couldn't reproduce it yet.

Keys are not allowed to be released at this point. This was one of the major sources of deadlocks last year: keys were released while either being fetched or executed, which created horrible race conditions. That's why we have this weird cancelled state, to ensure this cannot happen.

We should not add this condition here. The finally clause of gather_dep is likely the most buggy and fragile piece of our code base right now, and I really don't want to add any other logic to it without a reliable way of triggering this condition.

Member Author

Couldn't reproduce it yet.

You may have to run the test a few times. I also agree with you about the oddities of erring in these methods: errors should percolate upwards to the entire worker, and out to the client.

Member Author

Running at scale with Coiled, this fix does make the problem go away.

Comment on lines 3092 to 3097
-            recommendations[ts] = "fetch" if ts.who_has else "missing"
+            if ts.state != "released":
+                recommendations[ts] = "fetch" if ts.who_has else "missing"
Member

We should not execute this entire clause now that we're dealing with the OSError above separately. It should also be something like:

elif ts not in recommendations and ts.state in {"fetch", "flight"}:
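For context, a self-contained sketch of the guard being suggested here; the helper name and wrapper are illustrative only, not code from worker.py:

    # Illustrative helper, not the PR's code: reroute a task only if it is still in a
    # data-fetching state and no earlier branch has already recommended something for it.
    def recommend_if_still_fetching(ts, recommendations):
        # `ts` is assumed to expose .state and .who_has, like distributed's TaskState
        if ts not in recommendations and ts.state in {"fetch", "flight"}:
            recommendations[ts] = "fetch" if ts.who_has else "missing"
        return recommendations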

Member

This is very brittle. I hate it. We only want to execute parts of the finally clause...

Member

another "brittle an very ugly" fix might be to create a set in the os error handler to remember the tasks we don't want to be handled here.

Collaborator

ts not in recommendations is already basically handled a few lines above: https://github.com/dask/distributed/pull/6123/files#diff-0c7375c95aaa92986f8ed5460bae4e98720a14fca13556e01fd44ab6b927c24dR3089

Doesn't this clause only come into play if the get-data request succeeds, but the worker just doesn't have the key?

I don't understand how the task could be released here. I'm curious if just https://github.com/dask/distributed/pull/6123/files#r850533377 would fix things without your change here.

Member Author

elif ts not in recommendations and ts.state in {"fetch", "flight"}:

This change would cause a few tests to fail, among them distributed/tests/test_worker.py::test_task_flight_compute_oserror.

Member Author

I don't understand how the task could be released here. I'm curious if just https://github.com/dask/distributed/pull/6123/files#r850533377 would fix things without your change here.

This doesn't seem to point to anything in particular. Maybe the link was broken by a push?

fjetter (Member) left a comment

I don't think the solution is pretty. Can you try setting up a more thorough test? This test only triggers the condition some of the time for me. Most recently I hit the problem of the TaskState being in "memory" and not "released".


fjetter commented Apr 14, 2022

We should not transition memory to missing in the OSError handler. If we change this and the tests are green, I'm OK with merging. This should already be much better than what we have. This may only be a partial fix; we'll take another look at it after the release to build a solid test case.
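In code, that intent reads roughly like the sketch below: a simplified stand-in for the OSError branch of gather_dep, with the surrounding names assumed from the diff excerpts in this thread rather than copied from worker.py:

    # Simplified sketch: when the remote worker errors out, drop it from each task's
    # who_has, but never push a task already in "memory" (or "released") to "missing".
    def handle_gather_oserror(tasks, has_what, bad_worker, recommendations):
        for key in has_what:
            ts = tasks[key]
            ts.who_has.discard(bad_worker)
            if not ts.who_has and ts.state not in ("released", "memory"):
                recommendations[ts] = "missing"
        return recommendations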

gjoseph92 (Collaborator) left a comment

@mrocklin it might be worth splitting this into two PRs. We could get the KillWorker plugin merged right now. Then discuss the fix to the deadlock found in test_chaos_rechunk separately.


mrocklin (Member Author)

@mrocklin it might be worth splitting this into two PRs. We could get the KillWorker plugin merged right now. Then discuss the fix to the deadlock found in test_chaos_rechunk separately.

Yeah, I was thinking something similar, although for slightly different reasons. I agree with getting KillWorker in soon. I suspect that I'll also want to get the rest of the work (or a cleaned up version of it) in too, but that we'll want to revert it when we figure out a more grounded solution.

This is to be used for more strenuous stress testing
mrocklin mentioned this pull request Apr 14, 2022
mrocklin changed the title from "Add KillWorker worker plugin" to "Add killworker test for array rechunk" Apr 14, 2022
mrocklin changed the title from "Add killworker test for array rechunk" to "Add KillWorker test and handle InvalidTransition issue" Apr 14, 2022
mrocklin (Member Author)

Well here is an interesting failure (rarely occurs, fortunately CI caught it)

  File "D:\a\distributed\distributed\distributed\worker.py", line 2647, in _handle_stimulus_from_task

    self.handle_stimulus(stim)

  File "D:\a\distributed\distributed\distributed\worker.py", line 2633, in handle_stimulus

    self.transitions(recs, stimulus_id=stim.stimulus_id)

  File "D:\a\distributed\distributed\distributed\worker.py", line 2609, in transitions

    a_recs, a_instructions = self._transition(

  File "D:\a\distributed\distributed\distributed\worker.py", line 2521, in _transition

    recs, instructions = func(ts, *args, stimulus_id=stimulus_id, **kwargs)

  File "D:\a\distributed\distributed\distributed\worker.py", line 2381, in transition_ready_executing

    assert all(

I think that this was in distributed/tests/test_client.py::test_threadsafe_get: https://github.com/dask/distributed/runs/6032728170?check_suite_focus=true

I'm going to relax the change back to just the released state.

mrocklin (Member Author)

OK, I've got the test running with more sensitivity on my local machine (haven't pushed this up yet). I plan to walk back the changes to just released+memory (clearly things that weren't working before) rather than go ahead with an allow-list (harder to reason about in the short-term). I'm going to try expanding the list outwards and see what I get to. However, that'll be for tomorrow. Calling it a night.

mrocklin changed the title from "Add KillWorker test and handle InvalidTransition issue" to "Avoid transitioning from memory/released to missing in worker" Apr 15, 2022
mrocklin (Member Author)

OK, this feels good to me. @gjoseph92 if you can take a look in the morning that would be welcome.

mrocklin (Member Author)

Tests are good so far. Intermittent failure handled separately in #6131

Comment on lines +3094 to +3097
    if ts.who_has:
        recommendations[ts] = "fetch"
    elif ts.state not in ("released", "memory"):
        recommendations[ts] = "missing"
Collaborator

Could the task be in memory or missing even if who_has is non-empty?

Guess I'm wondering why this isn't:

Suggested change
-    if ts.who_has:
-        recommendations[ts] = "fetch"
-    elif ts.state not in ("released", "memory"):
-        recommendations[ts] = "missing"
+    if ts.state not in ("released", "memory"):
+        recommendations[ts] = "fetch" if ts.who_has else "missing"

For that matter, could ts be errored? If ts was stolen to this worker (and executed) in the time since the gather_dep started, and execution failed? What about other terminal states?

I feel like a passlist here (likely fetch/flight) would still make more sense. If some action has occurred since gather_dep started which has changed the task's state to something non-dependency-fetch-y, that probably takes precedence over re-fetching it, right? I'd be curious if/why a passlist doesn't work.
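To make the two options concrete, here is a hypothetical side-by-side of the blocklist the PR uses versus the passlist argued for above (illustrative sketches, not code from this PR):

    # Blocklist form (what the diff above does): skip only known-bad target states.
    def recommend_blocklist(ts, recommendations):
        if ts.who_has:
            recommendations[ts] = "fetch"
        elif ts.state not in ("released", "memory"):
            recommendations[ts] = "missing"
        return recommendations

    # Passlist form (what this comment argues for): act only on fetch-y states, so any
    # other state reached since gather_dep started takes precedence and is left alone.
    def recommend_passlist(ts, recommendations):
        if ts.state in ("fetch", "flight"):
            recommendations[ts] = "fetch" if ts.who_has else "missing"
        return recommendations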

Member Author

I'm happy to make a change like what is above because it seems safe relative to what is here already.

When I implemented the allowlist/passlist, tests failed, so our reasoning here is not yet correct. I think that we should absolutely pursue this, but we should handle it separately. I'd like to get this in for a release, and I don't want to increase the scope much beyond "avoid released/memory -> missing transitions". I totally get that we might want to do more eventually, but I don't think that we need to do more in order to do this.

Member Author

I'm happy to make a change like what is above because it seems safe relative to what is here already.

Wait, no, it looks like that would exclude more paths. I'm really looking for a minimal set of changes that avoids the deadlock for right now.

@@ -3044,7 +3046,7 @@ async def gather_dep(
             for d in has_what:
                 ts = self.tasks[d]
                 ts.who_has.remove(worker)
-                if not ts.who_has:
+                if not ts.who_has and ts.state not in ("released", "memory"):
Collaborator

I see the failure in https://github.com/dask/distributed/runs/6032728170?check_suite_focus=true

            assert all(
                dep.key in self.data or dep.key in self.actors
                for dep in ts.dependencies
            )

but I'm not following why relaxing this from ts.state in ("fetch", "flight") caused it. I accept that it seems to work, but I'd like to understand it.

If you'd be willing to add an assert ts.state in ("fetch", "flight"), ts.state within this current conditional, I'd be very curious to see what fails in CI. I can't really make sense of why we'd want to transition tasks in any other state to missing.
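Spelled out, the suggested debugging aid would look roughly like this (a sketch only; the guard is paraphrased from the diff above, and recommending "missing" in the body is my assumption about the surrounding code):

    def recommend_missing_with_assert(ts, recommendations):
        # Same guard as the PR, plus an assert so CI surfaces any state other than
        # "fetch"/"flight" that still reaches this branch.
        if not ts.who_has and ts.state not in ("released", "memory"):
            assert ts.state in ("fetch", "flight"), ts.state
            recommendations[ts] = "missing"
        return recommendations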

Member Author

I'd like to understand it.

I encourage you to understand it :)

If you'd be willing to add an assert ts.state in ("fetch", "flight"), ts.state within this current conditional, I'd be very curious to see what fails in CI

I think that we should do this in another PR and get this in as it is. I'd like to resolve this deadlock and get out a release.

mrocklin (Member Author)

OK, I'm going to go ahead and merge this. The feedback here is good, and points us in a direction with two steps:

  1. We should figure out what is going on here
  2. We should try to expand the coverage of which other states we might want to disallow

No one seems to object to disallowing these transitions, though. I'm going to start here, and we can expand as it proves useful. Merging in 1hr for the release.

gjoseph92 (Collaborator)

I tested on the plane, and adding resumed to the allowlist (instead of the blocklist you have here) seems to make it work. I'm guessing that was the issue, since resumed is another "fetch-y" state.
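In terms of the passlist sketch earlier in this thread, that would amount to something like the following (again purely illustrative, not code from this PR):

    # Hypothetical: treat "resumed" as another fetch-y state, per the comment above,
    # so tasks in that state are not left behind by the passlist.
    def recommend_passlist(ts, recommendations):
        if ts.state in ("fetch", "flight", "resumed"):
            recommendations[ts] = "fetch" if ts.who_has else "missing"
        return recommendations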

mrocklin (Member Author)

I think that I'm suspicious enough that I'm going to merge this in without adding the extra changes (they seem to strictly increase the scope and risk, to me). I'll submit a second PR directly afterwards, though.

mrocklin (Member Author)

Following up in #6140

            return
        await asyncio.sleep(0.1)

    await z.cancel()
Collaborator

What does this actually test? Every 4 seconds, all workers will shut down, losing all accrued progress, and will then start being restarted by the nannies. Over time, timings will drift apart and reach a somewhat random distribution of restarts. It's close to impossible to reach z.status == "finished", as it would take way more than 4s to finish (it takes 13s without KillWorker on a speedy laptop). Because of the very high allowed_failures, status should never be failed. If the cluster deadlocks, e.g. due to illegal transitions, the test won't notice.

Member Author

See #6134
