
Avoid transitioning from memory/released to missing in worker #6123

Merged · 18 commits · Apr 15, 2022

Conversation

mrocklin (Member)

This is to be used for more strenuous stress testing
mrocklin and others added 2 commits April 13, 2022 19:37
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>

github-actions bot commented Apr 14, 2022

Unit Test Results

    16 files  ±0        16 suites  ±0    8h 2m 34s ⏱️ +17m 11s
 2 742 tests +1     2 660 ✔️ ±0       81 💤 +1    1 ❌ ±0
21 821 runs  +8    20 785 ✔️ +6    1 035 💤 +2    1 ❌ ±0

For more details on these failures, see this check.

Results for commit ee8065a. ± Comparison against base commit cdbb426.

♻️ This comment has been updated with latest results.

Is this right?  My guess is that things happened in the meantime that
mean that we don't actually want these tasks any more.  I'm not certain
though.
Comment on lines 3096 to 3097
    if ts.state != "released":
        recommendations[ts] = "fetch" if ts.who_has else "missing"
Member Author

@fjetter @gjoseph92 what do you think about these changes?

We were seeing released -> missing transitions in the rechunk/killworker test (also in this branch now). My guess is that we just want to keep things as released. Is that correct in your view?

Member

The key must not be in the released state at this point. Do you have a story of the events leading up to this?

Member

Couldn't reproduce it yet.

Keys are not allowed to be released at this point. This was one of the major sources of deadlocks last year: keys were released while either being fetched or executed, which created horrible race conditions. That's why we have this weird cancelled state, to ensure this cannot happen.

We should not add this condition here. The finally clause of gather_dep is likely the most buggy and fragile piece of our code base right now, and I really don't want to add any other logic to it without a reliable way of triggering this condition.

Member Author

Couldn't reproduce it yet.

You may have to run the test a few times. I also agree with you about the oddities of erring in these methods: errors should percolate upwards to the entire worker, and out to the client.

Member Author

Running at scale with Coiled, this fix does make the problem go away.

Comment on lines 3092 to 3097
-            recommendations[ts] = "fetch" if ts.who_has else "missing"
+            if ts.state != "released":
+                recommendations[ts] = "fetch" if ts.who_has else "missing"
Member

We should not execute this entire clause now that we're dealing with the OSError above separately. It should also be something like:

elif ts not in recommendations and ts.state in {"fetch", "flight"}:
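For context, a self-contained sketch of the guard being suggested here; the helper name and wrapper are illustrative only, not code from worker.py:

    # Illustrative helper, not the PR's code: reroute a task only if it is still in a
    # data-fetching state and no earlier branch has already recommended something for it.
    def recommend_if_still_fetching(ts, recommendations):
        # `ts` is assumed to expose .state and .who_has, like distributed's TaskState
        if ts not in recommendations and ts.state in {"fetch", "flight"}:
            recommendations[ts] = "fetch" if ts.who_has else "missing"
        return recommendations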

Member

This is very brittle. I hate it. We only want to execute parts of the finally clause...

Member

another "brittle an very ugly" fix might be to create a set in the os error handler to remember the tasks we don't want to be handled here.

Collaborator

ts not in recommendations is already basically handled a few lines above: https://github.com/dask/distributed/pull/6123/files#diff-0c7375c95aaa92986f8ed5460bae4e98720a14fca13556e01fd44ab6b927c24dR3089

Doesn't this clause only come into play if the get-data request succeeds, but the worker just doesn't have the key?

I don't understand how the task could be released here. I'm curious if just https://github.com/dask/distributed/pull/6123/files#r850533377 would fix things without your change here.

Member Author

elif ts not in recommendations and ts.state in {"fetch", "flight"}:

This change would cause a few tests to fail, among them distributed/tests/test_worker.py::test_task_flight_compute_oserror.

Member Author

I don't understand how the task could be released here. I'm curious if just https://github.com/dask/distributed/pull/6123/files#r850533377 would fix things without your change here.

This doesn't seem to point to anything in particular. Maybe the link was broken by a push?

fjetter (Member) left a comment

I don't think the solution is pretty. Can you try setting up a more thorough test? This test only triggers the condition some of the time for me. Most recently I hit the problem of the TaskState being in "memory" and not "released".


fjetter commented Apr 14, 2022

We should not transition memory to missing in the OSError handler. If we change this and the tests are green, I'm OK with merging. This should already be much better than what we have. This may only be a partial fix; we'll take another look at it after the release to build a solid test case.
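In code, that intent reads roughly like the sketch below: a simplified stand-in for the OSError branch of gather_dep, with the surrounding names assumed from the diff excerpts in this thread rather than copied from worker.py:

    # Simplified sketch: when the remote worker errors out, drop it from each task's
    # who_has, but never push a task already in "memory" (or "released") to "missing".
    def handle_gather_oserror(tasks, has_what, bad_worker, recommendations):
        for key in has_what:
            ts = tasks[key]
            ts.who_has.discard(bad_worker)
            if not ts.who_has and ts.state not in ("released", "memory"):
                recommendations[ts] = "missing"
        return recommendations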

gjoseph92 (Collaborator) left a comment

@mrocklin it might be worth splitting this into two PRs. We could get the KillWorker plugin merged right now. Then discuss the fix to the deadlock found in test_chaos_rechunk separately.


mrocklin (Member Author)

@mrocklin it might be worth splitting this into two PRs. We could get the KillWorker plugin merged right now. Then discuss the fix to the deadlock found in test_chaos_rechunk separately.

Yeah, I was thinking something similar, although for slightly different reasons. I agree with getting KillWorker in soon. I suspect that I'll also want to get the rest of the work (or a cleaned up version of it) in too, but that we'll want to revert it when we figure out a more grounded solution.

This is to be used for more strenuous stress testing
mrocklin mentioned this pull request Apr 14, 2022
mrocklin changed the title from "Add KillWorker worker plugin" to "Add killworker test for array rechunk" Apr 14, 2022
mrocklin changed the title from "Add killworker test for array rechunk" to "Add KillWorker test and handle InvalidTransition issue" Apr 14, 2022
mrocklin (Member Author)

Well here is an interesting failure (rarely occurs, fortunately CI caught it)

  File "D:\a\distributed\distributed\distributed\worker.py", line 2647, in _handle_stimulus_from_task

    self.handle_stimulus(stim)

  File "D:\a\distributed\distributed\distributed\worker.py", line 2633, in handle_stimulus

    self.transitions(recs, stimulus_id=stim.stimulus_id)

  File "D:\a\distributed\distributed\distributed\worker.py", line 2609, in transitions

    a_recs, a_instructions = self._transition(

  File "D:\a\distributed\distributed\distributed\worker.py", line 2521, in _transition

    recs, instructions = func(ts, *args, stimulus_id=stimulus_id, **kwargs)

  File "D:\a\distributed\distributed\distributed\worker.py", line 2381, in transition_ready_executing

    assert all(

I think that this was in distributed/tests/test_client.py::test_threadsafe_get: https://github.com/dask/distributed/runs/6032728170?check_suite_focus=true

I'm going to relax the change back to just the released state.

mrocklin (Member Author)

OK, I've got the test running with more sensitivity on my local machine (haven't pushed this up yet). I plan to walk back the changes to just released+memory (clearly things that weren't working before) rather than go ahead with an allow-list (harder to reason about in the short-term). I'm going to try expanding the list outwards and see what I get to. However, that'll be for tomorrow. Calling it a night.

mrocklin changed the title from "Add KillWorker test and handle InvalidTransition issue" to "Avoid transitioning from memory/released to missing in worker" Apr 15, 2022
mrocklin (Member Author)

OK, this feels good to me. @gjoseph92 if you can take a look in the morning that would be welcome.

mrocklin (Member Author)

Tests are good so far. Intermittent failure handled separately in #6131

Comment on lines +3094 to +3097
    if ts.who_has:
        recommendations[ts] = "fetch"
    elif ts.state not in ("released", "memory"):
        recommendations[ts] = "missing"
Collaborator

Could the task be in memory or missing even if who_has is non-empty?

Guess I'm wondering why this isn't:

Suggested change
-    if ts.who_has:
-        recommendations[ts] = "fetch"
-    elif ts.state not in ("released", "memory"):
-        recommendations[ts] = "missing"
+    if ts.state not in ("released", "memory"):
+        recommendations[ts] = "fetch" if ts.who_has else "missing"

For that matter, could ts be errored? If ts was stolen to this worker (and executed) in the time since the gather_dep started, and execution failed? What about other terminal states?

I feel like a passlist here (likely fetch/flight) would still make more sense. If some action has occurred since gather_dep started which has changed the task's state to something non-dependency-fetch-y, that probably takes precedence over re-fetching it, right? I'd be curious if/why a passlist doesn't work.
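To make the two options concrete, here is a hypothetical side-by-side of the blocklist the PR uses versus the passlist argued for above (illustrative sketches, not code from this PR):

    # Blocklist form (what the diff above does): skip only known-bad target states.
    def recommend_blocklist(ts, recommendations):
        if ts.who_has:
            recommendations[ts] = "fetch"
        elif ts.state not in ("released", "memory"):
            recommendations[ts] = "missing"
        return recommendations

    # Passlist form (what this comment argues for): act only on fetch-y states, so any
    # other state reached since gather_dep started takes precedence and is left alone.
    def recommend_passlist(ts, recommendations):
        if ts.state in ("fetch", "flight"):
            recommendations[ts] = "fetch" if ts.who_has else "missing"
        return recommendations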

Member Author

I'm happy to make a change like what is above because it seems safe relative to what is here already.

When I implemented the allowlist/passlist, tests failed, so our reasoning here is not yet correct. I think that we should absolutely pursue this, but we should handle it separately. I'd like to get this in for a release, and I don't want to increase the scope much beyond "avoid released/memory -> missing transitions". I totally get that we might want to do more eventually, but I don't think that we need to do more in order to do this.

Member Author

I'm happy to make a change like what is above because it seems safe relative to what is here already.

Wait, no, it looks like that would exclude more paths. I'm really looking for a minimal set of changes that avoids the deadlock for right now.

@@ -3044,7 +3046,7 @@ async def gather_dep(
             for d in has_what:
                 ts = self.tasks[d]
                 ts.who_has.remove(worker)
-                if not ts.who_has:
+                if not ts.who_has and ts.state not in ("released", "memory"):
Collaborator

I see the failure in https://github.com/dask/distributed/runs/6032728170?check_suite_focus=true

            assert all(
                dep.key in self.data or dep.key in self.actors
                for dep in ts.dependencies
            )

but I'm not following why relaxing this from ts.state in ("fetch", "flight") caused it. I accept that it seems to work, but I'd like to understand it.

If you'd be willing to add an assert ts.state in ("fetch", "flight"), ts.state within this current conditional, I'd be very curious to see what fails in CI. I can't really make sense of why we'd want to transition tasks in any other state to missing.
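Spelled out, the suggested debugging aid would look roughly like this (a sketch only; the guard is paraphrased from the diff above, and recommending "missing" in the body is my assumption about the surrounding code):

    def recommend_missing_with_assert(ts, recommendations):
        # Same guard as the PR, plus an assert so CI surfaces any state other than
        # "fetch"/"flight" that still reaches this branch.
        if not ts.who_has and ts.state not in ("released", "memory"):
            assert ts.state in ("fetch", "flight"), ts.state
            recommendations[ts] = "missing"
        return recommendations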

Member Author

I'd like to understand it.

I encourage you to understand it :)

If you'd be willing to add an assert ts.state in ("fetch", "flight"), ts.state within this current conditional, I'd be very curious to see what fails in CI

I think that we should do this in another PR and get this in as it is. I'd like to resolve this deadlock and get out a release.

mrocklin (Member Author)

OK, I'm going to go ahead and merge this. The feedback here is good, and points us in a direction with two steps:

  1. We should figure out what is going on here
  2. We should try to expand the coverage of which other states we might want to disallow

No one seems to object to disallowing these transitions, though. I'm going to start here, and we can expand as it proves useful. Merging in 1hr for the release.

gjoseph92 (Collaborator)

I tested on the plane, and adding resumed to the allowlist (instead of the blocklist you have here) seems to make it work. I'm guessing that was the issue, since resumed is another "fetch-y" state.
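In terms of the passlist sketch earlier in this thread, that would amount to something like the following (again purely illustrative, not code from this PR):

    # Hypothetical: treat "resumed" as another fetch-y state, per the comment above,
    # so tasks in that state are not left behind by the passlist.
    def recommend_passlist(ts, recommendations):
        if ts.state in ("fetch", "flight", "resumed"):
            recommendations[ts] = "fetch" if ts.who_has else "missing"
        return recommendations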

mrocklin (Member Author)

I think that I'm suspicious enough that I'm going to merge this in without adding the extra changes (they seem to strictly increase the scope and risk, to me). I'll submit a second PR directly afterwards, though.

mrocklin (Member Author)

Following up in #6140

            return
        await asyncio.sleep(0.1)

    await z.cancel()
Collaborator

What does this actually test? Every 4 seconds, all workers will shut down, losing all accrued progress, and will then start being restarted by the nannies. Over time, timings will drift apart and reach a somewhat random distribution of restarts. It's close to impossible to reach z.status == "finished", as it would take way more than 4s to finish (it takes 13s without KillWorker on a speedy laptop). Because of the very high allowed_failures, status should never be failed. If the cluster deadlocks, e.g. due to illegal transitions, the test won't notice.

Member Author

See #6134
