KeyError in gather_dep #6194

Closed
mrocklin opened this issue Apr 25, 2022 · 0 comments · Fixed by #6217

This was found when running distributed/tests/test_stress.py::test_chaos_rechunk

Traceback (most recent call last):
  File "/home/mrocklin/workspace/distributed/distributed/utils.py", line 759, in wrapper
    return await func(*args, **kwargs)
  File "/home/mrocklin/workspace/distributed/distributed/worker.py", line 3090, in gather_dep
    ts = self.tasks[d]
KeyError: "('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)"

Story

("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'ensure-task-exists', 'released', 'compute-task-1650891186.0628088', 1650891186.080167)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'released', 'fetch', 'fetch', {}, 'compute-task-1650891186.0628088', 1650891186.0802405)
('gather-dependencies', 'tcp://127.0.0.1:44059', {"('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", "('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1018)", "('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1014)"}, 'ensure-communicating-1650891186.0806584', 1650891186.0826457)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'fetch', 'flight', 'flight', {}, 'ensure-communicating-1650891186.0806584', 1650891186.0826836)
('request-dep', 'tcp://127.0.0.1:44059', {"('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", "('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1018)", "('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1014)"}, 'ensure-communicating-1650891186.0806584', 1650891186.0840328)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'flight', 'released', 'cancelled', {}, 'processing-released-1650891186.0754924', 1650891186.16295)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'compute-task', 'compute-task-1650891186.392905', 1650891186.4060445)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'cancelled', 'waiting', 'cancelled', {"('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)": ('resumed', 'waiting')}, 'compute-task-1650891186.392905', 1650891186.40608)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'cancelled', 'resumed', 'resumed', {}, 'compute-task-1650891186.392905', 1650891186.406102)
('free-keys', ("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)",), 'processing-released-1650891188.465502', 1650891188.5917683)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'release-key', 'processing-released-1650891188.465502', 1650891188.5917764)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'resumed', 'released', 'released', {"('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)": 'forgotten'}, 'processing-released-1650891188.465502', 1650891188.5917976)
("('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)", 'released', 'forgotten', 'forgotten', {}, 'processing-released-1650891188.465502', 1650891188.591806)
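
To make the story above easier to follow, here is a minimal sketch that pulls out only the transition entries for the failing key. It assumes that the 7-element tuples are laid out as (key, start state, requested state, actual state, recommendations, stimulus id, timestamp); that reading is inferred from the log above, not taken from the worker source. The `story` list is a hand-copied subset of the entries above, and `actual_states` is a hypothetical helper.

```python
KEY = "('rechunk-split-82117560f6f829a7fa07bfef62cff7d5', 1006)"

# Hand-copied from the story above (only entries whose first field is KEY).
story = [
    (KEY, "ensure-task-exists", "released",
     "compute-task-1650891186.0628088", 1650891186.080167),
    (KEY, "released", "fetch", "fetch", {},
     "compute-task-1650891186.0628088", 1650891186.0802405),
    (KEY, "fetch", "flight", "flight", {},
     "ensure-communicating-1650891186.0806584", 1650891186.0826836),
    (KEY, "flight", "released", "cancelled", {},
     "processing-released-1650891186.0754924", 1650891186.16295),
    (KEY, "cancelled", "waiting", "cancelled", {KEY: ("resumed", "waiting")},
     "compute-task-1650891186.392905", 1650891186.40608),
    (KEY, "cancelled", "resumed", "resumed", {},
     "compute-task-1650891186.392905", 1650891186.406102),
    (KEY, "resumed", "released", "released", {KEY: "forgotten"},
     "processing-released-1650891188.465502", 1650891188.5917976),
    (KEY, "released", "forgotten", "forgotten", {},
     "processing-released-1650891188.465502", 1650891188.591806),
]


def actual_states(story, key):
    """Return the state the task actually ended in after each transition.

    Assumption: transition entries are the 7-tuples whose fifth field is a
    dict of recommendations; shorter entries are other kinds of events.
    """
    return [
        entry[3]
        for entry in story
        if len(entry) == 7 and entry[0] == key and isinstance(entry[4], dict)
    ]


path = actual_states(story, KEY)
print(" -> ".join(path))
```

Read this way, the key is cancelled while in flight, resumed by a later compute-task, and finally forgotten, which is consistent with the lookup `self.tasks[d]` in the traceback finding no entry for it.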

I'm not surprised that a key would be missing, but I am surprised that we're not handling this more gracefully.

cc @fjetter

@mrocklin added the deadlock label (The cluster appears to not make any progress) Apr 25, 2022
mrocklin added a commit to mrocklin/distributed that referenced this issue Apr 25, 2022
A task that is released while still in-flight will stick around in the
in_flight_workers state.  This can cause a KeyError when gather_dep goes
to handle this now-missing-task.

The correct thing to do in this case is to just ignore the key,
which is what this commit does.

We could also keep in_flight_workers up-to-date on a release transition.
I'm not sure how valuable this would be apart from passing.
I'm open to either approach.  This was the easiest.

Fixes dask#6194
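
The "just ignore the key" approach described in the commit message can be sketched roughly as follows. This is a toy model, not the actual `Worker.gather_dep` code: `ToyWorker`, `TaskState`, `handle_gathered`, and the state names used here are all hypothetical stand-ins. The only point it illustrates is replacing an unconditional `self.tasks[d]` lookup with one that tolerates keys released while the request was in flight.

```python
from dataclasses import dataclass, field


@dataclass
class TaskState:
    # Hypothetical, heavily simplified stand-in for a worker's task record.
    key: str
    state: str = "flight"


@dataclass
class ToyWorker:
    # Maps key -> TaskState; released and forgotten keys are removed from here.
    tasks: dict = field(default_factory=dict)

    def handle_gathered(self, requested_keys, received):
        """Process the response to a gather request.

        Keys that were forgotten while the request was in flight are skipped
        instead of raising KeyError.
        """
        handled, skipped = [], []
        for key in requested_keys:
            ts = self.tasks.get(key)  # tolerant lookup instead of self.tasks[key]
            if ts is None:
                skipped.append(key)  # task was released in the meantime
                continue
            # Toy state update: data arrived, or it has to be fetched again.
            ts.state = "memory" if key in received else "fetch"
            handled.append(key)
        return handled, skipped


worker = ToyWorker()
worker.tasks["b"] = TaskState("b")  # "a" was requested too, but has been forgotten
handled, skipped = worker.handle_gathered(["a", "b"], {"b": 2})
print(handled, skipped)
```

The alternative mentioned in the commit message, keeping `in_flight_workers` up to date on a release transition, would avoid the stale key in the first place; the sketch above only covers the tolerant lookup.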
@fjetter fjetter self-assigned this Apr 27, 2022