
AssertionError: waiting on released dep when worker-saturation < .inf #7204

Closed · Tracked by #7213
crusaderky opened this issue Oct 27, 2022 · 3 comments
Labels: bug (Something is broken), scheduler

@crusaderky (Collaborator)

test_decide_worker_rootish_while_last_worker_is_retiring (#7065) fails with worker-saturation: 1.5:

Traceback (most recent call last):
  File "/home/crusaderky/github/distributed/distributed/scheduler.py", line 1418, in validate
    validate_task_state(self)
  File "/home/crusaderky/github/distributed/distributed/scheduler.py", line 8173, in validate_task_state
    assert dts.state != "released", ("waiting on released dep", str(ts), str(dts))
AssertionError: ('waiting on released dep', "<TaskState 'y-1' waiting>", "<TaskState 'x-0' released>")
2022-10-27 11:21:06,662 - distributed.scheduler - ERROR - 
Traceback (most recent call last):
  File "/home/crusaderky/github/distributed/distributed/scheduler.py", line 5053, in validate_key
    func(key)
  File "/home/crusaderky/github/distributed/distributed/scheduler.py", line 4977, in validate_waiting
    assert ts in dts.waiters  # XXX even if dts._who_has?
AssertionError
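
For context, worker-saturation here refers to the distributed.scheduler.worker-saturation setting: any finite value enables root-task queuing, while inf keeps queuing off. A minimal sketch of selecting the configuration this failure was observed under (shown via dask's config mechanism, for illustration only):

import dask

# Sketch: select the scheduler configuration this failure was observed under.
# Any finite worker-saturation (i.e. < inf) enables root-task queuing.
dask.config.set({"distributed.scheduler.worker-saturation": 1.5})
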
@gjoseph92 (Collaborator)

I think this failure is a red herring. I don't think there's actually any issue here.

The test in #7065 is not appropriate to run with queuing on. The test asserts many things that are specific to no-queuing co-assignment.

What happens is simply that this assertion fails with queuing on (since there's obviously no last worker):

assert s.tasks["y-2"].group.last_worker == s.workers[a.address]

Then the test hangs for a while because it's trying to clean up the cluster, but the workers' threadpools are all busy, so eventually it just times out. I don't think the validation errors you're seeing here are real problems; I think they're artifacts of an ugly cluster shutdown.

I haven't been able to come up with a way to reformulate this test into something that would be meaningful with queuing. The test relies on a couple of bad behaviors:

  1. The three x tasks aren't considered root-ish, even though they don't have dependencies, because their task group is too small. So they won't be queued, even though it would be appropriate to queue them.

  2. The y tasks are considered root-ish because of their naming, but structurally they're really not root-ish tasks (see "Fix decide_worker_rootish_queuing_disabled assert", #7065 (comment)). They should be scheduled with regard to their dependencies and not go through the root-task logic at all. Root-ish tasks should, logically, ignore the location of their dependencies when scheduling (if their dependencies mattered, they wouldn't be root-ish). So when queuing is on, they'll be scheduled on any worker.

    That nullifies the core thing this test is trying to test: what if a task should be scheduled onto a particular worker because its dependencies are there, but that worker is closing_gracefully? Well, with queuing on, there's no concept of what worker it "should" be scheduled on, because dependencies are ignored. Additionally, the main thing this test was trying to capture is when last_used_worker is now closing_gracefully. That's irrelevant with queuing, since there's no statefulness and no last_used_worker.

We already have some queuing tests around paused workers. Maybe we should add one for retiring workers too, just to be safe, though it should be the same code path. But I don't think there's a bug here; instead, we should just make the test in #7065 only run when queuing is turned off.
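
A minimal sketch of that restriction, assuming the test keeps using gen_cluster (the decorator arguments here are illustrative, not the real test's): pin worker-saturation to inf in the test's own config so the suite-wide queuing parametrization can't reach it:

import math
from distributed.utils_test import gen_cluster

# Hypothetical sketch: force queuing off for this one test by pinning
# worker-saturation to inf, independent of any suite-wide parametrization.
@gen_cluster(
    client=True,
    config={"distributed.scheduler.worker-saturation": math.inf},
)
async def test_decide_worker_rootish_while_last_worker_is_retiring(c, s, a, b):
    ...  # body unchanged from #7065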

gjoseph92 closed this as not planned on Oct 27, 2022
@crusaderky (Collaborator, Author)

> The three x tasks aren't considered root-ish, even though they don't have dependencies, because their task group is too small. So they won't be queued, even though it would be appropriate to queue them.

By the current definition of rootish, the only way a rootish group could depend on another rootish group would be to have exactly 1 worker with 1 thread:

  • must have at least total_nthreads * 2 + 1 tasks
  • must have at most 4 dependencies

The above cannot be satisfied for both the dependent and the dependency group already with total_nthreads=2.
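
To spell the arithmetic out, here is a minimal sketch of those two criteria (the function name and the way dependencies are counted are assumptions for illustration, not distributed's actual is_rootish code):

def looks_rootish(group_size: int, n_deps: int, total_nthreads: int) -> bool:
    # The two criteria listed above.
    return group_size >= total_nthreads * 2 + 1 and n_deps <= 4

for total_nthreads in (1, 2):
    # Smallest group that can be root-ish at this cluster size:
    dep_group_size = total_nthreads * 2 + 1
    # A group whose tasks depend on that root-ish group carries at least
    # dep_group_size dependencies, so (using the same minimal size for the
    # dependent group) it can itself be root-ish only if dep_group_size <= 4:
    print(total_nthreads, looks_rootish(dep_group_size, dep_group_size, total_nthreads))
# total_nthreads=1 -> True  (dependency group of 3 tasks)
# total_nthreads=2 -> False (dependency group of 5 tasks)
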

@gjoseph92 (Collaborator)

Correct. I'm saying the current definition of rootish is a bit silly because something that is definitively a root task (has no dependencies) is not considered rootish. And something that shouldn't be considered rootish (the y tasks in your graph) is.
