
Repeatedly use the same worker on first task #4637

Closed · mrocklin opened this issue Mar 26, 2021 · 3 comments · Fixed by #4638
Comments

@mrocklin (Member)
Apparently in a quiet cluster we end up sending new tasks to the same worker repeatedly. So probably the following test would fail:

@gen_cluster(client=True)
async def test_round_robin(c, s, a, b):
    await c.submit(inc, 1)
    await c.submit(inc, 2)
    await c.submit(inc, 3)
    assert a.transition_log and b.transition_log

This outcome is determined here:

if ts._dependencies or valid_workers is not None:
    ws = decide_worker(
        ts,
        self._workers_dv.values(),
        valid_workers,
        partial(self.worker_objective, ts),
    )
else:
    worker_pool = self._idle or self._workers
    worker_pool_dv = cast(dict, worker_pool)
    n_workers: Py_ssize_t = len(worker_pool_dv)
    if n_workers < 20:  # smart but linear in small case
        ws = min(worker_pool.values(), key=operator.attrgetter("occupancy"))
    else:  # dumb but fast in large case
        ws = worker_pool.values()[self._n_tasks % n_workers]

It looks like when there are no dependencies and there are only a few workers, we currently choose the worker with minimum occupancy. In a quiet cluster all workers have zero occupancy, so the outcome falls to Python's tie-breaking behavior: min returns the first of several equal-keyed elements, which is the same worker every time.
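To make the tie-breaking concrete, here is a minimal, self-contained illustration (the worker names are made up): min is deterministic and keeps the first element among equal keys, so with all occupancies at zero the first worker in iteration order wins on every call.

```python
# min() keeps the first of several equal-keyed items, so a tie always
# resolves to whichever element comes first in iteration order.
workers = [("worker-a", 0.0), ("worker-b", 0.0), ("worker-c", 0.0)]

# Every occupancy is 0.0, yet min() is not arbitrary: it returns
# "worker-a" on every call, which is why a quiet cluster keeps
# sending new tasks to the same worker.
chosen = min(workers, key=lambda w: w[1])
print(chosen[0])  # worker-a
```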

In the case where the occupancy is zero we might do something like a round-robin among the worker pool (as is already done just below for the case of more than 20 workers), for as long as the workers have zero occupancy.
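A hypothetical sketch of that suggestion (this is not the actual fix from #4638, and the name RoundRobinPicker is invented for illustration): cycle through the pool while all occupancies are zero, otherwise fall back to least-occupied as the scheduler does today.

```python
from typing import Dict


class RoundRobinPicker:
    """Sketch: round-robin among idle workers, else least-occupied."""

    def __init__(self) -> None:
        self._n_tasks = 0  # mirrors the scheduler's running task counter

    def pick(self, occupancy: Dict[str, float]) -> str:
        workers = list(occupancy)
        if all(v == 0 for v in occupancy.values()):
            # All workers are equally idle: rotate through them so that
            # consecutive tasks land on different workers.
            choice = workers[self._n_tasks % len(workers)]
        else:
            # Some worker is busy: fall back to minimum occupancy.
            choice = min(workers, key=occupancy.get)
        self._n_tasks += 1
        return choice


picker = RoundRobinPicker()
print([picker.pick({"a": 0.0, "b": 0.0}) for _ in range(4)])  # ['a', 'b', 'a', 'b']
```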

Reported in conversation by @crusaderky

@douglasdavis (Member)
Tinkering with this. The test indeed fails ;) but perhaps not for the reason you were trying to trigger? I'm seeing that transition_log is not an attribute of distributed.worker.Worker. Searching through the repo, I only see instances of s.transition_log scattered in a handful of tests.

Is your assert trying to test that both workers a and b have some evidence of usage? I might be barking up the wrong tree.

@mrocklin (Member, Author)
> Is your assert trying to test that both workers a and b have some evidence of usage?

Yes, this was exactly my intent. The issue was authored in haste, my apologies for the messiness here. Looking again at worker.py it looks like we might look at Worker.log instead.

@douglasdavis (Member)
Thanks for the clarification!

douglasdavis added eight commits to douglasdavis/distributed referencing this issue between Mar 26 and Apr 2, 2021.