Cluster.scale is not robust to multiple calls #2257
FWIW I suspect that we'll want to just fully rewrite the
distributed/deploy/cluster.py system. Some of my thoughts are here:
#2235
On Tue, Sep 18, 2018 at 10:43 AM, Guillaume Eynard-Bontemps wrote:
As experienced in dask/dask-jobqueue#112 and a related PR, dask/dask-jobqueue#97,
Cluster.scale behavior is unstable if called multiple times in a row.
I suspect part of this problem is due to how asynchronism is used here (a toy sketch after this list illustrates the race):
- We retrieve the cluster's number of workers in a synchronous way (https://github.com/dask/distributed/blob/master/distributed/deploy/cluster.py#L100), but we launch scale_up asynchronously, so something could happen (here: another call to scale) between the state retrieval and the effective scale_up.
- Similarly, we get the workers to close synchronously, but stop them asynchronously.
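To make the race concrete, here is a minimal toy sketch of that read-then-act pattern (plain asyncio, not the actual distributed code; ToyCluster and _launch are made-up names). Two back-to-back scale() calls both read the same worker count before either launch has happened:

import asyncio


class ToyCluster:
    def __init__(self):
        self.workers = []

    async def _launch(self, count):
        await asyncio.sleep(0.01)          # pretend starting workers takes time
        self.workers += [object()] * count

    def scale(self, n):
        current = len(self.workers)        # state is read synchronously here...
        if n > current:
            # ...but the launch only happens later on the event loop, so a
            # second scale() call can run in between and see stale state.
            asyncio.create_task(self._launch(n - current))


async def main():
    cluster = ToyCluster()
    cluster.scale(2)
    cluster.scale(2)                 # both calls see len(workers) == 0
    await asyncio.sleep(0.1)
    print(len(cluster.workers))      # 4 workers started, although only 2 were requested


asyncio.run(main())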
If we want scale to run asynchronously, I propose to just add a _scale()
method here (a coroutine?) to be called in an async manner from scale().
In this _scale, we would get the state and perform the modifications at
the same time:
def _scale(self, n):
    # Retrieve the scheduler state and act on it within the same coroutine,
    # so that another call to scale() cannot slip in between the two.
    with log_errors():
        if n >= len(self.scheduler.workers):
            self.scale_up(n)
        else:
            to_close = self.scheduler.workers_to_close(
                n=len(self.scheduler.workers) - n)
            logger.debug("Closing workers: %s", to_close)
            self.scheduler.retire_workers(workers=to_close)
            self.scale_down(to_close)
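For illustration, here is the same toy cluster from the sketch above with the state read and the action moved into one coroutine and serialized with a lock. This is only a sketch of the pattern the proposed _scale() is after, not the distributed implementation (ToyCluster, _launch and the lock are assumptions):

import asyncio


class ToyCluster:
    def __init__(self):
        self.workers = []
        self._lock = asyncio.Lock()

    async def _launch(self, count):
        await asyncio.sleep(0.01)
        self.workers += [object()] * count

    async def _scale(self, n):
        # State retrieval and the scale-up/scale-down action happen inside
        # the same locked section, so concurrent calls cannot interleave.
        async with self._lock:
            current = len(self.workers)
            if n > current:
                await self._launch(n - current)
            elif n < current:
                del self.workers[n:]       # "close" the extra workers

    def scale(self, n):
        asyncio.create_task(self._scale(n))


async def main():
    cluster = ToyCluster()
    cluster.scale(2)
    cluster.scale(2)                 # the second call now sees the first one's effect
    await asyncio.sleep(0.1)
    print(len(cluster.workers))      # 2, as requested


asyncio.run(main())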
@jhamman @mrocklin any opinion or advice?
Closing this issue in favour of #2235.