Restart given worker(s) using client [help wanted] #1823

Closed
ameetshah1983 opened this issue Mar 8, 2018 · 12 comments · Fixed by #7154

@ameetshah1983

Currently, client.restart() restarts all workers and the entire cluster. Is there a way to stop, start, or restart a single worker, or a list of workers, using the client APIs?

This would allow cleanup of specific workers only, without affecting others where additional tasks may be running.

ameetshah1983 changed the title from "Shutdown given worker(s) using client" to "Shutdown given worker(s) using client [help wanted]" on Mar 8, 2018
ameetshah1983 changed the title from "Shutdown given worker(s) using client [help wanted]" to "Restart given worker(s) using client [help wanted]" on Mar 9, 2018
@mrocklin
Member

mrocklin commented Mar 9, 2018

No, there is not currently an easy way to do this.

@sheridp

sheridp commented Mar 15, 2018

Is it possible to restart the scheduler? One thing I noticed is that if you have published datasets and then call client.restart, the datasets are still published in the scheduler. If another client then asks for the result from one of those datasets, the scheduler does not reissue the work to a worker. It seems like client.restart should also restart the scheduler or at least unpublish all datasets.
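
A minimal illustrative workaround for the published-datasets problem, assuming a connected Client and that dropping every published dataset before a restart is acceptable (the scheduler address below is a placeholder):

from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# Unpublish every dataset so stale entries don't outlive client.restart()
for name in client.list_datasets():
    client.unpublish_dataset(name)
client.restart()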

@martindurant
Member

Unpublishing all datasets when performing a restart would be easy to implement, e.g., by adding a restart() method to PublishExtension.
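
A hypothetical sketch of that suggestion; PublishExtension is real, but this restart() method and the assumption that Scheduler.restart would call it are illustrations, not actual distributed code:

class PublishExtension:
    """Scheduler extension holding published datasets (simplified)."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.datasets = {}

    def restart(self):
        # Hypothetical hook: drop all published datasets so nothing
        # stale survives a scheduler/cluster restart.
        self.datasets.clear()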

@mrocklin
Member

I recommend raising the published datasets topic as a separate issue.

@ameetshah1983
Author

ameetshah1983 commented Apr 4, 2018

The "Add retire_workers" PR looks like it helps with shutting down workers without removing them completely, but is there a way to start those workers again using an API?
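
For reference, a usage sketch of that API, assuming a connected Client (the worker address is a placeholder):

# Gracefully retire one worker; its unique data is first copied elsewhere
client.retire_workers(workers=["tcp://10.0.0.5:40001"])

As the rest of the thread confirms, there is no matching client call to start a retired worker again.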

mrocklin reopened this Apr 4, 2018
@mrocklin
Member

mrocklin commented Apr 4, 2018

You're right. That PR doesn't solve your issue.

@VMois
Contributor

VMois commented Nov 30, 2018

Any ideas or suggestions about this issue? I want to use worker restarts to refresh a local package cache. Does it make sense to open a PR for this?

@wkerzendorf

I would like to restart the worker after every task runs, as there seems to be a memory leak between dask and my tasks (which does not occur when a task is run locally). Any ideas, or should I open a new PR?
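
As an aside (an assumption about later releases, not something stated in this thread): newer dask-worker versions can restart workers on a schedule via lifetime flags, which can paper over slow leaks:

# Restart each worker roughly every hour; flag names may vary by version
dask-worker tcp://scheduler:8786 --lifetime "1 hour" --lifetime-restart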

@nmatare

nmatare commented Jul 7, 2019

I came across this requirement as well:

I solved it with the snippet below. It assumes your dask-worker(s) are overseen by nanny managers. This will throw a CommClosedError, since the scheduler can no longer communicate with the worker, but otherwise the worker should reboot, however hacky.

import os

my_worker = 'name-or-ip-address-to-worker'  # placeholder worker name/address
client.run(lambda: os._exit(0), workers=[my_worker])
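
For what it's worth, this works because the nanny supervises the worker process and respawns it once the process exits; the CommClosedError is just the client's connection to the killed worker being dropped.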

@asford
Contributor

asford commented Jan 31, 2021

I'm investigating this as a potential solution to #391 (comment), in which we occasionally see unresponsive workers during network-heavy aggregation operations.

Looking through the existing restart code, it appears that something akin to "retire workers" could be implemented that first makes an attempt to replicate data from the worker that is about to be restarted, via:

# Keys orphaned by retiring those workers
keys = set.union(*[w.has_what for w in workers])
keys = {ts._key for ts in keys if ts._who_has.issubset(workers)}
other_workers = set(parent._workers_dv.values()) - workers
if keys:
    if other_workers:
        logger.info("Moving %d keys to other workers", len(keys))
        await self.replicate(
            keys=keys,
            workers=[ws._address for ws in other_workers],
            n=1,
            delete=False,
            lock=False,
        )
    else:
        return {}

and then issues a command to the nanny (if present) to restart the worker process:

https://github.com/dask/distributed/blob/1297b18ff09276f6ad1553d40ab3ce77acf0fc0e/distributed/scheduler.py#L4874-L4881

@mrocklin, would this make sense as an additional scheduler API?
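
A hypothetical shape for such a scheduler API; the method name, the helper, and the wiring are assumptions layered on the snippets above, not actual distributed code:

async def restart_worker(self, comm=None, address=None):
    ws = self.workers[address]
    # 1. Replicate keys held only by this worker to other workers,
    #    as in the snippet above (hypothetical helper).
    await self._replicate_orphaned_keys({ws})
    # 2. Ask the worker's nanny, if present, to restart the process.
    if ws.nanny:
        await self.rpc(ws.nanny).restart(close=True)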

@martindurant
Member

I have looked at retire_workers in the past, and I agree that it makes sense for the restart-a-particular-worker scenario. However, if the worker is unresponsive, what hope is there of being able to copy its data?

@ameetshah1983
Author

Tried the retire_workers API via cl.retire_workers: with close_workers=False nothing happens, and with the default settings the worker shuts down completely but doesn't come back up.

Is there any other way to restart a particular worker, perhaps via the nanny or some other method?
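
For reference, the change that eventually closed this issue (#7154) added a client API for exactly this. A usage sketch, assuming a distributed release recent enough to include Client.restart_workers and workers running under nannies (the address is a placeholder):

# Restart specific workers in place without touching the rest of the cluster
client.restart_workers(workers=["tcp://10.0.0.5:40001"])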
