Implement a cluster status method, to know if workers are really running #11
Comments
I think this would be useful when using an autoscaling cluster. Particularly when jobs are waiting in the queue and the adaptive cluster is trying to decide if it should scale up/down. |
I agree that understanding the status would be useful, especially for things like adaptive scheduling. |
Is the following approach a dependable way of querying the cluster for the number of workers?

from time import sleep  # needed for sleep() below

min_workers = 10  # for example
while len(cluster.workers) < min_workers:
    sleep(1)

The next step would be to associate these workers with the job ids from pbs/slurm/etc. |
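As a side note (not something raised in the thread itself): if a distributed Client is connected to the cluster, its wait_for_workers method can replace the hand-rolled polling loop. A minimal sketch, assuming the same cluster and min_workers as above:

# Minimal sketch: block until at least `min_workers` workers have joined,
# using distributed's Client instead of a manual sleep loop.
from dask.distributed import Client

client = Client(cluster)              # connect a client to the JobQueueCluster
client.wait_for_workers(min_workers)  # returns once that many workers are connected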
I believe that cluster.workers is only a convention at this point. It's managed by the cluster object and so represents workers that have been asked for, but not workers that have connected. For the latter you would want the following:

cluster.scheduler.workers |
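Under that distinction, the polling loop above would watch the scheduler rather than the cluster object. A minimal sketch, assuming the same cluster and min_workers as before:

from time import sleep

# Wait for workers that have actually connected to the Dask scheduler,
# not merely workers that have been requested from the job scheduler.
while len(cluster.scheduler.workers) < min_workers:
    sleep(1)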
Having cluster.workers be a dictionary mapping job id to a status in |
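A purely hypothetical illustration of the kind of mapping that suggestion points at (this is not an existing dask-jobqueue attribute; the job ids and states below are only for illustration):

# Hypothetical only: a mapping from the queueing system's job id to a coarse
# job state, so queued vs. running jobs are visible from the notebook.
status_by_job = {
    "5543002": "running",
    "5543014": "pending",
}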
I am using dask-jobqueue today for the first time. I find myself wishing I could check the status of the worker jobs from the notebook (rather than flipping back to the shell and running |
This appears to have been addressed:

In [15]: import dask_jobqueue
In [16]: cluster = dask_jobqueue.SLURMCluster(cores=1, processes=1)
In [17]: cluster.worker_spec
Out[17]: {}
In [18]: cluster.workers
Out[18]: {}
In [19]: cluster.scale(2)
In [20]: cluster.workers
Out[20]: {0: <SLURMJob: status=running>, 1: <SLURMJob: status=running>}

However, it appears that the reported information can sometimes be misleading. For instance, I am using a system which restricts the number of concurrent jobs to 35. When I submit 40 jobs:
In [22]: cluster.scale(40)
In [23]: cluster.workers
Out[23]:
{0: <SLURMJob: status=running>,
1: <SLURMJob: status=running>,
2: <SLURMJob: status=running>,
3: <SLURMJob: status=running>,
4: <SLURMJob: status=running>,
5: <SLURMJob: status=running>,
6: <SLURMJob: status=running>,
7: <SLURMJob: status=running>,
8: <SLURMJob: status=running>,
9: <SLURMJob: status=running>,
10: <SLURMJob: status=running>,
11: <SLURMJob: status=running>,
12: <SLURMJob: status=running>,
13: <SLURMJob: status=running>,
14: <SLURMJob: status=running>,
15: <SLURMJob: status=running>,
16: <SLURMJob: status=running>,
17: <SLURMJob: status=running>,
18: <SLURMJob: status=running>,
19: <SLURMJob: status=running>,
20: <SLURMJob: status=running>,
21: <SLURMJob: status=running>,
22: <SLURMJob: status=running>,
23: <SLURMJob: status=running>,
24: <SLURMJob: status=running>,
25: <SLURMJob: status=running>,
26: <SLURMJob: status=running>,
27: <SLURMJob: status=running>,
28: <SLURMJob: status=running>,
29: <SLURMJob: status=running>,
30: <SLURMJob: status=running>,
31: <SLURMJob: status=running>,
32: <SLURMJob: status=running>,
33: <SLURMJob: status=running>,
34: <SLURMJob: status=running>,
35: <SLURMJob: status=running>,
36: <SLURMJob: status=running>,
37: <SLURMJob: status=running>,
38: <SLURMJob: status=running>,
39: <SLURMJob: status=running>}
In [24]: !squeue -u abanihi
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5543014 dav dask-wor abanihi PD 0:00 1 (QOSMaxJobsPerUserLimit)
5543015 dav dask-wor abanihi PD 0:00 1 (QOSMaxJobsPerUserLimit)
5543016 dav dask-wor abanihi PD 0:00 1 (QOSMaxJobsPerUserLimit)
5543017 dav dask-wor abanihi PD 0:00 1 (QOSMaxJobsPerUserLimit)
5543002 dav dask-wor abanihi R 0:31 1 casper17
5543003 dav dask-wor abanihi R 0:31 1 casper17
5543004 dav dask-wor abanihi R 0:31 1 casper17
5543005 dav dask-wor abanihi R 0:31 1 casper17
5543006 dav dask-wor abanihi R 0:31 1 casper17
5543007 dav dask-wor abanihi R 0:31 1 casper17
5543008 dav dask-wor abanihi R 0:31 1 casper17
5543009 dav dask-wor abanihi R 0:31 1 casper17
5543010 dav dask-wor abanihi R 0:31 1 casper17
5543011 dav dask-wor abanihi R 0:31 1 casper17
5543012 dav dask-wor abanihi R 0:31 1 casper17
5543013 dav dask-wor abanihi R 0:31 1 casper17
5542985 dav dask-wor abanihi R 0:34 1 casper12
5542986 dav dask-wor abanihi R 0:34 1 casper15
5542987 dav dask-wor abanihi R 0:34 1 casper15
5542988 dav dask-wor abanihi R 0:34 1 casper15
5542989 dav dask-wor abanihi R 0:34 1 casper22
5542990 dav dask-wor abanihi R 0:34 1 casper22
5542991 dav dask-wor abanihi R 0:34 1 casper22
5542992 dav dask-wor abanihi R 0:34 1 casper22
5542993 dav dask-wor abanihi R 0:34 1 casper22
5542994 dav dask-wor abanihi R 0:34 1 casper22
5542995 dav dask-wor abanihi R 0:34 1 casper22
5542996 dav dask-wor abanihi R 0:34 1 casper22
5542997 dav dask-wor abanihi R 0:34 1 casper22
5542998 dav dask-wor abanihi R 0:34 1 casper22
5542999 dav dask-wor abanihi R 0:34 1 casper22
5543000 dav dask-wor abanihi R 0:34 1 casper17
5543001 dav dask-wor abanihi R 0:34 1 casper17
5542980 dav dask-wor abanihi R 0:35 1 casper10
5542981 dav dask-wor abanihi R 0:35 1 casper19
5542982 dav dask-wor abanihi R 0:35 1 casper19
5542983 dav dask-wor abanihi R 0:35 1 casper12
5542984 dav dask-wor abanihi R 0:35 1 casper12
5542978 dav dask-wor abanihi R 1:08 1 casper10
5542979 dav dask-wor abanihi R 1:08 1 casper10 |
Never mind ;) I hadn't seen @mrocklin's comment:

In [26]: len(cluster.scheduler.workers)
Out[26]: 36 |
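For reference, the two numbers shown in this session can be combined to estimate how many submitted jobs have not yet produced a connected worker. A small sketch based on the session above:

# cluster.workers reflects jobs that have been submitted, while
# cluster.scheduler.workers only contains workers that have connected.
n_requested = len(cluster.workers)            # 40 in the session above
n_connected = len(cluster.scheduler.workers)  # 36 in the session above
n_waiting = n_requested - n_connected         # jobs still queued or starting
print(f"{n_connected}/{n_requested} workers connected, {n_waiting} still waiting")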
I am curious... What needs to be done for this issue to be considered "fixed"? I'd be happy to work on missing functionality. |
There is now a convention around plan/requested/observed in the Cluster class that should be what you are looking for, I think. |
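If I understand that convention correctly (it comes from distributed's SpecCluster, which JobQueueCluster builds on in recent versions), it can be inspected roughly like this. A sketch, not verified against a live cluster:

# Sketch: the plan/requested/observed convention on the cluster object.
# plan      -> workers we want (the target of the last scale/adapt decision)
# requested -> workers we have asked the job scheduler for
# observed  -> workers that have actually connected to the Dask scheduler
print(len(cluster.plan), len(cluster.requested), len(cluster.observed))

# Jobs that were submitted but are still queued or starting up show up as the
# difference between requested and observed.
waiting = set(cluster.requested) - set(cluster.observed)
print(waiting)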
Currently, we only know that we have submitted some jobs to a cluster scheduler. We don't know whether these jobs are running, queued, or in some other state.
What do you think of implementing a kind of status method?
In the PBS case, for example, it would issue a qstat call and get the PBS scheduler status of every job handled by the Dask cluster.
I am not sure this is really needed, as the Dask Client already lets us know the real size of the cluster (and there may be other means too).
Maybe this issue is just about documenting how to retrieve information about the cluster state, either through the job scheduler API (e.g. PBS) or through the Dask API.
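For the PBS case described here, such a status method could be sketched along these lines. This is a hypothetical helper, not an existing dask-jobqueue API, and it assumes each Job object exposes its scheduler job id as job.job_id:

import subprocess

def job_statuses(cluster):
    # Hypothetical helper: ask PBS (via qstat -f) for the state of every job
    # the cluster has submitted, and return {job_id: state}, e.g. {"12345": "R"}.
    statuses = {}
    for job in cluster.workers.values():       # dask-jobqueue Job objects
        job_id = job.job_id                    # assumption: Job exposes its job id
        out = subprocess.run(
            ["qstat", "-f", str(job_id)], capture_output=True, text=True
        ).stdout
        for line in out.splitlines():
            if "job_state" in line:
                statuses[job_id] = line.split("=")[-1].strip()
    return statuses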