Implement a cluster status method, to know if workers are really running #11
Comments
I think this would be useful when using an autoscaling cluster. Particularly when jobs are waiting in the queue and the adaptive cluster is trying to decide if it should scale up/down. |
I agree that understanding the status would be useful, especially for things like adaptive scheduling. |
Is the following approach a dependable way of querying the cluster for the number of workers?

from time import sleep  # needed for sleep() below

min_workers = 10  # for example
while len(cluster.workers) < min_workers:
    sleep(1)

The next step would be to associate these workers with the job ids from pbs/slurm/etc. |
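As a side note (not something raised in the thread itself): if a distributed Client is connected to the cluster, its wait_for_workers method can replace the hand-rolled polling loop. A minimal sketch, assuming the same cluster and min_workers as above:

# Minimal sketch: block until at least `min_workers` workers have joined,
# using distributed's Client instead of a manual sleep loop.
from dask.distributed import Client

client = Client(cluster)              # connect a client to the JobQueueCluster
client.wait_for_workers(min_workers)  # returns once that many workers are connected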
I believe that cluster.workers is only a convention at this point. It's managed by the cluster object and so represents workers that have been asked for, but not workers that have connected. For the latter you would want the following:

cluster.scheduler.workers |
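Under that distinction, the polling loop above would watch the scheduler rather than the cluster object. A minimal sketch, assuming the same cluster and min_workers as before:

from time import sleep

# Wait for workers that have actually connected to the Dask scheduler,
# not merely workers that have been requested from the job scheduler.
while len(cluster.scheduler.workers) < min_workers:
    sleep(1)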
Having cluster.workers be a dictionary mapping job id to a status in |
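A purely hypothetical illustration of the kind of mapping that suggestion points at (this is not an existing dask-jobqueue attribute; the job ids and states below are only for illustration):

# Hypothetical only: a mapping from the queueing system's job id to a coarse
# job state, so queued vs. running jobs are visible from the notebook.
status_by_job = {
    "5543002": "running",
    "5543014": "pending",
}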
I am using dask-jobqueue today for the first time. I find myself wishing I could check the status of the worker jobs from the notebook (rather than flipping back to the shell and running |
This appears to have been addressed:

In [15]: import dask_jobqueue
In [16]: cluster = dask_jobqueue.SLURMCluster(cores=1, processes=1)
In [17]: cluster.worker_spec
Out[17]: {}
In [18]: cluster.workers
Out[18]: {}
In [19]: cluster.scale(2)
In [20]: cluster.workers
Out[20]: {0: <SLURMJob: status=running>, 1: <SLURMJob: status=running>}

However, it appears that the reported information can sometimes be misleading. For instance, I am using a system which restricts the number of concurrent jobs to 35. When I submit 40 jobs:
In [22]: cluster.scale(40)
In [23]: cluster.workers
Out[23]:
{0: <SLURMJob: status=running>,
1: <SLURMJob: status=running>,
2: <SLURMJob: status=running>,
3: <SLURMJob: status=running>,
4: <SLURMJob: status=running>,
5: <SLURMJob: status=running>,
6: <SLURMJob: status=running>,
7: <SLURMJob: status=running>,
8: <SLURMJob: status=running>,
9: <SLURMJob: status=running>,
10: <SLURMJob: status=running>,
11: <SLURMJob: status=running>,
12: <SLURMJob: status=running>,
13: <SLURMJob: status=running>,
14: <SLURMJob: status=running>,
15: <SLURMJob: status=running>,
16: <SLURMJob: status=running>,
17: <SLURMJob: status=running>,
18: <SLURMJob: status=running>,
19: <SLURMJob: status=running>,
20: <SLURMJob: status=running>,
21: <SLURMJob: status=running>,
22: <SLURMJob: status=running>,
23: <SLURMJob: status=running>,
24: <SLURMJob: status=running>,
25: <SLURMJob: status=running>,
26: <SLURMJob: status=running>,
27: <SLURMJob: status=running>,
28: <SLURMJob: status=running>,
29: <SLURMJob: status=running>,
30: <SLURMJob: status=running>,
31: <SLURMJob: status=running>,
32: <SLURMJob: status=running>,
33: <SLURMJob: status=running>,
34: <SLURMJob: status=running>,
35: <SLURMJob: status=running>,
36: <SLURMJob: status=running>,
37: <SLURMJob: status=running>,
38: <SLURMJob: status=running>,
39: <SLURMJob: status=running>}
In [24]: !squeue -u abanihi
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5543014 dav dask-wor abanihi PD 0:00 1 (QOSMaxJobsPerUserLimit)
5543015 dav dask-wor abanihi PD 0:00 1 (QOSMaxJobsPerUserLimit)
5543016 dav dask-wor abanihi PD 0:00 1 (QOSMaxJobsPerUserLimit)
5543017 dav dask-wor abanihi PD 0:00 1 (QOSMaxJobsPerUserLimit)
5543002 dav dask-wor abanihi R 0:31 1 casper17
5543003 dav dask-wor abanihi R 0:31 1 casper17
5543004 dav dask-wor abanihi R 0:31 1 casper17
5543005 dav dask-wor abanihi R 0:31 1 casper17
5543006 dav dask-wor abanihi R 0:31 1 casper17
5543007 dav dask-wor abanihi R 0:31 1 casper17
5543008 dav dask-wor abanihi R 0:31 1 casper17
5543009 dav dask-wor abanihi R 0:31 1 casper17
5543010 dav dask-wor abanihi R 0:31 1 casper17
5543011 dav dask-wor abanihi R 0:31 1 casper17
5543012 dav dask-wor abanihi R 0:31 1 casper17
5543013 dav dask-wor abanihi R 0:31 1 casper17
5542985 dav dask-wor abanihi R 0:34 1 casper12
5542986 dav dask-wor abanihi R 0:34 1 casper15
5542987 dav dask-wor abanihi R 0:34 1 casper15
5542988 dav dask-wor abanihi R 0:34 1 casper15
5542989 dav dask-wor abanihi R 0:34 1 casper22
5542990 dav dask-wor abanihi R 0:34 1 casper22
5542991 dav dask-wor abanihi R 0:34 1 casper22
5542992 dav dask-wor abanihi R 0:34 1 casper22
5542993 dav dask-wor abanihi R 0:34 1 casper22
5542994 dav dask-wor abanihi R 0:34 1 casper22
5542995 dav dask-wor abanihi R 0:34 1 casper22
5542996 dav dask-wor abanihi R 0:34 1 casper22
5542997 dav dask-wor abanihi R 0:34 1 casper22
5542998 dav dask-wor abanihi R 0:34 1 casper22
5542999 dav dask-wor abanihi R 0:34 1 casper22
5543000 dav dask-wor abanihi R 0:34 1 casper17
5543001 dav dask-wor abanihi R 0:34 1 casper17
5542980 dav dask-wor abanihi R 0:35 1 casper10
5542981 dav dask-wor abanihi R 0:35 1 casper19
5542982 dav dask-wor abanihi R 0:35 1 casper19
5542983 dav dask-wor abanihi R 0:35 1 casper12
5542984 dav dask-wor abanihi R 0:35 1 casper12
5542978 dav dask-wor abanihi R 1:08 1 casper10
5542979 dav dask-wor abanihi R 1:08 1 casper10 |
Never mind ;) I hadn't seen @mrocklin's comment:

In [26]: len(cluster.scheduler.workers)
Out[26]: 36 |
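For reference, the two numbers shown in this session can be combined to estimate how many submitted jobs have not yet produced a connected worker. A small sketch based on the session above:

# cluster.workers reflects jobs that have been submitted, while
# cluster.scheduler.workers only contains workers that have connected.
n_requested = len(cluster.workers)            # 40 in the session above
n_connected = len(cluster.scheduler.workers)  # 36 in the session above
n_waiting = n_requested - n_connected         # jobs still queued or starting
print(f"{n_connected}/{n_requested} workers connected, {n_waiting} still waiting")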
I am curious... What needs to be done for this issue to be considered "fixed"? I'd be happy to work on missing functionality. |
There is now a convention around plan/requested/observed in the Cluster class that should be what you are looking for, I think. |
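If I understand that convention correctly (it comes from distributed's SpecCluster, which JobQueueCluster builds on in recent versions), it can be inspected roughly like this. A sketch, not verified against a live cluster:

# Sketch: the plan/requested/observed convention on the cluster object.
# plan      -> workers we want (the target of the last scale/adapt decision)
# requested -> workers we have asked the job scheduler for
# observed  -> workers that have actually connected to the Dask scheduler
print(len(cluster.plan), len(cluster.requested), len(cluster.observed))

# Jobs that were submitted but are still queued or starting up show up as the
# difference between requested and observed.
waiting = set(cluster.requested) - set(cluster.observed)
print(waiting)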
Currently, we only know that we have submitted some jobs to a cluster scheduler. We don't know whether these jobs are running, queued, or in some other state.
What do you think of implementing a kind of status method?
In the PBS case, for example, it would issue a qstat call and get the PBS scheduler status of every job handled by the Dask cluster.
I am not sure this is really needed, as the Dask Client already lets us know the real size of the cluster (and there may be other means too).
Maybe this issue is just about documenting how to retrieve information about the cluster state, either through the job scheduler API (e.g. PBS) or through the Dask API.
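For the PBS case described here, such a status method could be sketched along these lines. This is a hypothetical helper, not an existing dask-jobqueue API, and it assumes each Job object exposes its scheduler job id as job.job_id:

import subprocess

def job_statuses(cluster):
    # Hypothetical helper: ask PBS (via qstat -f) for the state of every job
    # the cluster has submitted, and return {job_id: state}, e.g. {"12345": "R"}.
    statuses = {}
    for job in cluster.workers.values():       # dask-jobqueue Job objects
        job_id = job.job_id                    # assumption: Job exposes its job id
        out = subprocess.run(
            ["qstat", "-f", str(job_id)], capture_output=True, text=True
        ).stdout
        for line in out.splitlines():
            if "job_state" in line:
                statuses[job_id] = line.split("=")[-1].strip()
    return statuses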