
Resource specification on GridEngine Clusters #195

Closed
ericmjl opened this issue Nov 14, 2018 · 9 comments
Labels
usage question Question about using jobqueue

Comments

@ericmjl
Contributor

ericmjl commented Nov 14, 2018

Hi Dask team!

My colleague @sntgluca and I have been very enthusiastic about the possibilities enabled by dask-jobqueue at NIBR! It's been a very productivity-enhancing tool for me. At the same time, we found something that we think might be a bug, but we would like to confirm or rule it out before potentially working on a PR to fix it.

In short, we found that when using the memory keyword argument, SGECluster reports that X GB is allocated per worker node. However, according to the queueing status screen, the amount of RAM actually granted is only the default amount specified by the sysadmins.

Here is some evidence that I collected from the logs on our machines.

First, Dask's worker logs show that 8 GB is allocated to them:

# Ignore this line, it is here to show the job ID.
/path/to/job_scripts/16524465: line 10: /path/to/activate: Permission denied
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    8.00 GB

However, for the same job ID, qstat -j JOBID shows:

==============================================================
job_number:                 16524465
...
sge_o_path:                 hard resource_list:         m_mem_free=4G,h_rt=259200,slot_limitA=1
...
granted_req.          1:    m_mem_free=4.000G, slot_limitA=1

As you can see, the resources granted were only 4GB of memory, not the 8GB requested, but the Dask worker logs show 8GB being allocated.

I have a hunch that this is a bug. Both @sntgluca and I feel that our end users shouldn't have to worry about GridEngine resource spec strings, and should be able to use the very nice SGECluster API to set these parameters correctly. Looking at the SGECluster source code, this looks doable with a small-ish PR that parses the memory (and other) kwargs into the correct resource specification string, if that is something you would be open to.
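
To sketch what I mean (hypothetical names only, and m_mem_free is just an assumption about how our site requests memory, not something dask-jobqueue does today):

from dask.utils import parse_bytes

# Hypothetical helper: turn the human-readable memory kwarg into an SGE
# resource request string. The m_mem_free resource name is site-specific.
def memory_to_resource_spec(memory):
    """e.g. '8GB' -> 'm_mem_free=8G', emitted in the job script as '#$ -l m_mem_free=8G'."""
    gigabytes = int(parse_bytes(memory) / 1e9)
    return "m_mem_free=%dG" % gigabytes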

Please let us know!

@jhamman
Member

jhamman commented Nov 14, 2018

Thanks for the report @ericmjl. Can you show us the job script that jobqueue is submitting? (http://jobqueue.dask.org/en/latest/debug.html#checking-job-script)

Also, showing us how you're calling SGECluster, along with your Dask configuration, would be useful for debugging.
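
For reference, the debugging page linked above suggests printing the generated script; a minimal sketch (the constructor arguments here are just placeholders):

from dask_jobqueue import SGECluster

cluster = SGECluster(queue='default.q', cores=1, memory='8GB')
# Print the generated submission script to see exactly which #$ directives
# (and which resource requests, if any) will be passed to qsub.
print(cluster.job_script())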

@ericmjl
Contributor Author

ericmjl commented Nov 15, 2018

Thanks @jhamman! Here is the job script:

#!/bin/bash

#!/usr/bin/env bash
#$ -N dask-worker
#$ -q default.q
#$ -l h_rt=259200
#$ -cwd
#$ -j y

activate mpnn

/path/to/anaconda/python -m distributed.cli.dask_worker tcp://10.145.61.208:39018 --nthreads 1 --memory-limit 8.00GB --name dask-worker--${JOB_ID}-- --death-timeout 60

And this is how I call the SGECluster:

cluster = SGECluster(queue='default.q',
                     walltime="259200",
                     processes=1,
                     memory='8GB',
                     cores=1,
                     env_extra=['activate mpnn'])

It's in line with how I first figured out how to make this work (without knowing about resource specs), and hence it's identical to the example in the docs, which I PR-ed in. However, when I later inspected the source code, it looked like at least the memory kwarg was not being translated into a scheduler-side resource request.
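
A possible workaround in the meantime, assuming our site's m_mem_free resource is the right one to request (that part is an assumption), would be to pass the request explicitly via resource_spec:

cluster = SGECluster(queue='default.q',
                     walltime='259200',
                     processes=1,
                     memory='8GB',
                     cores=1,
                     resource_spec='m_mem_free=8G',  # site-specific resource name; assumption
                     env_extra=['activate mpnn'])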

As for a Dask configuration, I don't have any config files at the moment.

@guillaumeeb
Member

Indeed, it looks like if the user doesn't specify resource_spec via kwargs or config files, nothing is set in the SGECluster implementation.

Something similar to what you propose is done in PBSCluster:
https://github.com/dask/dask-jobqueue/blob/master/dask_jobqueue/pbs.py#L84-L89

Would you be interested in submitting a PR that does the same for SGE? It would be very welcome!
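
Roughly, the SGE analogue might look like this (a sketch only, not actual dask-jobqueue code; the m_mem_free resource name is an assumption and varies between sites):

# Default resource_spec from the memory kwarg when nothing was given,
# mirroring the PBSCluster logic linked above.
memory = "8GB"          # value of the memory kwarg
resource_spec = None    # nothing passed explicitly, as in the report above

if resource_spec is None and memory is not None:
    # '8GB' -> 'm_mem_free=8G'; the job template would then emit '#$ -l m_mem_free=8G'
    resource_spec = "m_mem_free=%s" % memory.replace("GB", "G")

print(resource_spec)  # m_mem_free=8G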

@ericmjl
Contributor Author

ericmjl commented Nov 16, 2018

Yes, definitely, @guillaumeeb! Happy to tackle this later in the day.

jhamman added the usage question label Nov 20, 2018
@ericmjl
Contributor Author

ericmjl commented Jan 15, 2019

@jhamman @guillaumeeb having had a few more chances to use SGECluster, I have seen a few more scenarios where I think some PRs may be necessary. I am planning to close #197 and open a new discussion for this.

@guillaumeeb
Member

So are you still thinking of a PR related to this issue about resource specification? Should this issue stay open?

@ericmjl
Contributor Author

ericmjl commented Jan 16, 2019

@guillaumeeb given the flexibility of GridEngine systems (which I read as "quite complicated"), I think that for now the easiest PR is the documentation PR #220. This is one of those few scenarios where I think more talking is needed up front. Let's keep exploring with @lesteve what the best way forward is for GE-like clusters.

@guillaumeeb
Member

@ericmjl I feel that this issue can be closed thanks to #220, since there is no automatic solution. Do we agree?

@ericmjl
Contributor Author

ericmjl commented May 13, 2019

Yes, 100%!

ericmjl closed this as completed May 13, 2019