
Resource specification on GridEngine Clusters #195

Closed
ericmjl opened this issue Nov 14, 2018 · 9 comments
Labels
usage question Question about using jobqueue

Comments

@ericmjl
Contributor

ericmjl commented Nov 14, 2018

Hi Dask team!

My colleague @sntgluca and I have been very enthusiastic about the possibilities enabled by dask-jobqueue at NIBR! It's been a very productivity-enhancing tool for me. At the same time, we found something that we think might be a bug, but we would like to confirm or rule it out before potentially working on a PR to fix it.

In short, we found that when using the memory keyword argument, SGECluster reports that X GB is allocated per worker node. However, according to the queueing status screen, the amount of RAM actually granted is only the default amount specified by the sysadmins.

Here is some evidence that I collected from the logs on our machines.

First, Dask's worker logs show that 8 GB is allocated to them:

# Ignore this line, it is here to show the job ID.
/path/to/job_scripts/16524465: line 10: /path/to/activate: Permission denied
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    8.00 GB

However, for the same job ID, qstat -j JOBID shows:

==============================================================
job_number:                 16524465
...
sge_o_path:                 hard resource_list:         m_mem_free=4G,h_rt=259200,slot_limitA=1
...
granted_req.          1:    m_mem_free=4.000G, slot_limitA=1

As you can see, the resources granted were only 4GB of memory, not the 8GB requested, but the Dask worker logs show 8GB being allocated.

I have a hunch that this is a bug. Both @sntgluca and I feel that our end users shouldn't have to worry about GridEngine resource spec strings, and should be able to use the very nice SGECluster API to set these parameters correctly. Looking at the SGECluster source code, this looks doable with a small-ish PR that parses the memory (and other) kwargs into the correct resource specification string, if that is something you would be open to.
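
To sketch what I mean (hypothetical names only, and m_mem_free is just an assumption about how our site requests memory, not something dask-jobqueue does today):

from dask.utils import parse_bytes

# Hypothetical helper: turn the human-readable memory kwarg into an SGE
# resource request string. The m_mem_free resource name is site-specific.
def memory_to_resource_spec(memory):
    """e.g. '8GB' -> 'm_mem_free=8G', emitted in the job script as '#$ -l m_mem_free=8G'."""
    gigabytes = int(parse_bytes(memory) / 1e9)
    return "m_mem_free=%dG" % gigabytes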

Please let us know!

@jhamman
Member

jhamman commented Nov 14, 2018

Thanks for the report @ericmjl. Can you show us the job script that jobqueue is submitting? (http://jobqueue.dask.org/en/latest/debug.html#checking-job-script)

Also, showing us how you're calling SGECluster, along with your Dask configuration, would be useful for debugging.
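
For reference, the debugging page linked above suggests printing the generated script; a minimal sketch (the constructor arguments here are just placeholders):

from dask_jobqueue import SGECluster

cluster = SGECluster(queue='default.q', cores=1, memory='8GB')
# Print the generated submission script to see exactly which #$ directives
# (and which resource requests, if any) will be passed to qsub.
print(cluster.job_script())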

@ericmjl
Contributor Author

ericmjl commented Nov 15, 2018

Thanks @jhamman! Here is the job script:

#!/bin/bash

#!/usr/bin/env bash
#$ -N dask-worker
#$ -q default.q
#$ -l h_rt=259200
#$ -cwd
#$ -j y

activate mpnn

/path/to/anaconda/python -m distributed.cli.dask_worker tcp://10.145.61.208:39018 --nthreads 1 --memory-limit 8.00GB --name dask-worker--${JOB_ID}-- --death-timeout 60

And this is how I call the SGECluster:

cluster = SGECluster(queue='default.q',
                     walltime="259200",
                     processes=1,
                     memory='8GB',
                     cores=1,
                     env_extra=['activate mpnn'])

It's in line with how I first figured out how to make this work (without knowing about resource specs), and hence it's identical to the example in the docs, which I PR-ed in. However, when I later inspected the source code, it looked like at least the memory kwarg was not being translated into a scheduler-side resource request.
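
A possible workaround in the meantime, assuming our site's m_mem_free resource is the right one to request (that part is an assumption), would be to pass the request explicitly via resource_spec:

cluster = SGECluster(queue='default.q',
                     walltime='259200',
                     processes=1,
                     memory='8GB',
                     cores=1,
                     resource_spec='m_mem_free=8G',  # site-specific resource name; assumption
                     env_extra=['activate mpnn'])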

As for a Dask configuration, I don't have any config files at the moment.

@guillaumeeb
Member

Indeed, it looks like if the user doesn't specify resource_spec via kwargs or config files, nothing is set in the SGECluster implementation.

Something similar to what you propose is done in PBSCluster:
https://github.com/dask/dask-jobqueue/blob/master/dask_jobqueue/pbs.py#L84-L89

Would you be interested in submitting a PR that does the same for SGE? It would be very welcome!
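
Roughly, the SGE analogue might look like this (a sketch only, not actual dask-jobqueue code; the m_mem_free resource name is an assumption and varies between sites):

# Default resource_spec from the memory kwarg when nothing was given,
# mirroring the PBSCluster logic linked above.
memory = "8GB"          # value of the memory kwarg
resource_spec = None    # nothing passed explicitly, as in the report above

if resource_spec is None and memory is not None:
    # '8GB' -> 'm_mem_free=8G'; the job template would then emit '#$ -l m_mem_free=8G'
    resource_spec = "m_mem_free=%s" % memory.replace("GB", "G")

print(resource_spec)  # m_mem_free=8G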

@ericmjl
Contributor Author

ericmjl commented Nov 16, 2018

Yes, definitely, @guillaumeeb! Happy to tackle this later in the day.

jhamman added the usage question label Nov 20, 2018
@ericmjl
Contributor Author

ericmjl commented Jan 15, 2019

@jhamman @guillaumeeb having had a few more chances to use SGECluster, I have seen a few more scenarios where I think some PRs may be necessary. I am planning to close #197 and open a new discussion for this.

@guillaumeeb
Member

So are you still thinking of a PR related to this issue about resource specification? Should this issue stay open?

@ericmjl
Contributor Author

ericmjl commented Jan 16, 2019

@guillaumeeb given the flexibility of GridEngine systems (which I read as "quite complicated"), I think that for now the easiest PR is the documentation PR #220. This is one of those few scenarios where I think more talking is needed up front. Let's keep exploring with @lesteve what the best way forward is for GE-like clusters.

@guillaumeeb
Member

@ericmjl I feel that this issue can be closed thanks to #220, since there is no automatic solution. Do we agree?

@ericmjl
Contributor Author

ericmjl commented May 13, 2019

Yes, 100%!

ericmjl closed this as completed May 13, 2019