Add lsf #78
Conversation
dask_jobqueue/jobqueue.yaml (Outdated)

```yaml
extra: ""
env-extra: []
job-cpu: null
job-mem: null
```
I wonder, can we use the `memory` and `threads * processes` entries to remove the need for these entries? This is probably also a question for the other dask-jobqueue maintainers, as I know that these appear in other configurations.
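As a purely hypothetical sketch of what that could look like (the keys and defaults here are illustrative, not the actual config schema), the LSF-specific entries would be derived from the generic ones:

```yaml
# hypothetical: reuse the generic entries instead of LSF-specific ones
jobqueue:
  lsf:
    processes: 2    # Python processes per job
    threads: 4      # threads per process; ncpus could default to threads * processes
    memory: 16GB    # could also fill LSF's memory request, instead of a separate job-mem
```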
It would be useful to have a basic test, even if it only performs sanity checks on the header. Here is an example for SLURM: dask_jobqueue/tests/test_slurm.py, lines 11 to 21 at 4cd24dc.
Thanks for working on this!
Could you please share how this should be tested? In particular, it would be good to have a short script that starts up Dask on LSF and runs a simple computation with it.
I would expect the following to work for any job-queue cluster:

```python
from dask_jobqueue import LSFCluster
cluster = LSFCluster()

from dask.distributed import Client
client = Client(cluster)
assert client.submit(lambda x: x + 1, 10).result() == 11
```
Hitting a wall here. Running the cluster from Python gives an error. If I manually copy the generated job script to a submit script (submit.sh) and submit it by hand it works, but within Python I seem to be struggling to get any workers.
You probably want to use the following options to indicate the use of one node with several processes; see https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_admin/span_string.html or https://www.hpc.dtu.dk/?page_id=1401
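As a sketch (the exact flags and limits depend on the site's LSF configuration), a job header using the span string to keep all slots on one host might look like:

```shell
# hypothetical LSF header: request 4 slots, all on a single host
#BSUB -n 4
#BSUB -R "span[hosts=1]"
```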
Thanks for your work here, hoping my comments will help you finish this!
dask_jobqueue/lsf.py (Outdated)

```python
        logger.debug("Job script: \n %s" % self.job_script())

    def _job_id_from_submit_output(self, out):
        return out.split('.')[0].strip()
```
Here you need to parse the output string of the job submission, which seems to be something like:

```
Job is submitted to <cpp> project.
Job <16633929> is submitted to default queue <general>.
```

But you more likely want to find the correct option for just outputting the job ID after the bsub command. Unfortunately I did not find such an option after a few minutes of Google searching.
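For what it's worth, a regex-based sketch of the parsing approach (not the actual dask-jobqueue implementation) could look like this — note that the `<cpp>` project line contains no digits, so searching the combined output still finds the numeric job ID:

```python
import re

def job_id_from_submit_output(out):
    """Extract the numeric job ID from bsub's output, e.g.
    'Job <16633929> is submitted to default queue <general>.'"""
    match = re.search(r"<(\d+)>", out)
    if match is None:
        raise ValueError("Could not parse a job ID from: %r" % out)
    return match.group(1)

print(job_id_from_submit_output(
    "Job <16633929> is submitted to default queue <general>."))  # 16633929
```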
dask_jobqueue/lsf.py (Outdated)

```python
    """, 4)

    # Override class variables
    submit_command = 'bsub'
```
From what I understand from your last comment, the `submit_command` should be `bsub <` instead of just `bsub`.
@raybellwaves any response to this comment?
Thanks for the comments @guillaumeeb! I think LSF is unique compared to the others in terms of job submission. It now works when I set off workers manually, although it still just hangs without setting off any workers from Python.
Well my hack with |
See how this is done in ipyparallel: https://github.com/ipython/ipyparallel/blob/6.1.1/ipyparallel/apps/launcher.py#L1397. Maybe you could try to use a similar syntax. There is still the problem of… Maybe we should change the way core.py is written and have dedicated start and stop functions that could be easily overridden if needed. Or just add a…
Not sure what the best way is, but it looks like we need some special treatment for LSF indeed, since it takes the script on stdin and not the script name. Maybe something like this (I am guessing this is similar to your hack):

```python
# in JobQueueCluster
def submit_job(self, script_filename):
    return self._call(shlex.split(self.submit_command) + [script_filename])

def start_workers(self, n=1):
    ...
    with self.job_file() as fn:
        out = self.submit_job(fn)
    ...

# in LSFCluster
def submit_job(self, script_filename):
    # note popen_kwargs needs to be added to _call so we can pass shell=True
    return self._call(shlex.split(self.submit_command) + ['<', script_filename],
                      popen_kwargs={'shell': True})
```
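The reason `shell=True` matters here is that the `<` redirection is shell syntax, not something `bsub` (or any program) sees as an argument. A minimal standalone demonstration, using `cat` as a stand-in for `bsub` (assumes a POSIX shell):

```python
import os
import subprocess
import tempfile

# write a stand-in "job script"
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write("hello from stdin\n")
    path = f.name

# with shell=True the command is a single string and the shell interprets
# the '<' redirection, feeding the file's contents to the command's stdin
out = subprocess.check_output("cat < %s" % path, shell=True)
print(out.decode().strip())  # hello from stdin

os.remove(path)
```

Without `shell=True`, `['cat', '<', path]` would make `cat` try to open a literal file named `<`.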
Thanks for the discussion on this. I'll wait until a decision is made. Unfortunately, LSF seems to be the black sheep and it looks as though it will have to be handled specifically. I can work on the Docker files in the meantime. I've not used Docker before, so if anyone can point me to some resources for setting up LSF in Docker that would be appreciated (actually I'd prefer this to be a separate PR). Lastly, I'll make the changes to reflect the latest PR. Thanks @mrocklin for your work on that.
Something like what @lesteve proposes seems reasonable to me. @raybellwaves what do you think we should do?
Thanks for the suggestion @lesteve. I believe I've got it working for myself and hopefully it should still work for others. I'll have to test tomorrow, though, as the queue is jammed.
Great to hear, let us know when you think this is ready for review!
Thanks! Someone was asking me about LSF support just yesterday, so I know
that this will be a welcome change!
…On Wed, Jul 11, 2018 at 6:14 AM, Loïc Estève wrote:

> Great to hear, let us know when you think this is ready for review!
Fixing the tests
Looking good! I'll be happy to see this in. A couple small comments.
dask_jobqueue/core.py (Outdated)

```python
            self.shell = True
            return self._call(piped_cmd)
        else:
            return self._call(shlex.split(self.submit_command) + [script_filename])
```
Would it be possible to keep the core solution simple and instead put the LSF-specific implementation on the `LSFCluster` class?
Ah I see what @lesteve was suggesting
dask_jobqueue/lsf.py (Outdated)

```python
        if walltime is None:
            walltime = dask.config.get('jobqueue.%s.walltime' % self.scheduler_name)
        if job_extra is None:
            job_extra = dask.config.get('jobqueue.lsf.job-extra')
```
Should this also use the `self.scheduler_name` pattern as above?
Yep
@jhamman good spot. Everything is working again.
The |
Sorry for the delay @raybellwaves. Generally this looks good. I've highlighted a couple of points that I think we can improve somewhat easily. Let me know what you think.
dask_jobqueue/core.py (Outdated)

```python
@@ -280,13 +282,16 @@ def job_file(self):
            f.write(self.job_script())
            yield fn

    def submit_job(self, script_filename):
        return self._call(shlex.split(self.submit_command) + [script_filename])
```
This should probably be a private method so that users don't get the wrong idea that they should use it to submit jobs.
```yaml
env-extra: []
ncpus: null
mem: null
job-extra: []
```
Just checking in, does LSF need/use all of these options? I'm slightly concerned that as we copy configs from different systems we may accrue more than is necessary.
The only ones I haven't used are `extra` and `env-extra`. `extra` is for additional arguments to pass to dask-worker, so that is probably worth keeping. `env-extra` is other commands to add to the script before launching the worker. I can't see myself using this, but core.py checks for it (dask_jobqueue/core.py, line 109 at f7c565a: `env_extra : list`). A future PR could be to move that out of core.py and have users specify it in their individual classes. I'll let you decide that.
dask_jobqueue/lsf.py (Outdated)

```python
    """, 4)

    # Override class variables
    submit_command = 'bsub'
```
@raybellwaves any response to this comment?
dask_jobqueue/lsf.py (Outdated)

```python
        `Job is submitted to <PROJECT> project.` which is the stderr and
        `Job <JOBID> is submitted to (default) queue <QUEUE>.` which is the stdout.
        Suppress the stderr by redirecting it to nowhere.
        The `piped_cmd` looks like ['bsub < tmp.sh 2> /dev/null'] """
```
This docstring looks a bit wonky. It looks more like a developer comment than a documentation string for users. If it is supposed to be a docstring then you might want to follow the numpydoc standard, or take a look at the dask developer notes on docstrings: http://dask.pydata.org/en/latest/develop.html#docstrings
dask_jobqueue/lsf.py (Outdated)

```python
        `Job <JOBID> is submitted to (default) queue <QUEUE>.` which is the stdout.
        Suppress the stderr by redirecting it to nowhere.
        The `piped_cmd` looks like ['bsub < tmp.sh 2> /dev/null'] """
        self.popen_shell = True
```
It seems odd to set this here again. It was already set above on the class, right?
dask_jobqueue/lsf.py (Outdated)

```python
        Suppress the stderr by redirecting it to nowhere.
        The `piped_cmd` looks like ['bsub < tmp.sh 2> /dev/null'] """
        self.popen_shell = True
        piped_cmd = [self.submit_command+' < '+script_filename+' 2> /dev/null']
```
There are some minor style issues. I recommend running flake8 on the codebase.

```
mrocklin@carbon:~/workspace/dask-jobqueue$ flake8 dask_jobqueue
dask_jobqueue/lsf.py:5:1: F401 'os' imported but unused
dask_jobqueue/lsf.py:69:13: F841 local variable 'memory' is assigned to but never used
dask_jobqueue/lsf.py:124:41: E226 missing whitespace around arithmetic operator
dask_jobqueue/lsf.py:124:47: E226 missing whitespace around arithmetic operator
dask_jobqueue/lsf.py:124:63: E226 missing whitespace around arithmetic operator
```
I've opened #106 to discuss including this in the CI
dask_jobqueue/lsf.py (Outdated)

```python
def lsf_format_bytes_ceil(n):
    """ Format bytes as text
    LSF expects megabytes
```
@raybellwaves any response to this comment?
dask_jobqueue/lsf.py (Outdated)

```python
def lsf_format_bytes_ceil(n):
    """ Format bytes as text
    LSF expects megabytes
```
Also it would be good to format this docstring like http://dask.pydata.org/en/latest/develop.html#docstrings
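A sketch of what a numpydoc-style docstring for this helper could look like (the rounding logic here is illustrative, not the exact implementation in the PR):

```python
import math

def lsf_format_bytes_ceil(n):
    """Format bytes as text.

    LSF expects memory in megabytes.

    Parameters
    ----------
    n : int
        Number of bytes.

    Returns
    -------
    str
        Whole megabytes, rounded up.

    Examples
    --------
    >>> lsf_format_bytes_ceil(1234567890)
    '1178'
    """
    return '%d' % math.ceil(n / (1024 ** 2))
```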
dask_jobqueue/lsf.py (Outdated)

```python
    def stop_jobs(self, jobs):
        """ set `self.popen_shell = False` """
        self.popen_shell = False
```
Ah, I see, you're using the class state to sneak a parameter into the `._call` method. I think it's probably better to pass extra keyword arguments into the `._call` method directly and avoid the extra state. I'll write a comment on this in the core.py file.
dask_jobqueue/core.py (Outdated)

```python
@@ -322,6 +327,7 @@ def _calls(self, cmds):
        for cmd in cmds:
            logger.debug(' '.join(cmd))
            procs.append(subprocess.Popen(cmd,
                                          shell=self.popen_shell,
```
I recommend that we pass `**kwargs`, or at least `shell=`, from the `_call` method into this function call and avoid the state entirely.

```python
def _call(self, cmd, **kwargs):
    return self._calls([cmd], **kwargs)

def _calls(self, cmds, **kwargs):
    ...
    procs.append(subprocess.Popen(cmd,
                                  stdout=subprocess.PIPE,
                                  stderr=subprocess.PIPE,
                                  **kwargs))
```

Then we can call this like `self._call(cmd, shell=True)` and avoid mucking about with the `popen_shell` state.
Then let's drop them
…On Fri, Jul 27, 2018 at 11:23 AM, Ray Bell commented on dask_jobqueue/jobqueue.yaml (#78 (comment)):

> ```yaml
> memory: null           # Total amount of memory per job
> processes: 1           # Number of Python processes per job
>
> interface: null        # Network interface to use like eth0 or ib0
> death-timeout: 60      # Number of seconds to wait if a worker can not find a scheduler
> local-directory: null  # Location of fast local storage like /scratch or $TMPDIR
>
> # LSF resource manager options
> queue: null
> project: null
> walltime: '00:30'
> extra: ""
> env-extra: []
> ncpus: null
> mem: null
> job-extra: []
> ```
>
> Good question. I'm not sure what extra and env-extra are for. I haven't been using them in my development/testing.
Just a comment on stop_jobs. We are close; I believe that after sorting this one out we will be able to merge. Thanks again for the time taken.
dask_jobqueue/lsf.py (Outdated)

```python
        logger.debug("Stopping jobs: %s" % jobs)
        if jobs:
            jobs = list(jobs)
            self._call([self.cancel_command] + list(set(jobs)), shell=False)
```
As there is no state anymore, and `shell=False` is the default with Popen, you probably don't need it here, and thus you probably don't need to redefine `stop_jobs` in the LSF implementation.
Thanks. Good spot.
This now seems fine to me. @raybellwaves can you verify that things work well on your LSF cluster?
We can also get feedback from folks in #4 after merging
Merging this tomorrow if there are no further comments. If anyone gets to this before I do and is ok with things, I encourage you to merge if you think it's ready.
Thanks @raybellwaves for sticking with this. The only thing I see this needing is a few docs. See the {index,examples,configurations,api}.rst files for some good places to document the LSFCluster. Bare minimum would be api.rst.
Looks like the api.rst change would be a single line. I'm in favor of that
:) It'd be nice to get this in.
…On Tue, Jul 31, 2018 at 3:34 PM, Joe Hamman wrote:

> Thanks @raybellwaves for sticking with this. The only thing I see this needing is a few docs. See the {index,examples,configurations,api}.rst files for some good places to document the LSFCluster. Bare minimum would be api.rst.
Thanks for all that, I think this is ready.
This is in. Thank you @raybellwaves for implementing this! I think that it will be very valuable
Very nice!
Big thanks to you all for creating the package and your teachings.
working on #4
Thanks to @jhamman for his (in person) help with this. Hopefully I can get this squared away by the end of the week.
FYI I'm testing on Python 3.6 on Pegasus (University of Miami's HPC) and I was getting an un-obvious psutil error (see below). After installing the dependencies I installed a conda version of psutil (`conda install -c conda-forge psutil`) and it went away.