WIP: Use regex for job-id lookup #131
Conversation
This also captures more general output like "Request 123456.asdfqwer submitted to queue: standard."
This looks really good, thanks!
Out of curiosity, which PBS versions do you work with?
Other than that, it would be a good idea to generalize this behaviour to other schedulers; I believe it could prove really useful. We could even extend it to the PBS header, as is done in https://github.com/jupyterhub/batchspawner/blob/master/batchspawner/batchspawner.py#L467. But that is another matter.
I will open some issues to discuss this. Thanks for this PR @willirath.
It's NQSII. I'm running this on an NEC Linux cluster:

```
$ qstat -V
NQSII CUI Version R03.00 / API Version R03.00 (linux)
```
Once I have a fully working setup, I can add something to the example configs.
I'm not familiar with PBS, but is there no command line argument you can pass to PBS qsub that just outputs the job id? (There was one for SGE qsub, which is why I am asking.) I am not convinced about making the regex configurable, but maybe I am grossly underestimating the variability of possible qsub outputs (maybe each cluster defines its own qsub wrapper that prints more information to stdout, or maybe it depends on the PBS version).
@willirath, will you be OK with modifying your PR according to what has been discussed in #132?
I'll do that.
Just to clarify: we expect this regex to work for all clusters? That makes it much simpler, then.
Yes, that's the idea, and it should work.
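To illustrate the idea with outputs quoted elsewhere in this thread (a quick sketch, not the PR's actual code):

```python
import re

# One permissive regex finds the numeric job id in the submission
# output of several schedulers.
outputs = [
    '123456.admin01',                                         # PBS-style
    'Request 123456.asdfqwer submitted to queue: standard.',  # NQSII-style
    'Job <123456> is submitted to default queue <normal>.',   # LSF-style
]
for out in outputs:
    print(re.findall(r'\d+', out)[0])  # -> '123456' in each case
```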
Really nice, thanks for that, just a few comments.
```python
['Request {jobid}.asdf was sumbitted to queue 12.',
 '{jobid}',
 ' <{jobid}> ',
 '{jobid}; asdf'])
```
Nice! Might I ask you to use the complete outputs as examples, as provided in #132 (comment)?
Yes, I'll do this.
dask_jobqueue/core.py
Outdated
```diff
@@ -294,7 +295,7 @@ def start_workers(self, n=1):
         for _ in range(num_jobs):
             with self.job_file() as fn:
                 out = self._submit_job(fn)
-                job = self._job_id_from_submit_output(out.decode())
+                job = _job_id_from_submit_output(out.decode())
```
Are we sure we don't want to leave it as a class method?
Just asking; I don't really have an opinion yet, and I am not a really experienced Python user.
If one day we want to be able to add the regex to the config file, won't it be a problem?
I also don't have any strong opinion on this. Just went for the minimal amount of code necessary to implement (and test) this.
- HTCondor (taken from somewhere on the internet): `Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 6075.` — but there is the `-terse` option that should make it look like `10744.0 - 10744.2`.

Looks like a strong case for leaving the parser a static method of the class. (This way, it would be easy to create an HTCondorCluster with a different parser.)
I'll change this.
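For illustration, overriding the static parser in a hypothetical subclass could look like this (the `-terse` output format is taken from the comment above; this `HTCondorCluster` is a sketch, not part of this PR):

```python
import re

class JobQueueCluster:
    # simplified stand-in for the real base class; only the parsing hook is shown
    @staticmethod
    def _job_id_from_submit_output(out):
        return re.findall(r'\d+', out)[0]

class HTCondorCluster(JobQueueCluster):
    # condor_submit -terse prints something like '10744.0 - 10744.2';
    # keep the cluster id before the first dot
    @staticmethod
    def _job_id_from_submit_output(out):
        return out.split('.')[0].strip()

HTCondorCluster._job_id_from_submit_output('10744.0 - 10744.2')  # -> '10744'
```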
I think I have addressed everything. Should I rebase this into a nicer PR?
If you mean git rebase, we will squash commits when merging anyway.
dask_jobqueue/core.py
Outdated
```diff
-    def _job_id_from_submit_output(self, out):
-        raise NotImplementedError('_job_id_from_submit_output must be implemented when JobQueueCluster is '
-                                  'inherited. It should convert the stdout from submit_command to the job id')
+    @staticmethod
```
Is this decorator really required?
I'm not sure how it will be handled if a subclass needs to override it.
As we don't need any implicit args, overriding works just fine (you can also override with a non-static method):
```python
# base class
class AClass:
    @staticmethod
    def parse_string(string):
        return string.split(".")

AClass.parse_string("one.two")  # --> ["one", "two"]

# identical copy, no override
class BClass(AClass):
    pass

BClass.parse_string("one.two")  # --> ["one", "two"]

# override as static method
class CClass(AClass):
    @staticmethod
    def parse_string(string):
        return string.split(";")

CClass.parse_string("one;two")  # --> ["one", "two"]
AClass.parse_string("one;two")  # --> ["one;two"]

# override as regular method
class DClass(AClass):
    separator = "-"

    def parse_string(self, string):
        return string.split(self.separator)

dc = DClass()  # need to instantiate before calling
dc.parse_string("one-two")  # --> ["one", "two"]
```
docs/index.rst
Outdated
```diff
@@ -285,5 +285,4 @@ problems are the following:
 We use submit command stdout to parse the job_id corresponding to the
 launched group of worker. If the parsing fails, then dask-jobqueue won't work
 as expected and may throw exceptions. You can have a look at the parsing
-function in every ``JobQueueCluster`` implementation, see
-``_job_id_from_submit_output`` function.
+function in the ``JobQueueCluster._job_id_from_submit_output`` function.
```
The word "function" is duplicated here; maybe we should phrase it differently.
A few comments. I have started working on an OARCluster implementation, and I realised that the `\d+` regexp will not work for OAR. This is the reason for most of my comments below.
dask_jobqueue/core.py
Outdated
```diff
-                                  'inherited. It should convert the stdout from submit_command to the job id')
+    @staticmethod
+    def _job_id_from_submit_output(out):
+        return re.findall(r'\d+', out)[0]
```
A few comments about this:

- I am in favour of using a class variable for the regexp pattern. This way the pattern can be tweaked more easily in derived classes, and it can also be tweaked by the user if the cluster config is a bit special:

```python
cluster = MyCluster(...)
cluster.job_id_regexp = my_special_regexp
```

- I think the best way is to have a regexp with a named group, `r'(?P<job_id>\d+)'` by default. This way you can use the name of the match, which is more robust:

```python
match = re.search(self.job_id_regexp, out)
job_id = match.groupdict().get('job_id')
```

- There should be better error treatment:

```python
if job_id is None:
    raise ValueError('Could not parse job id from submission command output.'
                     ' Job id regexp is {}, submission command output is: {}'.format(
                         self.job_id_regexp, out))
```
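Put together, the three suggestions above would amount to something like this (a sketch combining the snippets, not the final merged code):

```python
import re

class JobQueueCluster:
    # class variable, so the pattern can be tweaked in derived classes or by users
    job_id_regexp = r'(?P<job_id>\d+)'

    def _job_id_from_submit_output(self, out):
        match = re.search(self.job_id_regexp, out)
        if match is None:
            raise ValueError('Could not parse job id from submission command '
                             'output. Job id regexp is {}, submission command '
                             'output is: {}'.format(self.job_id_regexp, out))
        return match.groupdict().get('job_id')
```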
```python
@pytest.mark.parametrize(
    'qsub_return_string',
    ['{jobid}.admin01',
```
Nice test! Could you use `job_id` for consistency reasons?
I went for only having `job_id`.
dask_jobqueue/core.py
Outdated
```diff
@@ -138,6 +139,7 @@ class JobQueueCluster(Cluster):
     cancel_command = None
     scheduler_name = ''
     _adaptive_options = {'worker_key': lambda ws: _job_id_from_worker_name(ws.name)}
+    job_id_regexp = r'\d+'
```
Put the full regexp here, i.e. `(?P<job_id>\d+)`. This way it is fully customisable in derived classes. Currently you are assuming that the regexp starts with `(?P<job_id>` in `_job_id_from_submit_output`, which may not be appropriate.
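For example, a derived class could then anchor the pattern on scheduler-specific text (a hypothetical sketch; the `OAR_JOB_ID=` line is an assumption about oarsub's output):

```python
import re

class OARCluster:  # hypothetical, simplified to stand alone
    job_id_regexp = r'OAR_JOB_ID=(?P<job_id>\d+)'

    def _job_id_from_submit_output(self, out):
        return re.search(self.job_id_regexp, out).group('job_id')

OARCluster()._job_id_from_submit_output('OAR_JOB_ID=12345')  # -> '12345'
```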
Done.
dask_jobqueue/core.py
Outdated
```diff
-        raise NotImplementedError('_job_id_from_submit_output must be implemented when JobQueueCluster is '
-                                  'inherited. It should convert the stdout from submit_command to the job id')
+        regexp = r'(?P<job_id>{})'.format(self.job_id_regexp)
+        match = re.search(regexp, out)
```
I think to cover all the cases that can error with a not very helpful error message, you need to do something like this:

```python
msg = 'Could not parse ...'
if match is None:
    raise ValueError(msg)
job_id = match.groupdict().get('job_id')
if job_id is None:
    raise ValueError(msg)
```

It would be great to have a test for the error messages; you can use something like:

```python
with pytest.raises(ValueError, match=error_message_regexp):
    ...
```
Done.
The `match` keyword gets only a part of the expected error message. Not sure if this is enough. It looks similar to comparable tests in dask-ml, though.
dask_jobqueue/core.py
Outdated
```python
        match = re.search(self.job_id_regexp, out)
        job_id = match.group('job_id')
        if job_id is None:
            raise ValueError('Could not parse job id from submission command'
```
Add a space at the end of each line; otherwise the text will read like "submission commandoutput".
Done.
@lesteve What do you think?
dask_jobqueue/core.py
Outdated
```python
        if match is None:
            raise ValueError(msg)

        job_id = match.group('job_id')
```
Can you use `job_id = match.groupdict().get('job_id')` to have a better error, as I mentioned (probably folded away in an outdated diff)?
Thinking about this, it would be great to have a better error in the case where there is a match but no `job_id` named group, i.e. something like:

```python
"You need to use a `job_id` named group in your regexp, e.g. '(?P<job_id>\d+)'. Your regexp was: {}".format(self.job_id_regexp)
```
```python
     '{job_id};cluster',
     'Job <{job_id}> is submitted to default queue <normal>.',
     '{job_id}',
     pytest.param('{job_id}.admin01', marks=pytest.mark.xfail)])
```
I forgot you could do something like this with parametrize, nice! Can you explain why it is xfail though?
In this case `{job_id}` expands to `XXXXX` (see https://github.com/dask/dask-jobqueue/pull/131/files/9c83c296c33421c2ec496580b67a81feaa58f6b1#diff-6715b450c58d2e22119574769a68a1d6R92). Then the `01` in `admin01` would be interpreted as the `job_id`.
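A quick way to see the failure (assuming the placeholder expands to the non-numeric sentinel `XXXXX`, as in the linked diff):

```python
import re

out = 'XXXXX.admin01'  # '{job_id}' expanded to the sentinel
re.search(r'(?P<job_id>\d+)', out).group('job_id')  # -> '01', not the job id
```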
```python
@pytest.mark.parametrize('Cluster', [PBSCluster, MoabCluster, SLURMCluster,
                                     SGECluster, LSFCluster])
@pytest.mark.parametrize(
```
General remark: for the error cases I would remove this parametrize and just test with something like this (sketched below):

- an output without a match, e.g. `'there is no number here'`
- a regexp that matches but does not have a `job_id` named group, e.g. `cluster.job_id_regexp = r'\d+'` and the output `'Job <12345> is submitted to default queue <normal>.'`
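A sketch of those two error cases as tests (using a standalone helper for brevity; the real tests would go through the Cluster classes):

```python
import re
import pytest

def job_id_from_submit_output(out, job_id_regexp=r'(?P<job_id>\d+)'):
    # standalone stand-in for the method under discussion
    match = re.search(job_id_regexp, out)
    if match is None:
        raise ValueError('Could not parse job id from submission command output')
    job_id = match.groupdict().get('job_id')
    if job_id is None:
        raise ValueError('You need to use a job_id named group in your regexp')
    return job_id

def test_output_without_match():
    with pytest.raises(ValueError, match='Could not parse job id'):
        job_id_from_submit_output('there is no number here')

def test_regexp_without_named_group():
    with pytest.raises(ValueError, match='job_id named group'):
        job_id_from_submit_output(
            'Job <12345> is submitted to default queue <normal>.',
            job_id_regexp=r'\d+')
```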
@lesteve I think I've addressed everything now. (I probably won't get back to this before Wednesday, so feel free to push stuff to this branch. :))
I pushed some small tweaks, merging, thanks a lot @willirath!
I just want to say, I am a fan of the general approach here. I'm also glad to see some tests for parsing these job ids. Thanks to all of you for ticking this one off.
There are PBS schedulers that return more verbose output like: "Request 123456.asdfqwer submitted to queue: standard."
f0566c8 should be fully compatible with the former version. It's probably a good thing to make the regex configurable; I'll go ahead and add this as well.