WIP: Use regex for job-id lookup #131

Merged (25 commits, Aug 28, 2018)

Conversation

@willirath (Collaborator) commented Aug 22, 2018

There are PBS schedulers that return more verbose output, such as: "Request 123456.asdfqwer submitted to queue: standard."

  • f0566c8 should be fully compatible with the former version.
  • It's probably a good idea to make the regex configurable. I'll go ahead and add this as well.
  • Moving the regex matching into core.py
  • Making it a static string in the code, non-configurable, with a value of r'\d+' (see the sketch below)
  • Deleting all _job_id_from_submit_output from JobQueueCluster subclasses
  • Adding tests
  • Updating the troubleshooting section of the docs
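
For reference, a rough sketch of what the non-configurable variant boils down to (illustrative only, not the final diff):

import re

def _job_id_from_submit_output(out):
    # take the first run of digits in the scheduler's submit output, e.g.
    # "Request 123456.asdfqwer submitted to queue: standard." -> "123456"
    return re.findall(r'\d+', out)[0]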

willirath and others added 2 commits August 22, 2018 11:23
This also captures more general output like "Request 123456.asdfqwer submitted to queue: standard."
@guillaumeeb (Member) left a comment


This looks really good, thanks!

Out of curiosity, what PBS versions do you work with?

Other than that, it would be a good idea to generalize this behaviour to other schedulers; I believe it could prove really useful.

We could even extend it to the PBS header, similar to what is done in https://github.com/jupyterhub/batchspawner/blob/master/batchspawner/batchspawner.py#L467. But that is still another matter.

I would open some issues to discuss this. Thanks for this PR @willirath.

@willirath (Collaborator Author)

Out of curiosity, what PBS versions do you work with?

It's NQSII. I'm running this on an NEC Linux cluster:

$ qstat -V
NQSII CUI Version R03.00 / API Version R03.00 (linux)

@willirath (Collaborator Author)

It's NQSII. I'm running this on an NEC Linux cluster:

Once I have a fully working setup, I can add something to the example configs.

@lesteve (Member) commented Aug 22, 2018

I am not familiar with PBS, but is there no command-line argument you can pass to PBS qsub that just outputs the job id? (There was one for SGE qsub, which is why I am asking.)

I am not convinced about making the regex configurable, but maybe I am grossly underestimating the variability of possible qsub outputs (maybe each cluster defines its own qsub wrapper which prints more information to stdout, or maybe it depends on the PBS version).

@guillaumeeb (Member)

@willirath, would you be OK with modifying your PR according to what has been discussed in #132:

  • Moving the regex matching into core.py
  • Making it a static string in the code, non-configurable, with a value of r'\d+'
  • Deleting all _job_id_from_submit_output from JobQueueCluster subclasses

@willirath (Collaborator Author)

I'll do that.

@willirath (Collaborator Author)

Making it a static string in the code, non-configurable, with a value of r'\d+'

Just to clarify: we expect this regex to work for all clusters? That makes it much simpler, then.

@guillaumeeb (Member)

Just to clarify: we expect this regex to work for all clusters? That makes it much simpler, then.

Yes, that's the idea, and it should work.
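
As a quick illustration, r'\d+' picks out the job id from the differently shaped outputs quoted in this thread:

import re

outputs = ['123456.admin01',                                         # PBS-style
           'Request 123456.asdfqwer submitted to queue: standard.',  # NQSII-style
           'Job <123456> is submitted to default queue <normal>.']   # LSF-style
for out in outputs:
    assert re.findall(r'\d+', out)[0] == '123456'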

@guillaumeeb (Member) left a comment


Really nice, thanks for that, just a few comments.

['Request {jobid}.asdf was sumbitted to queue 12.',
 '{jobid}',
 ' <{jobid}> ',
 '{jobid}; asdf'])
Member

Nice!

Might I ask you to use complete outputs, like the examples provided in #132 (comment)?

Collaborator Author

Yes, I'll do this.

@@ -294,7 +295,7 @@ def start_workers(self, n=1):
         for _ in range(num_jobs):
             with self.job_file() as fn:
                 out = self._submit_job(fn)
-                job = self._job_id_from_submit_output(out.decode())
+                job = _job_id_from_submit_output(out.decode())
Member

Are we sure we don't want to leave it as a class method?

Just asking; I don't really have an opinion yet, and I am not a really experienced Python user.

If one day we want to be able to add the regex to the config file, won't it be a problem?

Collaborator Author

I also don't have any strong opinion on this. Just went for the minimal amount of code necessary to implement (and test) this.

Collaborator Author

  • HTCondor (output taken from somewhere on the internet):
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 6075.

# but there is the -terse option, which should make it look like
10744.0 - 10744.2

Looks like a strong case for leaving the parser a static method of the class. (This way, it would be easy to create an HTCondorCluster with a different parser.)

I'll change this.
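
For illustration, a hypothetical HTCondorCluster (no such class exists in this PR) could then override the static method, assuming condor_submit -terse output:

from dask_jobqueue.core import JobQueueCluster

class HTCondorCluster(JobQueueCluster):  # hypothetical, for illustration only
    @staticmethod
    def _job_id_from_submit_output(out):
        # `condor_submit -terse` prints e.g. "10744.0 - 10744.2";
        # keep the cluster id before the first dot
        return out.split('.')[0].strip()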

@willirath (Collaborator Author)

I think I have addressed everything. Should I rebase this into a nicer PR?

@guillaumeeb (Member)

If you mean git rebase, we will squash commits when merging anyway.
@lesteve could you take a look at this one?

    def _job_id_from_submit_output(self, out):
        raise NotImplementedError('_job_id_from_submit_output must be implemented when JobQueueCluster is '
                                  'inherited. It should convert the stdout from submit_command to the job id')

    @staticmethod
Member

Is this annotation really required?

I'm not sure how it will be handled if a subclass needs to override it.

@willirath (Collaborator Author), Aug 24, 2018

As we don't need any implicit args (self or cls), the @staticmethod annotation fits here.

Overriding works just fine (you can also override with a non-static method):

# base class
class AClass:

    @staticmethod
    def parse_string(string):
        return string.split(".")


AClass.parse_string("one.two")  # --> ["one", "two"]


# identical copy, no override
class BClass(AClass):
    pass


BClass.parse_string("one.two")  # --> ["one", "two"]


# override as static method
class CClass(AClass):

    @staticmethod
    def parse_string(string):
        return string.split(";")


CClass.parse_string("one;two")  # --> ["one", "two"]
AClass.parse_string("one;two")  # --> ["one;two"]


# override as method
class DClass(AClass):

    separator = "-"

    def parse_string(self, string):
        return string.split(self.separator)


dc = DClass()  # need to instantiate before calling
dc.parse_string("one-two")  # --> ["one", "two"]

@guillaumeeb mentioned this pull request Aug 24, 2018
docs/index.rst (Outdated)
@@ -285,5 +285,4 @@ problems are the following:
 We use submit command stdout to parse the job_id corresponding to the
 launched group of worker. If the parsing fails, then dask-jobqueue won't work
 as expected and may throw exceptions. You can have a look at the parsing
-function in every ``JobQueueCluster`` implementation, see
-``_job_id_from_submit_output`` function.
+function in the ``JobQueueCluster._job_id_from_submit_output`` function.
Member

The word "function" is duplicated here; maybe we should formulate it differently.

@lesteve (Member) left a comment

A few comments. I have started working on an OARCluster implementation and I realised that the \d+ regexp will not work for OAR. This is the reason for most of my comments below.

                                  'inherited. It should convert the stdout from submit_command to the job id')

    @staticmethod
    def _job_id_from_submit_output(out):
        return re.findall(r'\d+', out)[0]
@lesteve (Member), Aug 24, 2018

A few comments about this (a combined sketch follows the list):

  • I am in favour of using a class variable for the regexp pattern. This way the regexp pattern can be tweaked more easily in derived classes. It can also be tweaked by the user, if needed, in the case where the cluster config is a bit special:

cluster = MyCluster(...)
cluster.job_id_regexp = my_special_regexp

  • I think the best way is to have a regexp with a named group, r'(?P<job_id>\d+)' by default. This way you can use the name of the group, which is more robust:

match = re.search(self.job_id_regexp, out)
job_id = match.groupdict().get('job_id')

  • There should be better error treatment:

if job_id is None:
    raise ValueError('Could not parse job id from submission command output.'
                     ' Job id regexp is {}, submission command output is: {}'.format(
                         self.job_id_regexp, out))
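
Taken together, these three suggestions amount to something like the following standalone sketch (illustrative, not the final implementation):

import re

class JobQueueCluster:
    # class variable: easy to tweak in derived classes or at runtime
    job_id_regexp = r'(?P<job_id>\d+)'

    def _job_id_from_submit_output(self, out):
        match = re.search(self.job_id_regexp, out)
        job_id = match.groupdict().get('job_id') if match else None
        if job_id is None:
            raise ValueError('Could not parse job id from submission command output.'
                             ' Job id regexp is {}, submission command output is: {}'
                             .format(self.job_id_regexp, out))
        return job_id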


@pytest.mark.parametrize(
    'qsub_return_string',
    ['{jobid}.admin01',
@lesteve (Member), Aug 24, 2018

Nice test! Could you use job_id instead of jobid, for consistency reasons?

@willirath (Collaborator Author)

f72a7f1 should address all your points @lesteve. Let me know what you think.

@willirath (Collaborator Author)

I went for only having r'\d+' in the class var, and then constructing the named-group regexp in the function. That should make it easier for a user to override at runtime.

@@ -138,6 +139,7 @@ class JobQueueCluster(Cluster):
     cancel_command = None
     scheduler_name = ''
     _adaptive_options = {'worker_key': lambda ws: _job_id_from_worker_name(ws.name)}
+    job_id_regexp = r'\d+'
Member

Put the full regexp here, i.e. (?P<job_id>\d+). This way it is fully customisable in derived classes. Currently you are assuming that the regexp starts with (?P<job_id> in _job_id_from_submit_output, which may not be appropriate.
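
A derived class could then ship its own full pattern, e.g. (hypothetical sketch; OARCluster does not exist yet, and the "OAR_JOB_ID=..." output line is an assumption about oarsub):

from dask_jobqueue.core import JobQueueCluster

class OARCluster(JobQueueCluster):  # hypothetical example
    # oarsub prints a line like "OAR_JOB_ID=12345"
    job_id_regexp = r'OAR_JOB_ID=(?P<job_id>\d+)'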

Collaborator Author

Done.

        raise NotImplementedError('_job_id_from_submit_output must be implemented when JobQueueCluster is '
                                  'inherited. It should convert the stdout from submit_command to the job id')
        regexp = r'(?P<job_id>{})'.format(self.job_id_regexp)
        match = re.search(regexp, out)
Member

I think that, to cover all the cases that currently error with a not very helpful message, you need to do something like this:

msg = 'Could not parse ...'

if match is None:
    raise ValueError(msg)

job_id = match.groupdict().get('job_id')

if job_id is None:
    raise ValueError(msg)

It would be great to have a test for error messages: you can use something like:

with pytest.raises(ValueError, match=error_message_regexp):
    ...

Collaborator Author

Done.

Collaborator Author

The match keyword only checks part of the expected error message. Not sure if this is enough; it looks similar to the corresponding tests in dask-ml, though.

        match = re.search(self.job_id_regexp, out)
        job_id = match.group('job_id')
        if job_id is None:
            raise ValueError('Could not parse job id from submission command'
Member

Add a space at the end of each line, otherwise the text will read something like "submission commandoutput".

Collaborator Author

Done.

@willirath (Collaborator Author)

@lesteve What do you think?

        if match is None:
            raise ValueError(msg)

        job_id = match.group('job_id')
@lesteve (Member), Aug 27, 2018

Can you use job_id = match.groupdict().get('job_id') to have a better error as I mentioned (probably folded away in an outdated diff).

Thinking about this, it would be great to have a better error in the case where there is a match but no job_id named group, i.e. something like:

"You need to use a `job_id` named group in your regexp, e.g. '(?P<job_id>\d+)'. Your regexp was: {}".format(self.job_id_regexp) 

     '{job_id};cluster',
     'Job <{job_id}> is submitted to default queue <normal>.',
     '{job_id}',
     pytest.param('{job_id}.admin01', marks=pytest.mark.xfail)])
Member

I forgot you could do something like this with parametrize, nice! Can you explain why it is xfail though?

Collaborator Author

In this case, {job_id} expands to XXXXX (see https://github.com/dask/dask-jobqueue/pull/131/files/9c83c296c33421c2ec496580b67a81feaa58f6b1#diff-6715b450c58d2e22119574769a68a1d6R92). Then the 01 in admin01 would be interpreted as the job id.
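
A quick check of that failure mode (illustrative):

import re

# with a non-numeric job id substituted, r'\d+' latches onto the digits in "admin01"
match = re.search(r'(?P<job_id>\d+)', 'XXXXX.admin01')
print(match.group('job_id'))  # --> '01', not the actual job id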


@pytest.mark.parametrize('Cluster', [PBSCluster, MoabCluster, SLURMCluster,
                                     SGECluster, LSFCluster])
@pytest.mark.parametrize(
Member

General remark: for the error cases, I would remove this parametrize and just test with something like this (a sketch of these tests follows the list):

  • an output without a match, e.g. 'there is no number here'
  • a regexp that matches but does not have a job_id named group, e.g. cluster.job_id_regexp = r'\d+' with an output of 'Job <12345> is submitted to default queue <normal>.'
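
A rough sketch of those two tests, with a standalone copy of the parsing logic so the example is self-contained (names and messages are illustrative, not the final test code):

import re
import pytest

def job_id_from_submit_output(out, job_id_regexp=r'(?P<job_id>\d+)'):
    # standalone copy of the parsing logic discussed in this PR
    match = re.search(job_id_regexp, out)
    job_id = match.groupdict().get('job_id') if match else None
    if job_id is None:
        raise ValueError('Could not parse job id from submission command output')
    return job_id

def test_no_match_in_output():
    with pytest.raises(ValueError, match='Could not parse job id'):
        job_id_from_submit_output('there is no number here')

def test_regexp_without_job_id_group():
    with pytest.raises(ValueError, match='Could not parse job id'):
        job_id_from_submit_output(
            'Job <12345> is submitted to default queue <normal>.',
            job_id_regexp=r'\d+')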

@willirath (Collaborator Author)

@lesteve I think I've addressed everything now.

(I probably won't get back to this before Wednesday, so feel free to push stuff to this branch. :) )

@lesteve (Member) commented Aug 28, 2018

I pushed some small tweaks, merging, thanks a lot @willirath!

@jhamman (Member) commented Aug 28, 2018

I just want to say, I am a fan of the general approach here. I'm also glad to see some tests for parsing these job ids. Thanks to all of you for ticking this one off.
