Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Torque module doesn't support job arrays #64

Open
GoogleCodeExporter opened this issue May 11, 2015 · 8 comments
Open

Torque module doesn't support job arrays #64

GoogleCodeExporter opened this issue May 11, 2015 · 8 comments

Comments

@GoogleCodeExporter
Copy link

The torque module works great for normal jobs but seems to fail with job arrays.

pdsh -j 12308812[27] hostname
pdsh@nyx: invalid jobid format "12308812[27]" for -j
[root@nyx pdsh]# pdsh -j '12308812[27]' hostname
pdsh@nyx: invalid jobid format "12308812[27]" for -j

Is this a known issue?

Thanks,
 - Matt

Original issue reported on code.google.com by msbr...@umich.edu on 21 May 2014 at 3:16

@GoogleCodeExporter
Copy link
Author

Good catch. However, I do not have a Torque system on which to test.
The original work was done by someone else.

However, looking at the code it does seem it assumes all jobids will be 
convertible
to numbers, so that step in the code is failing.

If you can point me to a reference for querying nodes in array jobs from the C 
API,
I could take a stab at providing you a patch for testing. Or, if you are feeling
adventurous, you could propose a patch yourself.

Thanks
mark

Original comment by mark.gro...@gmail.com on 21 May 2014 at 3:22

  • Changed state: Accepted

@planetmija
Copy link

Hi,

I would be interested in this as well. I found the torque source code here:
https://github.com/adaptivecomputing/torque

Usually jobids are in the for of 12345 but in arrays an index is added, so it looks like 12345[1], 12345[2], ...

Obviously the code exits when it tries to convert the string "jobid" to long int "jid" with "strtoul". So preserving the array index or use char instead of int and just regex for [0-9[]] should work, the rest is actually the same.

Best,
Michael

@grondo
Copy link
Member

grondo commented Aug 17, 2016

What if you just take out the code that verifies the jobid, and let Torque do the verification?

As I said, I don't personally have access to a Torque/PBS system on which to test, and I'm not even the author of the original code, but perhaps you could test something like this?

Feel free to open a PR if you find something that works, or have other fixes. Thanks!

diff --git a/src/modules/torque.c b/src/modules/torque.c
index 4d7f6e3..2ceacd1 100644
--- a/src/modules/torque.c
+++ b/src/modules/torque.c
@@ -124,23 +124,6 @@ static int mod_torque_init (void)
 }


-static int32_t str2jobid (const char *str)
-{
-    char *p = NULL;
-    long int jid;
-
-    if (str == NULL)
-        return (-1);
-
-    jid = strtoul (str, &p, 10);
-
-    if( *p != '\0' )
-        errx ("%p: invalid jobid format \"%s\" for -j\n", str);
-
-    return ((int32_t) jid);
-}
-
-
 static int
 torque_process_opt(opt_t *pdsh_opts, int opt, char *arg)
 {
@@ -180,14 +163,8 @@ static int mod_torque_wcoll(opt_t *opt)
     return 0;
 }

-static void _create_fq_jobid(char *dst, const char *jobid, const char *serverna
-  /*
-   *  Create fully qualified jobid, "<integer>.<servername>"
-   */
-    if ( str2jobid(jobid) < 0 ){
-        *dst = '\0';
-        return;
-    }
+static void _create_fq_jobid(char *dst, const char *jobid, const char *serverna
+{
     strncpy(dst, jobid, PBS_MAXSEQNUM);
     strncat(dst, ".", 1);
     strncat(dst, servername, PBS_MAXSERVERNAME);

@planetmija
Copy link

Obviously the easiest way :)

pdsh -j 35985 hostname
n3551: n3551.tld
pdsh -j 35984[1] hostname
n3552: n3552.tld

Thanks a lot!

Complete diff (not truncated):

diff --git a/src/modules/torque.c b/src/modules/torque.c
index 4d7f6e3..8f56af0 100644
--- a/src/modules/torque.c
+++ b/src/modules/torque.c
@@ -124,23 +124,6 @@ static int mod_torque_init (void)
 }


-static int32_t str2jobid (const char *str)
-{
-    char *p = NULL;
-    long int jid;
-
-    if (str == NULL)
-        return (-1);
-
-    jid = strtoul (str, &p, 10);
-
-    if( *p != '\0' )
-        errx ("%p: invalid jobid format \"%s\" for -j\n", str);
-
-    return ((int32_t) jid);
-}
-
-
 static int
 torque_process_opt(opt_t *pdsh_opts, int opt, char *arg)
 {
@@ -180,14 +163,11 @@ static int mod_torque_wcoll(opt_t *opt)
     return 0;
 }

-static void _create_fq_jobid(char *dst, const char *jobid, const char *servername){
+static void _create_fq_jobid(char *dst, const char *jobid, const char *servername)
+{
   /*
    *  Create fully qualified jobid, "<integer>.<servername>"
    */
-    if ( str2jobid(jobid) < 0 ){
-        *dst = '\0';
-        return;
-    }
     strncpy(dst, jobid, PBS_MAXSEQNUM);
     strncat(dst, ".", 1);
     strncat(dst, servername, PBS_MAXSERVERNAME);

@planetmija
Copy link

planetmija commented Aug 17, 2016

For completion here errors for malformed IDs:

pdsh@login1: Failed to retrieve information for job 123234.tld: (15001) Unknown Job Id Error
pdsh@login1: no remote hosts specified

pdsh@login1: Failed to retrieve information for job a123234.tld: (15020) Unknown queue
pdsh@login1: no remote hosts specified

pdsh@login1: Failed to retrieve information for job 123234a.tld: (15001) Unknown Job Id Error
pdsh@login1: no remote hosts specified

pdsh@login1: Failed to retrieve information for job 1a2b3b2b.tld: (15001) Unknown Job Id Error

Corresponding Torque jobs stats:

qstat: Unknown Job Id Error 123234.tld

qstat: illegally formed job identifier: 123234d
qstat: Error (1 - Operation not permitted) 

qstat: Unknown queue destination a123234

This seems OK. Works with Moab numeric IDs and Moab job arrays as well.

@grondo
Copy link
Member

grondo commented Aug 17, 2016

Great, thanks for testing. Just to verify, should I throw this commit into master?

@planetmija
Copy link

I would say yes. It's less broken than before AFAICT. And if the module gets a wrong ID Torque just complains. If it is OK it gets the Job stats correctly. I think there's no need to check if the ID is formatted correctly.

Thanks a lot again.
Best regards,
Michael

@grondo
Copy link
Member

grondo commented Aug 17, 2016

Thank you for testing!

On Wed, Aug 17, 2016 at 2:44 PM, Michael notifications@github.com wrote:

I would say yes. It's less broken than before AFAICT. And if the module
gets a wrong ID Torque just complains. If it is OK it gets the Job stats
correctly. I think there's no need to check if the ID is formatted
correctly.

Thanks a lot again.
Best regards,
Michael


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/grondo/pdsh/issues/64#issuecomment-240558801, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAtSUijGxsxf8ROQ1pcevgww1LZ7Jr_Rks5qg4C-gaJpZM4Jm1pO
.

grondo added a commit that referenced this issue Aug 17, 2016
Torque/PBS jobids are not necessarily strict integers, for example
array jobs have the array index in brackets following the jobid.

Since jobids are passed to Torque API as strings, just skip the
attempt to convert to integer (and corresponding error for array
jobs), and let the Torque API report any jobid error back to user.

Fixes #64
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants