
pm -q fails on large input queries #193

Closed
chu11 opened this issue Jun 20, 2024 · 2 comments · Fixed by #194
chu11 (Member) commented Jun 20, 2024

This works:

> pm -q $(flux hostlist -L 1635 "$(cluset -f 'myclu[1001-5000/2]')")

but these don't:

> pm -q $(flux hostlist -L 1636 "$(cluset -f 'myclu[1001-5000/2]')")
Command too long
> pm -q $(flux hostlist -L 1638 "$(cluset -f 'myclu[1001-5000/2]')")
pm: hostlist error

(FYI, the shell command above expands to

myclu[1001,1003,1005,1007,1009,1011,...,4275]

in the -L 1638 case.)

The "Command too long" error comes from exceeding CP_LINEMAX on the client side.

The hostlist error comes from:

    char argument[CP_LINEMAX];
    if (hostlist_ranged_string(targets, sizeof(argument), argument) == -1)
        err_exit(false, "hostlist error");

which is also using a CP_LINEMAX buffer.

I would like to think just doubling or tripling CP_LINEMAX will be fine, although this comment gives me pause:

/* NOTE: sscanf is safe because 'str' is guaranteed to be < CP_LINEMAX */

That comment is from 2002, and I would like to think longer strings are OK now (a cursory googling did not give me an immediate answer).

garlick (Member) commented Jun 21, 2024

I don't know if it will fix the hostlist error, but we might want to consider pulling in the hostlist implementation from flux, since it's actually had some attention paid to it in the last decade.

I don't think the NOTE comment is about sscanf itself; it's about the source and destination buffers. It's saying the source is at most CP_LINEMAX, so it's OK to sscanf %s into another CP_LINEMAX buffer. So I think increasing CP_LINEMAX should be fine (and would have been fine in 2002 too).

chu11 (Member, Author) commented Jun 21, 2024

Doubling CP_LINEMAX to 16K seemed to solve the above problems.

So for the worst-case input (every other node in a cluster hostrange), an 8K CP_LINEMAX buffer tops out at about a 3200-node cluster (the bug was hit around node 4275; note the example above started at node 1001 to keep the host-number length consistent, so it's really a bit more than 3200 nodes). 3200 nodes is certainly smaller than the biggest HPC clusters out there nowadays, according to Google.

Soooo

16K buffer - worst case of 6400 node cluster
32K buffer - ~12000 node cluster
64K buffer - ~24000 node cluster
128K buffer - ~48000 node cluster

I'd say let's up it to 128K to future-proof this for quite some time.

Edit: and obviously divide by 2 if someone is doing dumb things with hostlist, i.e. node[0,1,2,3,4,5 .... 7998, 7999, 8000]

chu11 added a commit to chu11/powerman that referenced this issue Jun 21, 2024
Problem: On very large clusters, command line operations on a large
number of nodes (e.g. node[1,3,5,...,4997,4999]) can exceed internal
buffers, leading to errors.

Increase the internal maximum line length from 8K to 128K, which should
handle any HPC cluster in existence today and hopefully many more into
the future.

Fixes chaos#193
@mergify mergify bot closed this as completed in #194 (commit d4ba511) Jul 3, 2024