-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pm -q fails on large input queries #193
Comments
I don't know if it will fix the hostlist error but we might want to consider pulling in the hostlist implementation from flux since it's actually had some attention paid to it in the last decade. I don't think the NOTE: comment is about scanf itself. it's about the source and destination buffers. It's say the source is max CP_LINEMAX so it's OK to sscanf %s into another CP_LINEMAX buffer. So I think increasing CP_LINEMAX should be fine (and would have been fine in 2002 also). |
doubling the CP_LINEMAX to 16K seemed to solve the above problems. So for the worst case input (every other node in a cluster hostrange) the limit is about a 3200 node cluster with a 8K CP_LINEMAX buffer (b/c the bug was hit around node 4275. note the example above started at node 1000 to make the host number length consistent throughout, so it's really a bit more than 3200 nodes). 3200 nodes is certainly smaller than the biggest HPC clusters out there nowadays according to Google. Soooo 16K buffer - worst case of 6400 node cluster I'd say lets up it to 128K to future proof this for quite some time. Edit: and obviously divide by 2 if someone is doing dumb things with hostlist ... i.e. |
Problem: On very larger clusters command line operations of a large number of nodes (e.g. node[1,3,5,...,4997,4999]) can exceed internal buffers, leading to errors. Increase the internal maximum line length from 8K to 128K, which should handle any HPC cluster in existence today and hopefully many more into the future. Fixes chaos#193
this works
but these don't
(FYI the bash command above leads to
for the -L 1638 case)
The "Command too long" comes from exceeding CP_LINEMAX on the client side.
The hostlist error comes from:
which is also using a CP_LINEMAX buffer.
I would like to think just doubling or tripling the CP_LINEMAX will be fine, although ...
although that comment is from 2002 and I would like to think longer strings are ok now (a cursory googling did not give me an immediate answer).
The text was updated successfully, but these errors were encountered: