This is a feature request to add support for +e (empty relative node) to rankfiles. PR #720 (Rankfile per prun) does not cover the use case of multiple concurrent prun invocations, where you want to run multiple independent jobs in one DVM without sharing any nodes.
I think that use case needs support for the +e relative node specifier. With +n, separate jobs end up allocated onto different slots on the same nodes.
Since the jobs are independent, it doesn't make sense to require constructing a set of rankfiles (one rankfile per job) that are aware of each other, i.e. one set of rankfiles per particular set of jobs.
Desired:
cat arankfile
rank 0=+e0 slot=0
rank 1=+e1 slot=0
# ^ one rankfile, re-used for each job, by means of relative node specs
prte --daemonize
prun -n 2 --map-by rankfile:FILE=arankfile:NOLOCAL bash hostname-sleep.sh 10 &
b07n13
b07n14
prun -n 2 --map-by rankfile:FILE=arankfile:NOLOCAL bash hostname-sleep.sh 10 &
b07n15
b07n16
pterm
Actual:
The above syntax is rejected (only +nX is supported).
With the following rankfile, both jobs are allocated onto the same two nodes (as expected):
rank 0=+n0 slot=0
rank 1=+n1 slot=0
@acolinisi It will probably be a while before I can get to this. You are pretty savvy, so perhaps you might want to take a crack at it? The rankfile code is in the src/mca/rmaps/rank_file directory. The file itself gets parsed using flex, and the lexical directives are in rmaps_rank_file_lex.l. Currently, it only picks up +n as a "relative node syntax" directive, so you'd need to add +e to that one.
You then need to extend the code in rmaps_rank_file.c starting at line 271 to account for the +e option. You can see how +n was handled, so it shouldn't be too difficult (I think).
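For anyone picking this up, here is a rough, self-contained sketch of the difference between the two specifiers as requested in this issue. It is not the PRRTE code: `node_t`, `procs_in_use`, and `resolve_relative_node` are hypothetical names made up for illustration, and the usage counts are assumed to be a snapshot taken before the job is mapped, so that +e0 and +e1 within one rankfile resolve to two different empty nodes.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified view of a DVM node: its name plus the number of
 * processes that earlier jobs have already placed on it. This is NOT the
 * PRRTE node type, only an illustration. */
typedef struct {
    const char *name;
    int procs_in_use;
} node_t;

/* Resolve a relative node spec such as "+n1" or "+e1" against a node list.
 *   "+nX": the X-th node in the list, regardless of current usage.
 *   "+eX": the X-th node that currently hosts no processes.
 * Returns the index into nodes[], or -1 if the spec is malformed or there
 * is no such node. */
int resolve_relative_node(const node_t *nodes, size_t num_nodes, const char *spec)
{
    if (spec == NULL || spec[0] != '+') {
        return -1;
    }
    if (spec[1] != 'n' && spec[1] != 'e') {
        return -1;
    }
    if (!isdigit((unsigned char)spec[2])) {
        return -1;
    }

    char *end = NULL;
    long want = strtol(spec + 2, &end, 10);
    if (*end != '\0') {
        return -1; /* trailing junk after the number */
    }

    int only_empty = (spec[1] == 'e');
    long seen = 0;
    for (size_t i = 0; i < num_nodes; i++) {
        if (only_empty && nodes[i].procs_in_use > 0) {
            continue; /* +e skips nodes that are already in use */
        }
        if (seen == want) {
            return (int)i;
        }
        seen++;
    }
    return -1; /* not enough (empty) nodes to satisfy the spec */
}
```

With four nodes where the first two are already occupied by an earlier job, "+n0" still resolves to the first node, while "+e0" and "+e1" resolve to the third and fourth, which is the behavior shown under "Desired" above.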