-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
flux-job: attach: support MPIR tool launch extension #5758
Conversation
b2532c9
to
bc505ab
Compare
Mentioning @lee218llnl since he may be interested or have a perspective on this part of the MPIR spec. |
I have an improvement for the MPIR rangelist representation that I think makes addition of the broker ranks to the {"hosts":[["corona",[[187,-47],[13,-47],[1,-47,1],[2,-47],[1,-47,2],[6,-47],[1,-47,1],[2,-47],[1,-47,4],[2,-47],[1,-47,34],[2,-47],
[1,-47,9]]]],"executables":[["sleep",[[-1,-3071]]]],"ids":[[0,3071]],"pids":[[1535168,47],[-535966,47],[-228557,47],
[686167,47],[-622563,47],[-129439,47],[2866245,47],[-1739757,47],[220055,47],[-428264,47],[217894,47],[48174,47],
[-56983,47],[-157167,47],[-242878,47],[348002,47],[141991,47],[-1051847,47],[853639,47],[279800,47],[-315681,47],
[-1396323,47],[-1368,47],[944,47],[-24426,47],[139,47],[-112,47],[-133,47],[-264,47],[141,47],[-19694,47],[19383,47],
[-3497,47],[548,47],[-5766,47],[-171,47],[-344,47],[1131,47],[-3202,47],[-243,47],[104,47],[39,47],[-98,47],[224,47],
[-186,47],[53,47],[-393,47],[98,47],[-1593,47],[-378,47],[97,47],[-113,47],[101,47],[-69,47],[-53,47],[214,47],[-369,47],
[149,47],[-369,47],[-1869,47],[-84,47,1],[-62,47],[-133,47]],"ranks":[[0,-47],[1,-47,62]]} The additional overhead is the |
bc505ab
to
ec13bda
Compare
ce3292f
to
b135759
Compare
Just removed WIP. However, once #5605 goes in, possibly this PR should be updated to use the shell rexec service instead of the broker service. So while an early review would be welcome, that small change could be forthcoming. |
Just a quick meta question - would it make sense to have a libmpir convenience library that concentrates this stuff and any unit tests? |
There are already unit tests in There is also the small |
Sure, I was just asking. It sort of worked out with the PMI stuff but good point about the special symbols. |
Yeah, we should revisit it perhaps in the future when not pressed for time. Maybe I'm overthinking the symbols... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few minor typos and a question for clarification. This is great to have done!
src/shell/mpir/rangelist.c
Outdated
@@ -25,16 +25,20 @@ | |||
* which is delta encoded from the previous entry (or 0 if this is the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In "shell: mpir: compress repeating ranges in rangelist" commit message: s/commoon/common/
Also is that space in one of the ranges supposed to be there?
src/shell/mpir/rangelist.c
Outdated
static int increment_range_repeat (json_t *range) | ||
{ | ||
if (json_array_size (range) == 3) { | ||
json_int_t repeat = range_json_val (range, 2); | ||
return json_array_set_new (range, 2, json_integer (repeat + 1)); | ||
} | ||
return json_array_append_new (range, json_integer (1)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If either of the _new()
functions fails, the json_integer()
value is leaked.
src/cmd/job/mpir.c
Outdated
rank = idset_first (ranks); | ||
while (rank != IDSET_INVALID_ID) { | ||
flux_subprocess_t *p; | ||
if (!(p = flux_rexec (h, rank, 0, cmd, &ops))) | ||
log_err ("MPIR: failed to launch %s", tool_path); | ||
rank = idset_next (ranks, rank); | ||
} | ||
flux_cmd_destroy (cmd); | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the subprocess objects leaked here purposefully because there is no way to return an error via the MPIR interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The subprocess objects are destroyed in completion_cb()
during normal termination, and I notice there is a missing flux_subprocess_destroy()
in the state_cb
as well when the subprocess fails. And yeah, since there is no way or need to traverse the list of launched subprocesses at normal program exit, it didn't seem necessary to try to keep this list internally and then somehow do something at exit (maybe with an at_exit
handler?).
On abnormal termination of this command, one of the log_*_exit()
functions are usually called and everything is leaked.
@@ -15,6 +15,7 @@ | |||
#endif | |||
|
|||
#include <flux/core.h> | |||
#include <flux/optparse.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In "testsuite: switch mpir test program to optparse " commit message, s/--leade-rank/leader-rank/
Problem: Processes like `flux job attach` that use MPIR_Procdesc information may need to know the broker rank information for tasks in addition to or instead of hostnames. This information could be provided by the shell mpir procdesc response so just add it there.
Problem: There are no unit tests for the new proctable 'rank' member. Add tests to the proctable unit test.
Problem: The MPIR Process Acquisition Interface[1] Version 1.1 describes an optional tool daemon launch extension in Section 7.1, but flux-job attach does not support this extension. Support optional launch of tool daemons via MPIR_executable_path and MPIR_server_arguments as described in the MPIR spec. 1. https://www.mpi-forum.org/docs/mpir-specification-03-01-2018.pdf
Problem: The t/shell/mpir.c test program uses free arguments to pass the leader shell rank and service name to the test, but this is not flexible enough for future use of this program. Add `-r, --leader-rank=N` and `-s, --service=NAME` options to the test program, and drop the use of free arguments. Update the affected sharness test.
Problem: The shell/mpir test program only tests the shell proctable RPC, but does not support MPIR tool launch extensions as is supported by flux-job attach. Link t/shell/mpir with flux-job attach mpir code to share MPIR_proctable generation and also new MPIR tool launch extensions. Add a new option `-t, --tool-launch` to the test program for exercising these extensions.
Problem: Use of tab is standard in sharness tests, but t2610-job-shell-mpir.t uses 4 spaces instead. Replace spaces with tab where appropriate.
Problem: There are no tests of flux-job attach support for the MPIR tool daemon launch extension. Utilize the support in t/shell/mpir to drive testing of the code that supports MPIR_executable_path and MPIR_server_arguments.
Rebased on master. Use shell rexec service to launch tool daemons. |
This is approved, but I think there's an open question about how the subprocess objects are managed within the mpir module of Also, while the code itself is tested via the testsuite, there is no real-world debugger or tool that uses Therefore, I'm unsure whether this PR should be merged, or just closed and resurrected when there is a real need. Since the code is only activated when a debugger or tool has attached to |
It would be a shame to lose this work in case it is needed later (for example if DDT uses it). My vote is we merge it. Later, we should perhaps find somewhere to document the specific "tool support" features we have in flux-core so future people wanting to port a tool have someplace to start. |
Great idea! Maybe the debugging docs in flux-docs should be moved into flux-core and the MPIR support mechanism details added to that section. Ok, I'll set MWP here since there's low probability of anything bad happening as a result. |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #5758 +/- ##
==========================================
+ Coverage 83.50% 83.52% +0.01%
==========================================
Files 509 509
Lines 83703 83842 +139
==========================================
+ Hits 69897 70025 +128
- Misses 13806 13817 +11
|
This PR adds the minimum support for the MPIR tool daemon launch extensions documented in The MPIR Process Acquisition Interface Version 1.1 , Section 7.1. It also splits the MPIR code in
src/cmd/job/attach.c
into its own source file so that the implementation can be tested with the "shell" mpir test utility int/shell/mpir.c
.Currently the tool specified in
MPIR_executable_path
andMPIR_server_arguments
is launched with libsubprocess. We depend on the disconnect handler to clean up stray tool processes ifflux job attach
is terminated or otherwise exits early.This is currently a WIP because it adds a list of task/broker rank pairs to the shell
proctable
response since this was the simplest approach for now, but there's probably a more efficient way.This PR is based on top of #5756