Add resource manager support for Flux #798
Signed-off-by: Chen Wang <wangvsa@gmail.com>
util/unifyfs/src/unifyfs-rm.c
pclose(pipe_fp);

// remove the trailing ']'
nodelist_str[strlen(nodelist_str)-1] = 0;
Is the trailing ']' still here when only allocating one node?
I think it's missing, actually:
>>: flux alloc -q pdebug -N 1
>>: flux resource list --states=free -no '{nodelist}'
tioga20
If you want to avoid parsing directly, another option is to rely on /bin/hostlist.
>>: /bin/hostlist -e 'tioga[18-19,21,32]'
tioga18,tioga19,tioga21,tioga32
>>: /bin/hostlist -e 'tioga20'
tioga20
Though I think that command might just be available on LLNL systems and is not distributed with flux.
I think I've seen that flux has some python packages that parse hostlists, too. I can dig that up if you're interested.
I'm also fine with parsing the hostlist directly.
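For reference, parsing the hostlist directly could look roughly like the sketch below. It is not the code from this PR: expand_hostlist is a hypothetical helper, and it assumes a single bracket group with no zero-padded indices. It reproduces the expansion that /bin/hostlist -e prints in the examples above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of the PR): expand a condensed hostlist such
 * as "tioga[18-19,21,32]" into "tioga18,tioga19,tioga21,tioga32".
 * Assumes one bracket group and indices without zero padding.
 * Returns 0 on success, -1 on malformed input or if `out` is too small. */
int expand_hostlist(const char* in, char* out, size_t out_sz)
{
    if (out_sz == 0) {
        return -1;
    }

    const char* open = strchr(in, '[');
    if (open == NULL) {
        /* single node, e.g. "tioga20": nothing to expand */
        int n = snprintf(out, out_sz, "%s", in);
        return (n < 0 || (size_t)n >= out_sz) ? -1 : 0;
    }

    const char* close = strchr(open, ']');
    if (close == NULL) {
        return -1;
    }

    size_t prefix_len = (size_t)(open - in);
    size_t used = 0;
    const char* p = open + 1;
    out[0] = '\0';

    while (p < close) {
        char* end;
        long lo = strtol(p, &end, 10);   /* skips leading spaces */
        if (end == p) {
            return -1;                   /* no digits where expected */
        }
        long hi = lo;
        p = end;
        if (*p == '-') {                 /* range such as 18-19 */
            p++;
            hi = strtol(p, &end, 10);
            if (end == p) {
                return -1;
            }
            p = end;
        }
        if (hi < lo) {
            return -1;
        }
        for (long i = lo; i <= hi; i++) {
            int n = snprintf(out + used, out_sz - used, "%s%.*s%ld",
                             used ? "," : "", (int)prefix_len, in, i);
            if (n < 0 || (size_t)n >= out_sz - used) {
                return -1;               /* output buffer too small */
            }
            used += (size_t)n;
        }
        while (*p == ',' || *p == ' ') { /* skip separators */
            p++;
        }
    }
    return 0;
}
```

A name without brackets is copied through unchanged, which covers the single-node case shown earlier (tioga20).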
It turns out the last character of the buffer returned from fgets is the \n that ends the line (unless EOF is reached first). So my code was actually removing that instead of the ], and it just happens that the rest of the code still works fine even with ] in the buffer. Anyway, I have fixed this, and I now remove the last two characters altogether.
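Since the single-node output shown earlier (tioga20) has no closing bracket, a trim that checks each character before removing it may be safer than dropping a fixed count. The following is only a sketch under that assumption; trim_nodelist is a hypothetical helper, not the function used in the PR.

```c
#include <string.h>

/* Hypothetical helper (not part of the PR): clean up one line read by
 * fgets() from the `flux resource list` pipe. */
void trim_nodelist(char* s)
{
    /* fgets() keeps the newline when one was read; cut the string there.
     * strcspn() returns the string length if no newline is present, so this
     * is a no-op for input that ended at EOF without a newline. */
    s[strcspn(s, "\r\n")] = '\0';

    /* Remove the closing bracket only if it is really there, so that a
     * single-node result such as "tioga20" keeps its last digit. */
    size_t len = strlen(s);
    if (len > 0 && s[len - 1] == ']') {
        s[len - 1] = '\0';
    }
}
```

With this approach the same code path handles both "tioga[18-19,21,32]\n" and "tioga20\n".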
Signed-off-by: Chen Wang <wangvsa@gmail.com>
Thanks, @wangvsa
Description
Tioga uses Flux to schedule jobs and has limited support for srun. Also, Tioga does not have the scontrol command, which we use to retrieve the allocated node list.
This PR includes native support for Flux. It uses flux run to run clients and servers and uses flux resource to retrieve the number of nodes and the node list.
PS: flux resource returns a condensed node list, e.g., tioga[3-10, 12, 14]. The existing parse_hostfile() function can't handle this format, so I added some code to parse it manually.
How Has This Been Tested?
Tested on Tioga with 1, 2, and 4 nodes. Also tested unifyfs-ls, unifyfs-stage, and the stage-in/out features.
Types of changes
Checklist:
TODO
Unlike Slurm, where SLURM_JOBID can be used to determine a Slurm allocation, Flux only sets environment variables such as FLUX_JOB_ID for each flux run job (a Flux job is similar to a Slurm step). At the time of executing unifyfs (batch level), those variables have not been set yet.
A short flux script example:
As a result, I currently use FLUX_EXEC_PATH to determine whether the system has the Flux scheduler. I feel this is not optimal, but I couldn't figure out a better way.
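The detection described here could be sketched as follows. The enum, the function names, and the order of the checks (Slurm first, then Flux) are assumptions for illustration; only the two environment variable names come from the description above.

```c
#include <stdlib.h>

/* Hypothetical type and names (not from the PR). */
typedef enum { RM_UNKNOWN, RM_SLURM, RM_FLUX } rm_kind;

/* Pure decision logic, kept separate from getenv() so it is easy to test.
 * Assumption: a Slurm allocation takes precedence when both are present. */
rm_kind classify_rm(const char* slurm_jobid, const char* flux_exec_path)
{
    if (slurm_jobid != NULL && slurm_jobid[0] != '\0') {
        return RM_SLURM;
    }
    /* FLUX_JOB_ID is not set yet at batch level, so fall back to
     * FLUX_EXEC_PATH as a hint that Flux is in use. */
    if (flux_exec_path != NULL && flux_exec_path[0] != '\0') {
        return RM_FLUX;
    }
    return RM_UNKNOWN;
}

/* Reads the environment of the current process. */
rm_kind detect_rm(void)
{
    return classify_rm(getenv("SLURM_JOBID"), getenv("FLUX_EXEC_PATH"));
}
```

Keeping the check in one small function would make it straightforward to swap in a better signal later if one turns up.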