-
@ltalirz in LSMO slack:
-
Just adding the case for CSCS and a possible solution there. They have the node as the minimal resource unit, and if you want to submit multiple jobs to a single node (or group of nodes) you can use the GREASY meta-scheduler. In principle, the idea is that you prepare an input file with all your "tasks" and submit a single SLURM script that runs GREASY, which then reads this file and manages its own queue inside the nodes (until all tasks finish or you run out of the time allocated by SLURM). This would still require some other pre-scheduler (like the FLUX one mentioned above) that gathers the jobs, prepares the GREASY inputs, and submits everything to SLURM. Now, apparently GREASY can adapt to changes in its input file, so if you add new tasks they should be recognized and GREASY should start to process them. This allows adding new submissions to an existing instance of GREASY running on a SLURM cluster. Together with parsing its logfile to get information about the runs, I believe it would be possible to create a GREASY-SLURM plugin along these lines (see the sketch below).
I think this solution could be simpler than coding a whole different scheduler that needs to be run and maintained separately (although a bit less versatile, since it relies on the task list file for GREASY being updatable, which I actually haven't verified yet, so I'm not sure if there is any other "gotcha" I might be missing).
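For concreteness, here is a minimal sketch of that pattern, assuming GREASY is provided as a module on the cluster; the module name, task file, resource numbers, and `run_calculation.sh` script are all illustrative:

```bash
#!/bin/bash -l
#SBATCH --job-name=greasy-farm
#SBATCH --nodes=1
#SBATCH --ntasks=32        # workers available to GREASY inside the allocation
#SBATCH --time=04:00:00

# Task file: one shell command per line; GREASY schedules these onto the
# allocated resources until all tasks finish or the SLURM time runs out.
cat > tasks.txt << 'EOF'
./run_calculation.sh input_001
./run_calculation.sh input_002
./run_calculation.sh input_003
EOF

module load greasy         # site-specific; check the CSCS documentation
greasy tasks.txt           # GREASY manages its own queue within this job
```

The pre-scheduler described above would then be responsible for appending new lines to `tasks.txt` (if GREASY indeed picks up changes to it) and for parsing the GREASY logfile to track task status.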
-
I think we found a piece of software that acts as a meta-scheduler and, from preliminary tests, seems to do essentially everything we need: HyperQueue.
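As a rough sketch of how HyperQueue is driven from the command line (the exact flag spellings and subcommands should be double-checked against the HyperQueue documentation; partition and script names are illustrative):

```bash
# Start the HyperQueue server on the login node.
hq server start &

# Automatic allocation: let HyperQueue request SLURM allocations on demand;
# arguments after "--" are passed through to sbatch.
hq alloc add slurm --time-limit=1h -- --partition=normal

# Submit individual tasks; HyperQueue packs them into the allocations.
hq submit --cpus=1 ./run_calculation.sh input_001
hq submit --cpus=1 ./run_calculation.sh input_002

# Inspect progress.
hq job list
```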
What we don't know yet is how it behaves under heavy stress; we'll find out once the AiiDA plugin is ready. One note on running multiple runs inside the same job on Eiger at CSCS:
Note that you need to do both things (only one will not be enough); a generic sketch of the pattern is shown below. In addition, we can also start collecting feedback to report to the HyperQueue developers.
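As a generic illustration of packing several concurrent job steps into one SLURM allocation (the exact flags required on Eiger may differ; `--exact` and the resource numbers here are assumptions to verify against the CSCS documentation):

```bash
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=01:00:00

# Launch several job steps concurrently inside one allocation; --exact
# restricts each step to exactly the resources it requests, so the
# steps do not block one another.
for i in 1 2 3 4; do
    srun --exact -n 1 ./run_calculation.sh "input_$i" &
done
wait   # keep the batch script alive until all steps finish
```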
-
By the way, just as a note for others reading this: there are at least two different goals for task farming: (1) filling up whole nodes when the scheduler allocates nodes exclusively, and (2) reducing the number of individual jobs the scheduler has to handle.
Goal 1 becomes irrelevant when you have control over the SLURM instance and can therefore just enable node sharing (e.g. in a cloud context), while goal 2 is relevant both at HPC centers and in the cloud. SLURM job arrays (leveraged by the scheduler plugin of aiida-dynamic-workflows) are a great way to address goal 2 [1] for the SLURM scheduler without needing to manage any additional software; a minimal job-array script is sketched below. HyperQueue is an additional tool that needs to be managed, but it can address both goals. [1] Strictly speaking, the individual members of the array job share many aspects of "primary" jobs, but apparently it is much easier for SLURM to schedule them.
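For reference, a minimal job-array script of the kind such a scheduler plugin could submit; the array size and the `run_calculation.sh` script are illustrative:

```bash
#!/bin/bash -l
#SBATCH --job-name=task-farm
#SBATCH --array=0-99        # 100 array members submitted as one logical job
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# Each array member selects its own input via SLURM_ARRAY_TASK_ID,
# so a single submission covers many independent tasks (goal 2).
srun ./run_calculation.sh "input_${SLURM_ARRAY_TASK_ID}"
```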
-
In a generic sense, this thread is about approaches to bundling multiple jobs into a single scheduler allocation.
Since the topic has come up repeatedly over the last couple of weeks, I'll try to collect relevant input from different sources here so that everyone has access.
**Status quo (2021-08)**
from @sphuber
**Possible ways forward**
Pointers to information on the topic:
- Section 10 of the report on the AiiDA hackathon from February 2020, involving @pzarabadip
- @giovannipizzi mentions (2021-08)
- aiida-fireworks-scheduler plugin mentions (2021-08), cc @jbweston