-
@ltalirz in LSMO slack:
-
Just adding the case for CSCS and a possible solution there. They have the node as the minimal resource unit, and if you want to submit multiple jobs to a single node (or group of nodes) you can use the GREASY meta-scheduler. In principle, the idea is that you prepare an input file with all your "tasks" and submit a single SLURM script that runs GREASY, which then reads this file and manages its own queue inside the nodes (until all tasks finish or you run out of the time allocated by SLURM). This would still require some other pre-scheduler (like the FLUX one mentioned above) that gathers the jobs, prepares the GREASY inputs, and submits everything to SLURM. Now, apparently GREASY can adapt to changes in its input file, so if you add new tasks they should be recognized and GREASY should start to process them. This allows adding new submissions to an existing instance of GREASY running on a SLURM cluster. Together with parsing its logfile to get information about the runs, I believe it would be possible to create a GREASY-SLURM plugin along these lines (see the sketch below).
I think this solution could be simpler than coding a whole different scheduler that needs to be run and maintained separately (although a bit less versatile, since it relies on the task list file for GREASY being updatable, which I actually haven't verified yet, so I'm not sure if there is any other "gotcha" I might be missing).
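For concreteness, here is a minimal sketch of that pattern, assuming GREASY is provided as a module on the cluster; the module name, task file, resource numbers, and `run_calculation.sh` script are all illustrative:

```bash
#!/bin/bash -l
#SBATCH --job-name=greasy-farm
#SBATCH --nodes=1
#SBATCH --ntasks=32        # workers available to GREASY inside the allocation
#SBATCH --time=04:00:00

# Task file: one shell command per line; GREASY schedules these onto the
# allocated resources until all tasks finish or the SLURM time runs out.
cat > tasks.txt << 'EOF'
./run_calculation.sh input_001
./run_calculation.sh input_002
./run_calculation.sh input_003
EOF

module load greasy         # site-specific; check the CSCS documentation
greasy tasks.txt           # GREASY manages its own queue within this job
```

The pre-scheduler described above would then be responsible for appending new lines to `tasks.txt` (if GREASY indeed picks up changes to it) and for parsing the GREASY logfile to track task status.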
-
I think we found a piece of software that acts as a meta-scheduler and, from preliminary tests, seems to do essentially everything we need: HyperQueue.
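As a rough sketch of how HyperQueue is driven from the command line (the exact flag spellings and subcommands should be double-checked against the HyperQueue documentation; partition and script names are illustrative):

```bash
# Start the HyperQueue server on the login node.
hq server start &

# Automatic allocation: let HyperQueue request SLURM allocations on demand;
# arguments after "--" are passed through to sbatch.
hq alloc add slurm --time-limit=1h -- --partition=normal

# Submit individual tasks; HyperQueue packs them into the allocations.
hq submit --cpus=1 ./run_calculation.sh input_001
hq submit --cpus=1 ./run_calculation.sh input_002

# Inspect progress.
hq job list
```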
What we don't know yet is how it behaves under heavy stress; we'll find out once the AiiDA plugin is ready. One note on running multiple runs inside the same job on Eiger at CSCS:
Note that you need to do both things (only one will not be enough); a generic sketch of the pattern is shown below. In addition, we can also start collecting feedback to report to the HyperQueue developers.
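As a generic illustration of packing several concurrent job steps into one SLURM allocation (the exact flags required on Eiger may differ; `--exact` and the resource numbers here are assumptions to verify against the CSCS documentation):

```bash
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=01:00:00

# Launch several job steps concurrently inside one allocation; --exact
# restricts each step to exactly the resources it requests, so the
# steps do not block one another.
for i in 1 2 3 4; do
    srun --exact -n 1 ./run_calculation.sh "input_$i" &
done
wait   # keep the batch script alive until all steps finish
```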
-
By the way, just as a note for others reading this: there are at least two different goals for task farming: (1) filling up whole nodes when the scheduler allocates nodes exclusively, and (2) reducing the number of individual jobs the scheduler has to handle.
Goal 1 becomes irrelevant when you have control over the SLURM instance and can therefore just enable node sharing (e.g. in a cloud context), while goal 2 is relevant both at HPC centers and in the cloud. SLURM job arrays (leveraged by the scheduler plugin of aiida-dynamic-workflows) are a great way to address goal 2 [1] for the SLURM scheduler without needing to manage any additional software; a minimal job-array script is sketched below. HyperQueue is an additional tool that needs to be managed, but it can address both goals. [1] Strictly speaking, the individual members of the array job share many aspects of "primary" jobs, but apparently it is much easier for SLURM to schedule them.
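For reference, a minimal job-array script of the kind such a scheduler plugin could submit; the array size and the `run_calculation.sh` script are illustrative:

```bash
#!/bin/bash -l
#SBATCH --job-name=task-farm
#SBATCH --array=0-99        # 100 array members submitted as one logical job
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# Each array member selects its own input via SLURM_ARRAY_TASK_ID,
# so a single submission covers many independent tasks (goal 2).
srun ./run_calculation.sh "input_${SLURM_ARRAY_TASK_ID}"
```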
-
In a generic sense, this thread is about approaches to bundling multiple jobs into a single scheduler allocation.
Since the topic has come up repeatedly over the last couple of weeks, I'll try to collect relevant input from different sources here so that everyone has access.
**Status quo (2021-08)**
from @sphuber
**Possible ways forward**
Pointers to information on the topic:
- Section 10 of the report on the AiiDA hackathon from February 2020, involving @pzarabadip
- @giovannipizzi mentions (2021-08)
- aiida-fireworks-scheduler plugin mentions (2021-08), cc @jbweston