Install tasks via SLURM #686

Open · Tracked by #697
tcompa opened this issue May 11, 2023 · 4 comments

@tcompa (Collaborator) commented May 11, 2023

Refs:

After reviewing these issues, we (@mfranzon and I) propose to switch to another way of collecting tasks (which was already mentioned in the past), namely one where the venv-related commands (venv creation and pip commands) are executed on the same machine where the tasks will be executed.

Briefly, task collection would have three phases:

  1. Preliminary checks and configuration (as part of fractal-server)
  2. venv/pip commands
  3. Final checks and db operations (as part of fractal-server)

Steps 1 and 3 should remain very similar to what they are now.
Step 2 should be heavily refactored. Right now it consists of a series of subprocess.run commands, with their I/O handled in Python. All these commands are executed on the machine where fractal-server runs, which is clearly a problem (see the issues above, plus possible incompatibilities in system libraries).
In the future version, step 2 will be transformed into a bash script, similar to the following prototype:

#!/bin/bash

# Variables to be filled in by fractal-server
PYTHON=python3
FRACTAL_TASKS_DIR=/tmp/artifacts
PACKAGE=devtools
VERSION=0.1
USER=fractal
EXTRAS="[]"

PKG_ENV_DIR=${FRACTAL_TASKS_DIR}/.${USER}/${PACKAGE}${VERSION}
VENVPYTHON=${PKG_ENV_DIR}/bin/python
LOGFILE=${PKG_ENV_DIR}/collection.log
SUCCESSFILE=${PKG_ENV_DIR}/pkg_location.log

# Create the package directory and the venv
mkdir -p ${PKG_ENV_DIR}
$PYTHON -m venv ${PKG_ENV_DIR} --copies >> $LOGFILE 2>&1

# Update pip
$VENVPYTHON -m pip install pip --upgrade >> $LOGFILE 2>&1

# Install package
$VENVPYTHON -m pip install "${PACKAGE}${EXTRAS}==${VERSION}" >> $LOGFILE 2>&1

# Show package info and record its on-disk location
$VENVPYTHON -m pip show ${PACKAGE} >> $LOGFILE 2>&1
$VENVPYTHON -m pip show ${PACKAGE} | grep "Location: " > $SUCCESSFILE

And, most importantly, this script will be executed via FractalSlurmExecutor (for the slurm backend), or via a standard ThreadPoolExecutor (for the local backend). Thus a SLURM job will install the tasks while executing on a SLURM node.
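As a rough illustration of this submission flow (not the actual fractal-server code; the script path below is a placeholder and the SLURM branch is only hinted at in a comment), the local-backend case could look like:

import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_collection_script(script_path: str) -> int:
    # Run the bash prototype above and return its exit code
    result = subprocess.run(
        ["/bin/bash", script_path], capture_output=True, text=True
    )
    return result.returncode

# Local backend: a plain ThreadPoolExecutor is enough
with ThreadPoolExecutor() as executor:
    future = executor.submit(run_collection_script, "/path/to/collect_task.sh")
    exit_code = future.result()

# SLURM backend: the same submit(...) call would go through FractalSlurmExecutor,
# so that the script runs within a SLURM job on a compute node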

Some notes:

  1. Relevant paths will still have to be on a shared filesystem, because both the server and all users need access to them.
  2. This change will remove the current fail-fast behavior, where e.g. an invalid-manifest error leads to an early failure (i.e. the server does not first install a heavy package only to fail afterwards). That's an acceptable trade-off, in our opinion.
  3. Once this is implemented and tested, issue #556 ("[FMI deployment] task environment python executable simlink to server python leads to issues") should be automatically fixed.
  4. As per #659 ("To discuss: Should we drop the python_version flexibility from task collection?"), with this refactor we could easily switch to a situation where the SLURM configuration file points to several Python paths (rather than the single one in FRACTAL_SLURM_WORKER_PYTHON), one per supported version. For instance, the Fractal admin could include something like the mapping below; see the lookup sketch right after it.
{
  "python3.9": "/usr/bin/python3.9",
  "python3.10": "/data/homes/fractal/miniconda3/env/py310/bin/python"
}
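
A hypothetical lookup on the fractal-server side could then be as simple as the following (variable and function names are illustrative only, not an existing API):

import json

# Mapping loaded from the SLURM configuration file (content as in the example above)
WORKER_PYTHON_PATHS = json.loads(
    '{"python3.9": "/usr/bin/python3.9",'
    ' "python3.10": "/data/homes/fractal/miniconda3/env/py310/bin/python"}'
)

def resolve_worker_python(python_version: str) -> str:
    # Return the interpreter configured for the requested version, e.g. "3.10"
    try:
        return WORKER_PYTHON_PATHS[f"python{python_version}"]
    except KeyError:
        raise ValueError(f"No interpreter configured for Python {python_version}")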

This is to be reviewed and re-discussed together, but we think that the current task collection is wrong, and it just works by accident (mainly because the server machine is very similar to the cluster nodes).

@jluethi (Collaborator) commented May 11, 2023

I generally like the idea a lot; I'll review it in more detail next week. One first comment:
We’ll need to figure out which user runs this on the slurm cluster. At FMI, I’m not sure whether the Fractal user can submit to the cluster atm (to be tested).
And if we consider user-specific tasks, should those be installed by the users themselves or by the fractal user?

@jluethi (Collaborator) commented May 16, 2023

Just to quickly confirm: The fractal user at FMI does not have access to submit jobs to the cluster.
I could make my own user admin as well though (or create an admin user that submits as my slurm user) and we could install tasks via that user.

@tcompa (Collaborator, Author) commented Nov 12, 2024

As of version 2.9.0 (in progress), the task lifecycle features are organized roughly as follows:

fractal_server/tasks/v2/
├── local
│   ├── collect.py
│   ├── deactivate.py
│   └── reactivate.py
├── ssh
│   ├── collect.py
│   ├── deactivate.py
│   └── reactivate.py

where the ssh and local modules are quite homogeneous in structure. In principle we could expose an abstraction for each action (collect, deactivate, reactivate), which would then take a specific form for the local or ssh version. However, the next step in this area will be to cover a third scenario (the SLURM-based one), which will not fit into this scheme; thus we are keeping all modalities separate.
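
For reference, the abstraction mentioned above (which we are deliberately not adopting for now) could look roughly like the following; the interface and argument names are purely illustrative:

from typing import Protocol

class TaskGroupLifecycle(Protocol):
    # Hypothetical common interface for the local/ssh (and later slurm) modules
    def collect(self, activity_id: int) -> None: ...
    def deactivate(self, activity_id: int) -> None: ...
    def reactivate(self, activity_id: int) -> None: ...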


That said, we need to define the feature of "run task collection as a SLURM job" better.

The main questions are:

  1. Which user runs the SLURM job? The obvious options:
    • The fractal user who is running fractal-server. This makes things considerably simpler, since there is less variability (a single user, always writing to the same folder). As a requirement, this user must have access to the cluster, which is not always the case in our current fractal-server deployments.
    • The individual Fractal user who made the task-collection API call. This is clearly more complex, at least in two directions:
      • When we run task collection, we need to know/check a few more variables (user-specific ones, like the slurm_user and the folder where the task collection should go).
      • When we use tasks, we are assuming that the permissions the user chose (e.g. making the task available to a certain user group) reflect the on-disk permissions to the task files.
  2. Based on the answer in 1, we should define the folder (or folders) where the SLURM-based task collection will operate, and clarify who has access to it/them (notably: does the fractal user have read-write access to these folders, in case they are not fractal-owned?)
  3. On the technical side, and depending on the answers to 1 and 2, we'll also decide whether to re-use the existing SLURM executor (with all its complexity and flexibility) or to implement something much smaller here. If we go with the latter, a few very relevant simplifications would become available:
    • Since no parallelization is involved, we could avoid introducing a concurrent.futures.Executor Python interface.
    • We would define and submit the SLURM job directly as a bash/SLURM script, rather than passing through Python and clusterfutures (see the sketch after this list).
    • Depending on answer 2 above, we could make simpler assumptions about the access to the tasks folder (e.g. as in "fractal has read/write access to that folder"), which would remove or reduce the sudo-related complexity.
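To make the "much smaller" option in point 3 more concrete, here is a rough sketch (not an implementation decision; paths and SBATCH options are placeholders) of submitting the collection script directly via sbatch, with no Executor interface and no clusterfutures:

import subprocess
from pathlib import Path

SBATCH_HEADER = """#!/bin/bash
#SBATCH --job-name=fractal-task-collection
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --output=/path/to/tasks_dir/collection_slurm.out
"""

def submit_collection_job(collection_script: str, script_dir: str) -> str:
    # Wrap the bash collection script into a batch script and submit it
    batch_script = Path(script_dir) / "collect_task.sbatch"
    batch_script.write_text(SBATCH_HEADER + f"\nbash {collection_script}\n")
    # `sbatch --parsable` prints only the job ID, which can later be polled via squeue/sacct
    result = subprocess.run(
        ["sbatch", "--parsable", str(batch_script)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()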

Side note: the pattern which we choose here (re: SLURM user and relevant folders) will also be used for task-group deactivation and reactivation, on top of task collection.

We first need to review the main questions above (cc @jluethi), and then we can proceed with a first implementation. At first glance, the option of letting the fractal user run SLURM jobs seems the most appealing, but let's compare it with the real-life on-site requirements.

@jluethi (Collaborator) commented Nov 13, 2024

High-level: Let's not do this now, but take it for further discussion (e.g. with Enrico)


Some content brainstorming:

On 1:
Needs to be the user's slurm user.
Only alternative: We come up with a fractal_service_user (not the server user with sudo rights) that handles environment creation. This user wouldn't have any sudo access.

Quoting from above: "When we use tasks, we are assuming that the permissions the user chose (e.g. making the task available to a certain user group) reflect the on-disk permissions to the task files."

Yes. Can we make on-disk permissions for this just very broad? => everyone can execute, every user can write to this task folder.
If we go the service user direction, we do not need to give broad write access.

On 2:
/path/to/fractal/general_deployment/.FRACTAL_TASKS/
Give everyone read, write & execute to this folder
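
As a sketch of this broad-permissions option (whether it is acceptable is exactly the point to discuss; the path is the placeholder from above):

import os
import stat

TASKS_DIR = "/path/to/fractal/general_deployment/.FRACTAL_TASKS"

# rwx for owner, group and others, i.e. mode 0777 on the shared tasks folder
os.chmod(TASKS_DIR, stat.S_IRWXU | stat.S_IRWXG | stat.S_IRWXO)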

Open question: Does the server keep track of whl files separately?
