This repository demonstrates/tests hyperparameter optimization with the following frameworks:
- Ray Tune as the parent framework and to start jobs with SLURM
- Optuna to suggest the hyperparameters
- Wandb (Weights & Biases) to log and visualize the results
Note: If you want to see this tech stack in an actual use case, see the GNN tracking Hyperparameter Optimization repository.
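To make the division of labor concrete, here is a minimal, hypothetical sketch of how the three pieces fit together. This is not the actual `src/rtstest/dothetune.py`; import paths (e.g. `ray.tune.search.optuna`, `ray.air.integrations.wandb`) and the reporting API differ between Ray versions, and the project name is made up.

```python
from ray import tune
from ray.tune.search.optuna import OptunaSearch
from ray.air.integrations.wandb import WandbLoggerCallback


def trainable(config):
    # Dummy objective: report a single "loss" value for the suggested config.
    loss = (config["x"] - 2) ** 2
    tune.report(loss=loss)


tune.run(
    trainable,
    config={"x": tune.uniform(-10, 10)},  # search space
    search_alg=OptunaSearch(),            # Optuna suggests the hyperparameters
    metric="loss",
    mode="min",
    num_samples=20,
    callbacks=[WandbLoggerCallback(project="my-hpo-project")],  # assumed W&B project name
)
```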
Use the conda environment, **then** `pip install` the package.
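A typical sequence might look like the following; the environment file and environment name are assumptions, so check the repository for the actual ones.

```bash
conda env create -f environment.yml  # assumed file name
conda activate rtstest               # assumed environment name
pip install -e .                     # install this package into the environment
```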
- First run `src/rtstest/dothetune.py` directly (no batch submission) so that the data file gets downloaded (the compute nodes have no internet connection).
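For example, directly on the login node (assuming the script is run as a plain Python program):

```bash
python src/rtstest/dothetune.py
```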
For a single batch job that uses multiple nodes to start both the head node and the workers, see `slurm/all-in-one` (a rough sketch of this pattern follows below). While this is the example used in the Ray documentation, it might not be the best fit for most use cases, as it relies on enough nodes being available at the same time, and for long enough to complete all requested trials.
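For reference, such an all-in-one job roughly follows the pattern from the Ray SLURM documentation. The sketch below is an illustration of that pattern, not the actual `slurm/all-in-one` script; the resource requests, port, and sleep times are placeholders.

```bash
#!/bin/bash
#SBATCH --job-name=ray-all-in-one
#SBATCH --nodes=3
#SBATCH --tasks-per-node=1

# The first allocated node becomes the Ray head, the remaining ones become workers.
nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
head_node=${nodes[0]}
head_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_ip" --port=6379 --block &
sleep 10

for node in "${nodes[@]:1}"; do
    srun --nodes=1 --ntasks=1 -w "$node" \
        ray start --address="$head_ip:6379" --block &
done
sleep 10

# The tuning script then connects to the running cluster (e.g. ray.init(address="auto")).
python src/rtstest/dothetune.py
```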
Because the compute nodes usually do not have internet access, we need a separate tool to sync the results to Weights & Biases. See the documentation of wandb-osh for how to start the syncer on the head node.
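Assuming the `wandb-osh` console script that the package installs, this boils down to keeping the syncer running on the login node (e.g. in a tmux session) while the runs themselves use wandb in offline mode; see the wandb-osh documentation for the exact invocation and options.

```bash
# Run on the head/login node, which has internet access; it watches for sync
# requests triggered by the (offline) wandb runs on the compute nodes.
wandb-osh
```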
Here, we start the Ray head on the head (login) node and then use batch submission to start worker nodes asynchronously. Follow these steps:
- Run `slurm/head_workers/start-on-headnode.sh` and note down the IP and the Redis password that are printed out.
- Submit several batch jobs: `sbatch slurm/head_workers/start-on-worker.slurm <IP> <REDIS PWD>`
- Start your tuning script on the head node: `slurm/head_workers/start-program.sh <IP> <REDIS PWD>` (see the sketch below for what connecting to the cluster might look like)
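What `start-program.sh` presumably does is attach the tuning script to the already running cluster before starting the tuning itself. A hypothetical Python-side sketch, for older Ray versions that still use a Redis password (newer versions dropped it) and assuming the default port 6379:

```python
import sys

import ray

head_ip, redis_pwd = sys.argv[1], sys.argv[2]

# Attach to the existing head node instead of starting a local Ray instance.
ray.init(address=f"{head_ip}:6379", _redis_password=redis_pwd)

# ... then run the tuning as usual, e.g. tune.run(trainable, ...); trials get
# scheduled onto the workers as soon as their batch jobs connect to the head node.
```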
Note: In the HPO scripts of my main ML project, I instead write the IP and password to files in my home directory and have the dependent scripts read them from there, rather than passing them around on the command line (a sketch of this follows below).
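A hypothetical version of that file-based handoff (the file names are made up for illustration):

```bash
# In start-on-headnode.sh, after starting the head:
echo "$head_ip"   > ~/.ray_head_ip
echo "$redis_pwd" > ~/.ray_redis_password

# In the worker job script and in start-program.sh:
head_ip=$(cat ~/.ray_head_ip)
redis_pwd=$(cat ~/.ray_redis_password)
```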
Once the batch jobs for the workers start running, you should see activity in the tuning script output.