ADD: SearchAlgo Optuna w own search space (#27)
* add simple optuna

* update instructions

* expose max_conc_trials and fix points for optuna

* fix optuna points

* remove points to eval and add debug for dashboard

* simplify set-up on 1 node

* minor comment changes

* add own optuna search space

* small comment adjustment

* add last slurm script as reference
adamovanja authored Aug 16, 2024
1 parent 7c6061e commit 5644411
Showing 16 changed files with 582 additions and 506 deletions.
71 changes: 53 additions & 18 deletions README.md
@@ -15,48 +15,83 @@ conda activate ritme
make dev
```

## Model training locally
To train models with a defined configuration in `q2_ritme/config.json` run:
## Model training
The model configuration is defined in `q2_ritme/run_config.json`. If you want to parallelise the training of different model types, we recommend training each model type in a separate experiment. If you decide to run several model types in one experiment, be aware that they are trained sequentially, so the experiment will take longer to finish.

Once training has started, you can check the progress of the models in the tracking software you selected (see section #model-tracking).

To define a suitable model configuration, each variable in `q2_ritme/run_config.json` is described in the table below (an illustrative example configuration follows the table):

| Parameter | Description |
|-----------|-------------|
| experiment_tag | Name of the experiment. |
| host_id | Column name for unique host_id in the metadata. |
| target | Column name of the target variable in the metadata. |
| ls_model_types | List of model types to explore sequentially - options include "linreg", "trac", "xgb", "nn_reg", "nn_class", "nn_corn" and "rf". |
| models_to_evaluate_separately | List of models to evaluate separately during iterative learning - only possible for "xgb", "nn_reg", "nn_class" and "nn_corn". |
| num_trials | Total number of trials to run per model type: the larger this value, the more of the search space can be explored. |
| max_concurrent_trials | Maximum number of concurrently running trials. |
| path_to_ft | Path to the feature table file. |
| path_to_md | Path to the metadata file. |
| path_to_phylo | Path to the phylogenetic tree file. |
| path_to_tax | Path to the taxonomy file. |
| seed_data | Seed for data-related random operations. |
| seed_model | Seed for model-related random operations. |
| test_mode | Boolean flag to indicate if running in test mode. |
| tracking_uri | Which platform to use for experiment tracking: either "wandb" for WandB or "mlruns" for MLflow. See #model-tracking for set-up instructions. |
| train_size | Fraction of data to use for training (e.g., 0.8 for 80% train, 20% test split). |

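For orientation, a minimal sketch of what such a configuration could look like is shown below. All values (paths, target column, model lists, seeds) are placeholders for illustration only; the key names follow the table above, so cross-check them against the `q2_ritme/run_config.json` shipped with the package:
````
{
    "experiment_tag": "example_experiment",
    "host_id": "host_id",
    "target": "age_days",
    "ls_model_types": ["linreg", "rf", "xgb"],
    "models_to_evaluate_separately": ["xgb"],
    "num_trials": 10,
    "max_concurrent_trials": 5,
    "path_to_ft": "data/feature_table.tsv",
    "path_to_md": "data/metadata.tsv",
    "path_to_phylo": "data/phylo_tree.nwk",
    "path_to_tax": "data/taxonomy.tsv",
    "seed_data": 12,
    "seed_model": 12,
    "test_mode": false,
    "tracking_uri": "mlruns",
    "train_size": 0.8
}
````
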
### Local training
To train models locally with the configuration defined in `q2_ritme/config.json`, run:
````
python q2_ritme/run_n_eval_tune.py --config q2_ritme/run_config.json
./launch_local.sh q2_ritme/config.json
````

Once you have trained some models, you can check the progress of the trained models by launching `mlflow ui --backend-store-uri experiments/mlruns`.

To evaluate the best trial (trial < experiment) of all launched experiments, run:
To evaluate the best trial (trial < experiment) of all launched experiments locally, run:
````
python q2_ritme/eval_best_trial_overall.py --model_path "experiments/models"
````

## Model training on HPC with slurm:
Edit file `launch_slurm_cpu.sh` and then run
### Training with slurm on HPC
To train a model with slurm on 1 node, edit the file `launch_slurm_cpu.sh` and then run
````
sbatch launch_slurm_cpu.sh
````
If you (or your collaborators) plan to launch multiple jobs on the same infrastructure you should set the variable `JOB_NB` in `launch_slurm_cpu.sh` accordingly. This variable makes sure that the assigned ports don't overlap (see [here](https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html#slurm-networking-caveats)).

If you are using SLURM and get the following error returned: "RuntimeError: can't start new thread"
it is probably caused by your hardware. Try decreasing the CPUs allocated to the job and/or decrease the variable `max_concurrent_trials` in `tune_models.py`.
If you are using SLURM and your error message contains this: "The process is killed by SIGKILL by OOM killer due to high memory usage", you should increase the assigned memory per CPU (`--mem-per-cpu`).
To train a model with slurm on multiple nodes, or to run multiple Ray instances on the same HPC, you can use: `sbatch launch_slurm_cpu_multi_node.sh`. If you (or your collaborators) plan to launch multiple jobs on the same infrastructure, you should set the variable `JOB_NB` in `launch_slurm_cpu_multi_node.sh` accordingly. This variable makes sure that the assigned ports don't overlap (see [here](https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html#slurm-networking-caveats) and the excerpt below). Currently, the script allows up to 3 parallel Ray slurm jobs.
**Note:** training a model with slurm on multiple nodes can be very specific to your infrastructure, so you might need to adjust this bash script to your set-up.
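
For reference, all Ray ports in `launch_slurm_cpu_multi_node.sh` are derived from `JOB_NB`, which is why jobs with different `JOB_NB` values cannot collide. A condensed excerpt mirroring the arithmetic in the script:
````
JOB_NB=2                                    # allowed values: 1, 2, 3
port=$((6378 + JOB_NB))                     # Ray head port, e.g. 6380
dashboard_port=$((8265 + JOB_NB))           # e.g. 8267
min_worker_port=$((2 + JOB_NB * 10000))     # e.g. 20002
max_worker_port=$((9999 + JOB_NB * 10000))  # e.g. 29999
````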

#### Some common slurm errors:
If you are using SLURM and ...
* ... get the following error returned: "RuntimeError: can't start new thread", it is probably caused by thread limits of the cluster. You can try increasing the number of allowed threads with `ulimit -u` in `launch_slurm_cpu.sh` and/or decreasing the variable `max_concurrent_trials` in `q2_ritme/config.json`. In case neither helps, it might be worth contacting the cluster administrators.

* ... your error message contains "The process is killed by SIGKILL by OOM killer due to high memory usage", you should increase the assigned memory per CPU (`--mem-per-cpu`) in `launch_slurm_cpu.sh` (see the sketch below).
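
Both remedies are applied in `launch_slurm_cpu.sh`; a sketch of the relevant lines with example values (illustrative only, not tuned recommendations):
````
#SBATCH --mem-per-cpu=8192   # header section: raise memory per CPU (script default: 4096)

ulimit -u 120000             # user-settings block: raise the thread limit (script default: 60000)
````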

## Model tracking
In the config file you can choose to track your trials with MLflow (tracking_uri=="mlruns") or with WandB (tracking_uri=="wandb"). In case of using WandB you need to store your `WANDB_API_KEY` & `WANDB_ENTITY` as a environment variable in `.env`. Make sure to ignore this file in version control (add to `.gitignore`)!
In the config file you can choose to track your trials with MLflow (tracking_uri=="mlruns") or with WandB (tracking_uri=="wandb").

### MLflow
If you use MLflow, you can view your models with `mlflow ui --backend-store-uri experiments/mlruns`. For more information check out the [official MLflow documentation](https://mlflow.org/docs/latest/index.html).

### WandB
If you use WandB, you need to store your `WANDB_API_KEY` & `WANDB_ENTITY` as environment variables in `.env`. Make sure to exclude this file from version control (add it to `.gitignore`)!

The `WANDB_ENTITY` is the project name you would like to store the results in. For more information on this parameter see the official webpage from WandB initialization [here](https://docs.wandb.ai/ref/python/init).
The `WANDB_ENTITY` is the project name you would like to store the results in. For more information on this parameter see the official webpage for initializing WandB [here](https://docs.wandb.ai/ref/python/init).
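
A minimal `.env` sketch with placeholder values:
````
# .env - keep this file out of version control (add it to .gitignore)
# WANDB_ENTITY determines where the results are stored (see link above)
WANDB_API_KEY=your-api-key
WANDB_ENTITY=your-entity
````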

Also, if you are running WandB from an HPC, you might need to set the proxy variables to your respective proxy URLs by exporting:
````
export HTTPS_PROXY=http://proxy.example.com:8080
export HTTP_PROXY=http://proxy.example.com:8080
````
## Code test coverage
## Developer topics - to be removed prior to publication
### Code test coverage
To run test coverage with Code Gutters in VScode run:
````
pytest --cov=q2_ritme q2_ritme/tests/ --cov-report=xml:coverage.xml
````
## Call graphs
### Call graphs
To create a call graph for all functions in the package, run the following commands:
````
pip install pyan3==1.1.1
@@ -65,6 +100,6 @@ pyan3 q2_ritme/**/*.py --uses --no-defines --colored --grouped --annotated --svg
````
(Note: other options for creating call graphs, such as code2flow and snakeviz, were also tested. However, although properly maintained, they did not directly output call graphs the way pyan3 does.)
## Background
### Why ray tune?
### Background
#### Why ray tune?
"By using tuning libraries such as Ray Tune we can try out combinations of hyperparameters. Using sophisticated search strategies, these parameters can be selected so that they are likely to lead to good results (avoiding an expensive exhaustive search). Also, trials that do not perform well can be preemptively stopped to reduce waste of computing resources. Lastly, Ray Tune also takes care of training these runs in parallel, greatly increasing search speed." [source](https://docs.ray.io/en/latest/tune/examples/tune-xgboost.html#tune-xgboost-ref)
19 changes: 19 additions & 0 deletions launch_local.sh
@@ -0,0 +1,19 @@
#!/bin/bash

if [ $# -ne 1 ]; then
    echo "Usage: $0 <CONFIG>"
    exit 1
fi

CONFIG=$1
PORT=6379

if ! nc -z localhost $PORT; then
    echo "Starting Ray on port $PORT"
    ray start --head --port=$PORT --dashboard-port=0
else
    echo "Ray is already running on port $PORT"
fi

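# derive the output file name from the config file's basename (assumes a path of the form <dir>/<name>.json)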
OUTNAME=$(echo "$CONFIG" | sed -n 's/.*\/\([^/]*\)\.json/\1/p')
python q2_ritme/run_n_eval_tune.py --config "$CONFIG" > x_"$OUTNAME"_out.txt 2>&1
83 changes: 2 additions & 81 deletions launch_slurm_cpu.sh
@@ -3,8 +3,8 @@
#SBATCH --job-name="run_config"
#SBATCH -A partition_name
#SBATCH --nodes=1
#SBATCH --cpus-per-task=20
#SBATCH --time=23:59:59
#SBATCH --cpus-per-task=100
#SBATCH --time=119:59:59
#SBATCH --mem-per-cpu=4096
#SBATCH --output="%x_out.txt"
#SBATCH --open-mode=append
@@ -17,91 +17,12 @@ echo "SLURM_GPUS_PER_TASK: $SLURM_GPUS_PER_TASK"
# ! USER SETTINGS HERE
# -> config file to use
CONFIG="q2_ritme/run_config.json"
# -> count of this concurrent job launched on same infrastructure
# -> only these values are allowed: 1, 2, 3 - since below ports are
# -> otherwise taken or not allowed
JOB_NB=1

# if your number of threads is limited, increase as needed
ulimit -u 60000
ulimit -n 524288
# ! USER END __________

# __doc_head_address_start__
# script was edited from:
# https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html

# Getting the node names
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)

head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# if we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$head_node_ip" == *" "* ]]; then
IFS=' ' read -ra ADDR <<<"$head_node_ip"
if [[ ${#ADDR[0]} -gt 16 ]]; then
head_node_ip=${ADDR[1]}
else
head_node_ip=${ADDR[0]}
fi
echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi
# __doc_head_address_end__

# __doc_head_ray_start__
port=$((6378 + JOB_NB))
node_manager_port=$((6600 + JOB_NB * 100))
object_manager_port=$((6601 + JOB_NB * 100))
ray_client_server_port=$((1 + JOB_NB * 10000))
redis_shard_ports=$((6602 + JOB_NB * 100))
min_worker_port=$((2 + JOB_NB * 10000))
max_worker_port=$((9999 + JOB_NB * 10000))
dashboard_port=$((8265 + JOB_NB))

ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"

echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
ray start --head --node-ip-address="$head_node_ip" \
--port=$port \
--node-manager-port=$node_manager_port \
--object-manager-port=$object_manager_port \
--ray-client-server-port=$ray_client_server_port \
--redis-shard-ports=$redis_shard_ports \
--min-worker-port=$min_worker_port \
--max-worker-port=$max_worker_port \
--dashboard-port=$dashboard_port \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK:-0}" --block &
# __doc_head_ray_end__

# __doc_worker_ray_start__
# optional, though may be useful in certain versions of Ray < 1.0.
sleep 10

# number of nodes other than the head node
worker_num=$((SLURM_JOB_NUM_NODES - 1))

for ((i = 1; i <= worker_num; i++)); do
node_i=${nodes_array[$i]}
echo "Starting WORKER $i at $node_i"
srun --nodes=1 --ntasks=1 -w "$node_i" \
ray start --address "$ip_head" \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK:-0}" --block &
sleep 5
done
# __doc_worker_ray_end__

# Output the dashboard URL
dashboard_url="http://${head_node_ip}:${dashboard_port}"
export RAY_DASHBOARD_URL="$dashboard_url"
echo "Ray Dashboard URL: $RAY_DASHBOARD_URL"

# __doc_script_start__
python -u q2_ritme/run_n_eval_tune.py --config $CONFIG
sstat -j $SLURM_JOB_ID

110 changes: 110 additions & 0 deletions launch_slurm_cpu_multi_node.sh
@@ -0,0 +1,110 @@
#!/bin/bash

#SBATCH --job-name="run_config"
#SBATCH -A partition_name
#SBATCH --nodes=1
#SBATCH --cpus-per-task=100
#SBATCH --time=23:59:59
#SBATCH --mem-per-cpu=4096
#SBATCH --output="%x_out.txt"
#SBATCH --open-mode=append

set -x

echo "SLURM_CPUS_PER_TASK: $SLURM_CPUS_PER_TASK"
echo "SLURM_GPUS_PER_TASK: $SLURM_GPUS_PER_TASK"

# ! USER SETTINGS HERE
# -> config file to use
CONFIG="q2_ritme/run_config.json"
# -> count of this concurrent job launched on same infrastructure
# -> only these values are allowed: 1, 2, 3 - otherwise the ports derived
# -> below are already taken or not allowed
JOB_NB=2

# if your number of threads is limited, increase as needed
ulimit -u 60000
ulimit -n 524288
# ! USER END __________

# __doc_head_address_start__
# script was edited from:
# https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html

# Getting the node names
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)

head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# if we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$head_node_ip" == *" "* ]]; then
IFS=' ' read -ra ADDR <<<"$head_node_ip"
if [[ ${#ADDR[0]} -gt 16 ]]; then
head_node_ip=${ADDR[1]}
else
head_node_ip=${ADDR[0]}
fi
echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi
# __doc_head_address_end__

# __doc_head_ray_start__
port=$((6378 + JOB_NB))
node_manager_port=$((6600 + JOB_NB * 100))
object_manager_port=$((6601 + JOB_NB * 100))
ray_client_server_port=$((1 + JOB_NB * 10000))
redis_shard_ports=$((6602 + JOB_NB * 100))
min_worker_port=$((2 + JOB_NB * 10000))
max_worker_port=$((9999 + JOB_NB * 10000))
dashboard_port=$((8265 + JOB_NB))
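# note: with JOB_NB in {1, 2, 3} these derived ports do not overlap between jobs (e.g. JOB_NB=2 -> head port 6380, dashboard 8267, worker ports 20002-29999)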

ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"

echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
ray start --head --node-ip-address="$head_node_ip" \
--port=$port \
--node-manager-port=$node_manager_port \
--object-manager-port=$object_manager_port \
--ray-client-server-port=$ray_client_server_port \
--redis-shard-ports=$redis_shard_ports \
--min-worker-port=$min_worker_port \
--max-worker-port=$max_worker_port \
--dashboard-port=$dashboard_port \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK:-0}" --block &
# __doc_head_ray_end__

# __doc_worker_ray_start__
# optional, though may be useful in certain versions of Ray < 1.0.
sleep 10

# number of nodes other than the head node
worker_num=$((SLURM_JOB_NUM_NODES - 1))

for ((i = 1; i <= worker_num; i++)); do
node_i=${nodes_array[$i]}
echo "Starting WORKER $i at $node_i"
srun --nodes=1 --ntasks=1 -w "$node_i" \
ray start --address "$ip_head" \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK:-0}" --block &
sleep 5
done
# __doc_worker_ray_end__

# Output the dashboard URL
dashboard_url="http://${head_node_ip}:${dashboard_port}"
export RAY_DASHBOARD_URL="$dashboard_url"
echo "Ray Dashboard URL: $RAY_DASHBOARD_URL"

# __doc_script_start__
python -u q2_ritme/run_n_eval_tune.py --config $CONFIG
sstat -j $SLURM_JOB_ID

# get elapsed time of job
echo "TIME COUNTER:"
sacct -j $SLURM_JOB_ID --format=elapsed --allocations
31 changes: 31 additions & 0 deletions launch_slurm_cpu_own_ss.sh
@@ -0,0 +1,31 @@
#!/bin/bash

#SBATCH --job-name="r_optuna_own_ss_rf"
#SBATCH -A es_bokulich
#SBATCH --nodes=1
#SBATCH --cpus-per-task=100
#SBATCH --time=119:59:59
#SBATCH --mem-per-cpu=4096
#SBATCH --output="%x_out.txt"
#SBATCH --open-mode=append

set -x

echo "SLURM_CPUS_PER_TASK: $SLURM_CPUS_PER_TASK"
echo "SLURM_GPUS_PER_TASK: $SLURM_GPUS_PER_TASK"

# ! USER SETTINGS HERE
# -> config file to use
CONFIG="q2_ritme/r_optuna_own_ss_rf.json"

# if your number of threads is limited, increase as needed
ulimit -u 60000
ulimit -n 524288
# ! USER END __________

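# note: unlike launch_slurm_cpu_multi_node.sh there is no explicit `ray start` here; presumably run_n_eval_tune.py initializes Ray locally on this single node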
python -u q2_ritme/run_n_eval_tune.py --config $CONFIG
sstat -j $SLURM_JOB_ID

# get elapsed time of job
echo "TIME COUNTER:"
sacct -j $SLURM_JOB_ID --format=elapsed --allocations
5 changes: 1 addition & 4 deletions q2_ritme/evaluate_models.py
@@ -142,10 +142,7 @@ def select(self, data, split):
# assign self.train_selected_fts to be able to run select on test set later
train_selected = select_microbial_features(
data,
self.data_config["data_selection"],
self.data_config["data_selection_i"],
self.data_config["data_selection_q"],
self.data_config["data_selection_t"],
self.data_config,
ft_prefix,
)
self.train_selected_fts = train_selected.columns
