ADD: SearchAlgo Optuna w own search space (#27)
* add simple optuna

* update instructions

* expose max_conc_trials and fix points for optuna

* fix optuna points

* remove points to eval and add debug for dashboard

* simplify set-up on 1 node

* minor comment changes

* add own optuna search space

* small comment adjustment

* add last slurm script as reference
adamovanja authored Aug 16, 2024
1 parent 7c6061e commit 5644411
Showing 16 changed files with 582 additions and 506 deletions.
71 changes: 53 additions & 18 deletions README.md
@@ -15,48 +15,83 @@ conda activate ritme
make dev
```

## Model training locally
To train models with a defined configuration in `q2_ritme/config.json` run:
## Model training
The model configuration is defined in `q2_ritme/run_config.json`. If you want to parallelise the training of different model types, we recommend training each model type in a separate experiment. If you decide to run several model types in one experiment, be aware that they are trained sequentially, so the experiment will take longer to finish.

Once training has started, you can check the progress of the models in the tracking software you selected (see section #model-tracking).

To define a suitable model configuration, each variable in `q2_ritme/run_config.json` is described in the table below (an illustrative example configuration follows the table):

| Parameter | Description |
|-----------|-------------|
| experiment_tag | Name of the experiment. |
| host_id | Column name for unique host_id in the metadata. |
| target | Column name of the target variable in the metadata. |
| ls_model_types | List of model types to explore sequentially - options include "linreg", "trac", "xgb", "nn_reg", "nn_class", "nn_corn" and "rf". |
| models_to_evaluate_separately | List of models to evaluate separately during iterative learning - only possible for "xgb", "nn_reg", "nn_class" and "nn_corn". |
| num_trials | Total number of trials to run per model type: the larger this value, the more of the search space can be explored. |
| max_concurrent_trials | Maximum number of concurrently running trials. |
| path_to_ft | Path to the feature table file. |
| path_to_md | Path to the metadata file. |
| path_to_phylo | Path to the phylogenetic tree file. |
| path_to_tax | Path to the taxonomy file. |
| seed_data | Seed for data-related random operations. |
| seed_model | Seed for model-related random operations. |
| test_mode | Boolean flag to indicate if running in test mode. |
| tracking_uri | Which platform to use for experiment tracking: either "wandb" for WandB or "mlruns" for MLflow. See #model-tracking for set-up instructions. |
| train_size | Fraction of data to use for training (e.g., 0.8 for 80% train, 20% test split). |

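For orientation, a minimal sketch of what such a configuration could look like is shown below. All values (paths, target column, model lists, seeds) are placeholders for illustration only; the key names follow the table above, so cross-check them against the `q2_ritme/run_config.json` shipped with the package:
````
{
    "experiment_tag": "example_experiment",
    "host_id": "host_id",
    "target": "age_days",
    "ls_model_types": ["linreg", "rf", "xgb"],
    "models_to_evaluate_separately": ["xgb"],
    "num_trials": 10,
    "max_concurrent_trials": 5,
    "path_to_ft": "data/feature_table.tsv",
    "path_to_md": "data/metadata.tsv",
    "path_to_phylo": "data/phylo_tree.nwk",
    "path_to_tax": "data/taxonomy.tsv",
    "seed_data": 12,
    "seed_model": 12,
    "test_mode": false,
    "tracking_uri": "mlruns",
    "train_size": 0.8
}
````
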
### Local training
To train models locally with the configuration defined in `q2_ritme/config.json`, run:
````
python q2_ritme/run_n_eval_tune.py --config q2_ritme/run_config.json
./launch_local.sh q2_ritme/config.json
````

Once you have trained some models, you can check the progress of the trained models by launching `mlflow ui --backend-store-uri experiments/mlruns`.

To evaluate the best trial (trial < experiment) of all launched experiments, run:
To evaluate the best trial (trial < experiment) of all launched experiments locally, run:
````
python q2_ritme/eval_best_trial_overall.py --model_path "experiments/models"
````

## Model training on HPC with slurm:
Edit file `launch_slurm_cpu.sh` and then run
### Training with slurm on HPC
To train a model with slurm on 1 node, edit the file `launch_slurm_cpu.sh` and then run
````
sbatch launch_slurm_cpu.sh
````
If you (or your collaborators) plan to launch multiple jobs on the same infrastructure you should set the variable `JOB_NB` in `launch_slurm_cpu.sh` accordingly. This variable makes sure that the assigned ports don't overlap (see [here](https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html#slurm-networking-caveats)).

If you are using SLURM and get the following error returned: "RuntimeError: can't start new thread"
it is probably caused by your hardware. Try decreasing the CPUs allocated to the job and/or decrease the variable `max_concurrent_trials` in `tune_models.py`.
If you are using SLURM and your error message contains this: "The process is killed by SIGKILL by OOM killer due to high memory usage", you should increase the assigned memory per CPU (`--mem-per-cpu`).
To train a model with slurm on multiple nodes, or to run multiple Ray instances on the same HPC, you can use: `sbatch launch_slurm_cpu_multi_node.sh`. If you (or your collaborators) plan to launch multiple jobs on the same infrastructure, you should set the variable `JOB_NB` in `launch_slurm_cpu_multi_node.sh` accordingly. This variable makes sure that the assigned ports don't overlap (see [here](https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html#slurm-networking-caveats) and the excerpt below). Currently, the script allows up to 3 parallel Ray slurm jobs.
**Note:** training a model with slurm on multiple nodes can be very specific to your infrastructure, so you might need to adjust this bash script to your set-up.
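
For reference, all Ray ports in `launch_slurm_cpu_multi_node.sh` are derived from `JOB_NB`, which is why jobs with different `JOB_NB` values cannot collide. A condensed excerpt mirroring the arithmetic in the script:
````
JOB_NB=2                                    # allowed values: 1, 2, 3
port=$((6378 + JOB_NB))                     # Ray head port, e.g. 6380
dashboard_port=$((8265 + JOB_NB))           # e.g. 8267
min_worker_port=$((2 + JOB_NB * 10000))     # e.g. 20002
max_worker_port=$((9999 + JOB_NB * 10000))  # e.g. 29999
````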

#### Some common slurm errors:
If you are using SLURM and ...
* ... get the following error returned: "RuntimeError: can't start new thread", it is probably caused by thread limits of the cluster. You can try increasing the number of allowed threads with `ulimit -u` in `launch_slurm_cpu.sh` and/or decreasing the variable `max_concurrent_trials` in `q2_ritme/config.json`. In case neither helps, it might be worth contacting the cluster administrators.

* ... your error message contains "The process is killed by SIGKILL by OOM killer due to high memory usage", you should increase the assigned memory per CPU (`--mem-per-cpu`) in `launch_slurm_cpu.sh` (see the sketch below).
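
Both remedies are applied in `launch_slurm_cpu.sh`; a sketch of the relevant lines with example values (illustrative only, not tuned recommendations):
````
#SBATCH --mem-per-cpu=8192   # header section: raise memory per CPU (script default: 4096)

ulimit -u 120000             # user-settings block: raise the thread limit (script default: 60000)
````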

## Model tracking
In the config file you can choose to track your trials with MLflow (tracking_uri=="mlruns") or with WandB (tracking_uri=="wandb"). In case of using WandB you need to store your `WANDB_API_KEY` & `WANDB_ENTITY` as a environment variable in `.env`. Make sure to ignore this file in version control (add to `.gitignore`)!
In the config file you can choose to track your trials with MLflow (tracking_uri=="mlruns") or with WandB (tracking_uri=="wandb").

### MLflow
If you use MLflow, you can view your models with `mlflow ui --backend-store-uri experiments/mlruns`. For more information check out the [official MLflow documentation](https://mlflow.org/docs/latest/index.html).

### WandB
If you use WandB, you need to store your `WANDB_API_KEY` & `WANDB_ENTITY` as environment variables in `.env`. Make sure to exclude this file from version control (add it to `.gitignore`)!

The `WANDB_ENTITY` is the project name you would like to store the results in. For more information on this parameter see the official webpage from WandB initialization [here](https://docs.wandb.ai/ref/python/init).
The `WANDB_ENTITY` is the project name you would like to store the results in. For more information on this parameter see the official webpage for initializing WandB [here](https://docs.wandb.ai/ref/python/init).
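
A minimal `.env` sketch with placeholder values:
````
# .env - keep this file out of version control (add it to .gitignore)
# WANDB_ENTITY determines where the results are stored (see link above)
WANDB_API_KEY=your-api-key
WANDB_ENTITY=your-entity
````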

Also, if you are running WandB from an HPC, you might need to set the proxy variables to your respective proxy URLs by exporting:
````
export HTTPS_PROXY=http://proxy.example.com:8080
export HTTP_PROXY=http://proxy.example.com:8080
````
## Code test coverage
## Developer topics - to be removed prior to publication
### Code test coverage
To run test coverage with Code Gutters in VScode run:
````
pytest --cov=q2_ritme q2_ritme/tests/ --cov-report=xml:coverage.xml
````
## Call graphs
### Call graphs
To create a call graph for all functions in the package, run the following commands:
````
pip install pyan3==1.1.1
@@ -65,6 +100,6 @@ pyan3 q2_ritme/**/*.py --uses --no-defines --colored --grouped --annotated --svg
````
(Note: other options for creating call graphs, such as code2flow and snakeviz, were also tested. However, although properly maintained, they did not directly output call graphs the way pyan3 does.)
## Background
### Why ray tune?
### Background
#### Why ray tune?
"By using tuning libraries such as Ray Tune we can try out combinations of hyperparameters. Using sophisticated search strategies, these parameters can be selected so that they are likely to lead to good results (avoiding an expensive exhaustive search). Also, trials that do not perform well can be preemptively stopped to reduce waste of computing resources. Lastly, Ray Tune also takes care of training these runs in parallel, greatly increasing search speed." [source](https://docs.ray.io/en/latest/tune/examples/tune-xgboost.html#tune-xgboost-ref)
19 changes: 19 additions & 0 deletions launch_local.sh
@@ -0,0 +1,19 @@
#!/bin/bash

if [ $# -ne 1 ]; then
    echo "Usage: $0 <CONFIG>"
    exit 1
fi

CONFIG=$1
PORT=6379

if ! nc -z localhost $PORT; then
    echo "Starting Ray on port $PORT"
    ray start --head --port=$PORT --dashboard-port=0
else
    echo "Ray is already running on port $PORT"
fi

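# derive the output file name from the config file's basename (assumes a path of the form <dir>/<name>.json)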
OUTNAME=$(echo "$CONFIG" | sed -n 's/.*\/\([^/]*\)\.json/\1/p')
python q2_ritme/run_n_eval_tune.py --config "$CONFIG" > x_"$OUTNAME"_out.txt 2>&1
83 changes: 2 additions & 81 deletions launch_slurm_cpu.sh
@@ -3,8 +3,8 @@
#SBATCH --job-name="run_config"
#SBATCH -A partition_name
#SBATCH --nodes=1
#SBATCH --cpus-per-task=20
#SBATCH --time=23:59:59
#SBATCH --cpus-per-task=100
#SBATCH --time=119:59:59
#SBATCH --mem-per-cpu=4096
#SBATCH --output="%x_out.txt"
#SBATCH --open-mode=append
@@ -17,91 +17,12 @@ echo "SLURM_GPUS_PER_TASK: $SLURM_GPUS_PER_TASK"
# ! USER SETTINGS HERE
# -> config file to use
CONFIG="q2_ritme/run_config.json"
# -> count of this concurrent job launched on same infrastructure
# -> only these values are allowed: 1, 2, 3 - since below ports are
# -> otherwise taken or not allowed
JOB_NB=1

# if your number of threads is limited, increase as needed
ulimit -u 60000
ulimit -n 524288
# ! USER END __________

# __doc_head_address_start__
# script was edited from:
# https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html

# Getting the node names
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)

head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# if we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$head_node_ip" == *" "* ]]; then
IFS=' ' read -ra ADDR <<<"$head_node_ip"
if [[ ${#ADDR[0]} -gt 16 ]]; then
head_node_ip=${ADDR[1]}
else
head_node_ip=${ADDR[0]}
fi
echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi
# __doc_head_address_end__

# __doc_head_ray_start__
port=$((6378 + JOB_NB))
node_manager_port=$((6600 + JOB_NB * 100))
object_manager_port=$((6601 + JOB_NB * 100))
ray_client_server_port=$((1 + JOB_NB * 10000))
redis_shard_ports=$((6602 + JOB_NB * 100))
min_worker_port=$((2 + JOB_NB * 10000))
max_worker_port=$((9999 + JOB_NB * 10000))
dashboard_port=$((8265 + JOB_NB))

ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"

echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
ray start --head --node-ip-address="$head_node_ip" \
--port=$port \
--node-manager-port=$node_manager_port \
--object-manager-port=$object_manager_port \
--ray-client-server-port=$ray_client_server_port \
--redis-shard-ports=$redis_shard_ports \
--min-worker-port=$min_worker_port \
--max-worker-port=$max_worker_port \
--dashboard-port=$dashboard_port \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK:-0}" --block &
# __doc_head_ray_end__

# __doc_worker_ray_start__
# optional, though may be useful in certain versions of Ray < 1.0.
sleep 10

# number of nodes other than the head node
worker_num=$((SLURM_JOB_NUM_NODES - 1))

for ((i = 1; i <= worker_num; i++)); do
node_i=${nodes_array[$i]}
echo "Starting WORKER $i at $node_i"
srun --nodes=1 --ntasks=1 -w "$node_i" \
ray start --address "$ip_head" \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK:-0}" --block &
sleep 5
done
# __doc_worker_ray_end__

# Output the dashboard URL
dashboard_url="http://${head_node_ip}:${dashboard_port}"
export RAY_DASHBOARD_URL="$dashboard_url"
echo "Ray Dashboard URL: $RAY_DASHBOARD_URL"

# __doc_script_start__
python -u q2_ritme/run_n_eval_tune.py --config $CONFIG
sstat -j $SLURM_JOB_ID

110 changes: 110 additions & 0 deletions launch_slurm_cpu_multi_node.sh
@@ -0,0 +1,110 @@
#!/bin/bash

#SBATCH --job-name="run_config"
#SBATCH -A partition_name
#SBATCH --nodes=1
#SBATCH --cpus-per-task=100
#SBATCH --time=23:59:59
#SBATCH --mem-per-cpu=4096
#SBATCH --output="%x_out.txt"
#SBATCH --open-mode=append

set -x

echo "SLURM_CPUS_PER_TASK: $SLURM_CPUS_PER_TASK"
echo "SLURM_GPUS_PER_TASK: $SLURM_GPUS_PER_TASK"

# ! USER SETTINGS HERE
# -> config file to use
CONFIG="q2_ritme/run_config.json"
# -> count of this concurrent job launched on same infrastructure
# -> only these values are allowed: 1, 2, 3 - otherwise the ports derived
# -> below are already taken or not allowed
JOB_NB=2

# if your number of threads is limited, increase as needed
ulimit -u 60000
ulimit -n 524288
# ! USER END __________

# __doc_head_address_start__
# script was edited from:
# https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html

# Getting the node names
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)

head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# if we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$head_node_ip" == *" "* ]]; then
IFS=' ' read -ra ADDR <<<"$head_node_ip"
if [[ ${#ADDR[0]} -gt 16 ]]; then
head_node_ip=${ADDR[1]}
else
head_node_ip=${ADDR[0]}
fi
echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi
# __doc_head_address_end__

# __doc_head_ray_start__
port=$((6378 + JOB_NB))
node_manager_port=$((6600 + JOB_NB * 100))
object_manager_port=$((6601 + JOB_NB * 100))
ray_client_server_port=$((1 + JOB_NB * 10000))
redis_shard_ports=$((6602 + JOB_NB * 100))
min_worker_port=$((2 + JOB_NB * 10000))
max_worker_port=$((9999 + JOB_NB * 10000))
dashboard_port=$((8265 + JOB_NB))
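# note: with JOB_NB in {1, 2, 3} these derived ports do not overlap between jobs (e.g. JOB_NB=2 -> head port 6380, dashboard 8267, worker ports 20002-29999)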

ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"

echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
ray start --head --node-ip-address="$head_node_ip" \
--port=$port \
--node-manager-port=$node_manager_port \
--object-manager-port=$object_manager_port \
--ray-client-server-port=$ray_client_server_port \
--redis-shard-ports=$redis_shard_ports \
--min-worker-port=$min_worker_port \
--max-worker-port=$max_worker_port \
--dashboard-port=$dashboard_port \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK:-0}" --block &
# __doc_head_ray_end__

# __doc_worker_ray_start__
# optional, though may be useful in certain versions of Ray < 1.0.
sleep 10

# number of nodes other than the head node
worker_num=$((SLURM_JOB_NUM_NODES - 1))

for ((i = 1; i <= worker_num; i++)); do
node_i=${nodes_array[$i]}
echo "Starting WORKER $i at $node_i"
srun --nodes=1 --ntasks=1 -w "$node_i" \
ray start --address "$ip_head" \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK:-0}" --block &
sleep 5
done
# __doc_worker_ray_end__

# Output the dashboard URL
dashboard_url="http://${head_node_ip}:${dashboard_port}"
export RAY_DASHBOARD_URL="$dashboard_url"
echo "Ray Dashboard URL: $RAY_DASHBOARD_URL"

# __doc_script_start__
python -u q2_ritme/run_n_eval_tune.py --config $CONFIG
sstat -j $SLURM_JOB_ID

# get elapsed time of job
echo "TIME COUNTER:"
sacct -j $SLURM_JOB_ID --format=elapsed --allocations
31 changes: 31 additions & 0 deletions launch_slurm_cpu_own_ss.sh
@@ -0,0 +1,31 @@
#!/bin/bash

#SBATCH --job-name="r_optuna_own_ss_rf"
#SBATCH -A es_bokulich
#SBATCH --nodes=1
#SBATCH --cpus-per-task=100
#SBATCH --time=119:59:59
#SBATCH --mem-per-cpu=4096
#SBATCH --output="%x_out.txt"
#SBATCH --open-mode=append

set -x

echo "SLURM_CPUS_PER_TASK: $SLURM_CPUS_PER_TASK"
echo "SLURM_GPUS_PER_TASK: $SLURM_GPUS_PER_TASK"

# ! USER SETTINGS HERE
# -> config file to use
CONFIG="q2_ritme/r_optuna_own_ss_rf.json"

# if your number of threads is limited, increase as needed
ulimit -u 60000
ulimit -n 524288
# ! USER END __________

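# note: unlike launch_slurm_cpu_multi_node.sh there is no explicit `ray start` here; presumably run_n_eval_tune.py initializes Ray locally on this single node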
python -u q2_ritme/run_n_eval_tune.py --config $CONFIG
sstat -j $SLURM_JOB_ID

# get elapsed time of job
echo "TIME COUNTER:"
sacct -j $SLURM_JOB_ID --format=elapsed --allocations
5 changes: 1 addition & 4 deletions q2_ritme/evaluate_models.py
@@ -142,10 +142,7 @@ def select(self, data, split):
# assign self.train_selected_fts to be able to run select on test set later
train_selected = select_microbial_features(
data,
self.data_config["data_selection"],
self.data_config["data_selection_i"],
self.data_config["data_selection_q"],
self.data_config["data_selection_t"],
self.data_config,
ft_prefix,
)
self.train_selected_fts = train_selected.columns
