$ cd $HULC_ROOT/slurm_scripts
$ python slurm_training.py --venv hulc_venv datamodule.root_data_dir=/path/to/dataset/
This assumes that --venv hulc_venv specifies a conda environment. To use virtualenv instead, change line 18 of sbatch_lfp.sh accordingly.
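For orientation, the only difference is how the environment is activated inside the batch script. A rough sketch of the two variants (the actual contents of sbatch_lfp.sh may differ; the virtualenv path is a placeholder):

# conda environment, as assumed by --venv hulc_venv:
source activate hulc_venv

# virtualenv alternative (replace with your actual path):
source /path/to/hulc_venv/bin/activate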
All Hydra arguments can be used as in a normal training run. Use the following optional command line arguments for slurm:
--log_dir: slurm log directory
--job_name: slurm job name
--gpus: number of gpus
--mem: memory
--cpus: number of cpus
--days: time limit in days
--partition: name of slurm partition
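For example, a submission that sets several of these options could look like the following (the argument values, including the memory format, are illustrative assumptions rather than documented defaults):

$ python slurm_training.py --venv hulc_venv --log_dir /path/to/slurm_logs --job_name hulc_train --gpus 4 --cpus 16 --mem 64GB --days 3 --partition gpu datamodule.root_data_dir=/path/to/dataset/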
The script creates a new folder in the specified log directory, tagged with the date and the job name, before the job is submitted to the slurm queue. To ensure reproducibility, the current state of the calvin repository is copied to the log directory at submit time and installed locally, so you can schedule multiple trainings without interference from future changes to the repository.
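As an illustration, a submit-time log directory might look roughly like this (the date format and exact layout are assumptions based on the description above, not verified against the repository):

/path/to/slurm_logs/2022-05-01/hulc_train/
    calvin/               # snapshot of the repository, installed locally
    resume_training.sh    # autogenerated resume script (see below)
    evaluate.sh           # autogenerated evaluation script (see below)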
Every job submission creates a resume_training.sh script in the log folder. To resume a training, call

$ sh <PATH_TO_LOG_DIR>/resume_training.sh

By default, the model loads the latest saved checkpoint.
To evaluate a trained model via slurm, run

$ sh <PATH_TO_LOG_DIR>/evaluate.sh

which automatically places the job on the same partition that the model was trained on. Note that this script is also autogenerated.