Experiments

This section provides guidelines for running model training, testing, and experimentation using multiple GPUs and configuration files.

ftw model --help
  Usage: ftw model [OPTIONS] COMMAND [ARGS]...

  Model-related commands.

Options:
  --help  Show this message and exit.

Commands:
  fit        Fit the model
  test       Test the model

Our current best model is a U-Net with an efficientnet-b3 backbone, trained with a weighted cross-entropy loss. Please see the config files under the configs/FTW-Release folder for more information.

Prerequisites for experimentation

Before running experiments, make sure to create configuration files in the configs directory. These files should specify the root directory of the dataset. Additionally, update the root argument in datasets.py to reflect the correct dataset path.

example_config.yaml gives an idea of the parameters that can be changed to spin up new experiments:

trainer:
  max_epochs: <E.G. 100, NUMBER OF EPOCHS>
  log_every_n_steps: <E.G. 10, LOGGING FREQUENCY>
  accelerator: <E.G. "gpu", ACCELERATOR>
  default_root_dir: <LOGS DIRECTORY>
  devices:
    - <DEVICE ID>
  callbacks:
    - class_path: lightning.pytorch.callbacks.ModelCheckpoint
      init_args:
        monitor: val_loss
        mode: min
        save_top_k: <E.G. 0, NUMBER OF MODELS TO SAVE>
        save_last: <TRUE / FALSE, WHETHER TO SAVE THE LAST MODEL OR NOT>
        filename: "{epoch}-{val_loss:.2f}"
model:
  class_path: ftw.trainers.CustomSemanticSegmentationTask
  init_args:
    loss: <E.G. "jaccard", LOSS FUNCTION>
    model: <E.G. "unet", MODEL>
    backbone: <E.G. "efficientnet-b3", BACKBONE MODEL>
    weights: <TRUE / FALSE, WHETHER TO USE PRETRAINED WEIGHTS OR NOT>
    patch_weights: <TRUE / FALSE, WHETHER TO PATCH THE WEIGHTS IN A CUSTOM FORMAT OR NOT>
    in_channels: <E.G. 8, NUMBER OF INPUT CHANNELS>
    num_classes: <E.G. 3, NUMBER OF CLASSES>
    num_filters: <E.G. 64, NUMBER OF FILTERS>
    ignore_index: null
    lr: <E.G. 1e-3, LEARNING RATE>
    patience: <E.G. 100, PATIENCE FOR COSINE ANNEALING>
data:
  class_path: ftw.datamodules.FTWDataModule
  init_args:
    batch_size: 32
    num_workers: 8
    num_samples: -1
    train_countries:
      - country 1
      - country 2
    val_countries:
      - country 1
      - country 2
    test_countries:
      - country 1
      - country 2
  dict_kwargs:
    root: <ROOT FOLDER OF THE DATASET>
    load_boundaries: <TRUE / FALSE WHETHER TO LOAD 3 CLASS MASKS OR NOT>
seed_everything: <SEED VALUE>
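
For illustration, a filled-in config might look like the following sketch. The experiment name, country identifier, dataset root, and hyperparameter values below are placeholders rather than recommended settings, and the exact accepted strings (e.g. loss names and country names) should be taken from the release configs under configs/FTW-Release:

trainer:
  max_epochs: 100
  log_every_n_steps: 10
  accelerator: "gpu"
  default_root_dir: logs/unet-effb3-example
  devices:
    - 0
  callbacks:
    - class_path: lightning.pytorch.callbacks.ModelCheckpoint
      init_args:
        monitor: val_loss
        mode: min
        save_top_k: 1
        save_last: true
        filename: "{epoch}-{val_loss:.2f}"
model:
  class_path: ftw.trainers.CustomSemanticSegmentationTask
  init_args:
    loss: "ce"
    model: "unet"
    backbone: "efficientnet-b3"
    weights: true
    patch_weights: false
    in_channels: 8
    num_classes: 3
    num_filters: 64
    ignore_index: null
    lr: 0.001
    patience: 100
data:
  class_path: ftw.datamodules.FTWDataModule
  init_args:
    batch_size: 32
    num_workers: 8
    num_samples: -1
    train_countries:
      - austria
    val_countries:
      - austria
    test_countries:
      - austria
  dict_kwargs:
    root: data/ftw
    load_boundaries: true
seed_everything: 42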

Model Weights

For training and experimentation, we utilize pre-trained model weights available on Hugging Face. You can download these weights directly from the Hugging Face Model Hub.
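
One way to fetch a checkpoint programmatically is with the huggingface_hub client; the repository ID and filename below are placeholders and should be replaced with the ones listed on the project's Hugging Face page:

from huggingface_hub import hf_hub_download

# Placeholder repo_id and filename -- replace with the model repository and
# checkpoint name listed on the Hugging Face Model Hub for this project.
ckpt_path = hf_hub_download(
    repo_id="<org>/<model-repo>",
    filename="<checkpoint>.ckpt",
)
print(f"Checkpoint downloaded to: {ckpt_path}")

The returned local path can then be passed to --ckpt_path when resuming training, or to --model when testing.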

Training

We use LightningCLI to streamline the training process, leveraging configuration files to define the model architecture, dataset, and training parameters.

Train a model from scratch

ftw model fit --help

Usage: ftw model fit [OPTIONS] [CLI_ARGS]...

  Fit the model

Options:
  -c, --config PATH  Path to the config file (required)  [required]
  --ckpt_path PATH   Path to a checkpoint file to resume training from
  --help             Show this message and exit.

You can train your model using a configuration file as follows:

ftw model fit --config configs/example_config.yaml

Resume training from a checkpoint

If training has been interrupted or if you wish to fine-tune a pre-trained model, you can resume training from a checkpoint:

ftw model fit --config configs/example_config.yaml --ckpt_path <Checkpoint File Path>

Visualizing the training process

TensorBoard is used to log the training and validation steps. The logs are stored in the folders specified in the config.yaml files. Look for this line in the configuration:

default_root_dir: logs/<experiment name>

To visualize the logs, pass the log directory of the experiments to the --logdir argument:

tensorboard --logdir=YOUR_DIR

Currently logged information includes:

  • Train multi-class accuracy
  • Train multi-class Jaccard index
  • Train loss
  • Validation multi-class accuracy
  • Validation multi-class Jaccard index
  • Validation loss
  • 10 prediction results (images)
  • Learning rate

Testing

Once your model has been trained, you can evaluate it on the test set specified in your datamodule. This can be done using the same configuration file used for training.

ftw model test --help

Usage: ftw model test [OPTIONS] [CLI_ARGS]...

  Test the model

Options:
  -m, --model TEXT            Path to model checkpoint  [required]
  --dir TEXT                  Root directory of dataset
  --gpu INTEGER               GPU to use
  --countries TEXT            Countries to evaluate on  [required]
  --postprocess               Apply postprocessing to the model output
  --iou_threshold FLOAT       IoU threshold for matching predictions to ground
                              truths
  -o, --out TEXT              Output file for metrics
  --model_predicts_3_classes  Whether the model predicts 3 classes or 2
                              classes
  --test_on_3_classes         Whether to test on 3 classes or 2 classes
  --temporal_options TEXT     Temporal option (stacked, windowA, windowB,
                              etc.)
  --help                      Show this message and exit.

Test a model

Using the FTW CLI to test the model, you can pass specific options, such as selecting the GPU, providing a checkpoint, specifying countries for testing, and postprocessing results:

ftw model test --gpu 0 --dir /path/to/dataset --model logs/path_to_model/checkpoints/last.ckpt --countries country_to_test_on --out results.csv

This will run the model on the selected GPU over the specified countries and write the test results to results.csv.

Note: If the data directory path is custom (not the default ./data/), make sure to pass the custom data directory path during testing using --dir custom_dir/ftw.

Parallel experimentation

To run multiple experiments across different GPUs in parallel, you can use the provided Python script run_experiments.py. It manages and distributes training tasks across the available GPUs using multiprocessing and a queue.

Run experiments in parallel

  1. Define the list of experiment configuration files in the experiment_configs list.
  2. Specify the list of GPUs in the GPUS variable (e.g., [0,1,2,3]).
  3. Set DRY_RUN = False to execute the experiments.

The script runs the specified experiments on the GPUs listed in GPUS, with each experiment using a configuration file from experiment_configs.

python run_experiments.py

The script will distribute the experiments across the specified GPUs using a queue, with each experiment running on the corresponding GPU.
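
The actual run_experiments.py may be organized differently, but a minimal sketch of this kind of dispatcher, assuming the ftw model fit command shown above and hypothetical config paths, could look like this (pinning each worker to a GPU via CUDA_VISIBLE_DEVICES is just one possible approach):

import multiprocessing as mp
import os
import subprocess
from queue import Empty

# Hypothetical experiment configs and GPU list -- replace with your own.
experiment_configs = [
    "configs/example_config_a.yaml",
    "configs/example_config_b.yaml",
]
GPUS = [0, 1]
DRY_RUN = True  # set to False to actually launch the runs


def worker(config_queue, gpu_id):
    """Pull configs off the queue and run them one at a time on this GPU."""
    while True:
        try:
            config = config_queue.get(timeout=1)
        except Empty:
            return
        cmd = ["ftw", "model", "fit", "--config", config]
        print(f"[GPU {gpu_id}] {' '.join(cmd)}")
        if not DRY_RUN:
            env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}
            subprocess.run(cmd, env=env, check=False)


if __name__ == "__main__":
    queue = mp.Queue()
    for cfg in experiment_configs:
        queue.put(cfg)

    workers = [mp.Process(target=worker, args=(queue, gpu)) for gpu in GPUS]
    for p in workers:
        p.start()
    for p in workers:
        p.join()

With CUDA_VISIBLE_DEVICES set this way, only one GPU is visible to each process, so the devices entry in each config should point at index 0.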