This project is derived from an assignment I did during my bootcamp at Yotta Academy. It aims to classify the morphologies of distant galaxies using deep neural networks.
It is based on the Kaggle Galaxy Zoo Challenge.
Originally posed as a regression problem in the Kaggle challenge, we formulate it here as a multiclass classification problem, since classification is ultimately the goal behind the project. Additionally, this has the benefit of simplifying things a bit.
To better understand the task to be learned by the model, give it a go yourself: try it here.
Check out my experiments and the project's report on Weights & Biases.
A few related papers on the topic are available here:
Ensure your GPU driver and CUDA are properly set up for PyTorch to use them (the name of your device should appear):
nvidia-smi
If you don't have it already — I highly recommend it! — install poetry:
make setup-poetry
Set up the environment with Python 3.10, e.g. using miniconda (easier IMO):
git clone git@github.com:aliberts/galaxy-zoo.git
cd galaxy-zoo
conda create --yes --name gzoo python=3.10
conda activate gzoo
poetry install
or pyenv:
git clone git@github.com:aliberts/galaxy-zoo.git
cd galaxy-zoo
pyenv install 3.10:latest
pyenv local 3.10:latest
poetry install
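Once the environment is installed, the GPU check can also be repeated from Python. The snippet below is a minimal sketch, not part of the repository: `describe_device` is a hypothetical helper name, and it only relies on `torch.cuda.is_available()` and `torch.cuda.get_device_name()`.

```python
import importlib.util


def describe_device() -> str:
    """Return a short description of the device PyTorch would use."""
    # Avoid a hard failure if the environment is not installed yet.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"

    import torch

    if torch.cuda.is_available():
        # Name of the first visible CUDA device, as reported by the driver.
        return f"cuda: {torch.cuda.get_device_name(0)}"
    return "cpu only (CUDA not available to PyTorch)"


if __name__ == "__main__":
    print(describe_device())
```

Run it inside the project environment, for example by saving it to a file and calling it with poetry run python.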
Download the dataset:
make dataset
This will download and extract the archives into dataset/. You'll need to log in with Kaggle's API first and place your kaggle.json API key inside ~/.kaggle (the default location).
You can also do it manually by downloading it here. In that case, don't forget to set the dataset.dir config option to the directory you put it in.
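Before running make dataset, it can help to confirm that the API key is where Kaggle expects it. The following is a small sketch under stated assumptions: `find_kaggle_credentials` is a hypothetical helper (not part of this repository or of the Kaggle client), and the fallback to the `KAGGLE_CONFIG_DIR` environment variable reflects my understanding of the Kaggle API rather than anything stated above.

```python
import json
import os
from pathlib import Path


def find_kaggle_credentials(config_dir=None):
    """Return the path to kaggle.json if it exists and looks valid, else None."""
    # Assumption: fall back to KAGGLE_CONFIG_DIR, then to ~/.kaggle.
    if config_dir is None:
        config_dir = os.environ.get("KAGGLE_CONFIG_DIR", Path.home() / ".kaggle")
    path = Path(config_dir) / "kaggle.json"
    if not path.is_file():
        return None
    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return None
    # A key file downloaded from Kaggle holds a username and a key.
    if not isinstance(creds, dict) or not {"username", "key"} <= creds.keys():
        return None
    return path


if __name__ == "__main__":
    found = find_kaggle_credentials()
    print(f"credentials found at {found}" if found else "no usable kaggle.json found")
```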
Make your commands shorter with this alias:
alias py='poetry run python'
If you intend to contribute to this repo, install the pre-commit hooks with:
pre-commit install
You're good to go!
poetry run python -m gzoo.app.make_labels
This will produce the classification_labels.csv file inside dataset/, which is needed for training. These class labels are produced from the original regression labels in training_solutions_rev1.csv.
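The exact rule used by gzoo.app.make_labels is not described here. As a rough illustration of how regression labels can be turned into class labels, the sketch below assumes the simplest possible rule: take the answer with the highest vote fraction for the first Galaxy Zoo question (columns Class1.1 to Class1.3). The class names, the helper functions, and the sample rows are all made up for illustration.

```python
import csv
import io

# Columns for the first Galaxy Zoo question in training_solutions_rev1.csv.
# The class names on the right are illustrative, not the project's own.
QUESTION_1 = {
    "Class1.1": "smooth",
    "Class1.2": "features_or_disk",
    "Class1.3": "star_or_artifact",
}


def to_class_label(row):
    """Pick the answer with the highest vote fraction (assumed rule)."""
    best_column = max(QUESTION_1, key=lambda column: float(row[column]))
    return QUESTION_1[best_column]


def make_labels(lines):
    """Map an iterable of CSV lines to (GalaxyID, label) pairs."""
    reader = csv.DictReader(lines)
    return [(row["GalaxyID"], to_class_label(row)) for row in reader]


# Tiny made-up sample in the same column layout, for illustration only.
sample = io.StringIO(
    "GalaxyID,Class1.1,Class1.2,Class1.3\n"
    "1,0.80,0.15,0.05\n"
    "2,0.10,0.85,0.05\n"
)
print(make_labels(sample))
```

A real labelling scheme would likely walk further down the Galaxy Zoo decision tree and apply confidence thresholds; this only shows the regression-to-class idea.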
poetry run python -m gzoo.app.split_data
This will split the dataset into training / validation / testing partitions and write those partitions to a clf_labels_split.csv file. The ratios used for the partitioning are set in the dataset.test_split_ratio and dataset.val_split_ratio config options.
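To make the role of the two ratios concrete, here is a minimal sketch of a seeded three-way split. It is not the code in gzoo.app.split_data: `split_indices` is a hypothetical function, and it assumes both ratios are fractions of the whole dataset, which may differ from how the project interprets them.

```python
import random


def split_indices(n_samples, test_ratio, val_ratio, seed=0):
    """Shuffle indices 0..n_samples-1 and cut them into train/val/test."""
    if test_ratio < 0 or val_ratio < 0 or test_ratio + val_ratio >= 1:
        raise ValueError("ratios must be non-negative and sum to less than 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducible splits
    n_test = int(n_samples * test_ratio)
    n_val = int(n_samples * val_ratio)
    return {
        "test": indices[:n_test],
        "val": indices[n_test:n_test + n_val],
        "train": indices[n_test + n_val:],
    }


splits = split_indices(n_samples=1000, test_ratio=0.1, val_ratio=0.1)
print({name: len(ids) for name, ids in splits.items()})
```

Fixing the seed keeps the partitions identical across runs, so validation and test metrics stay comparable between experiments.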
poetry run python -m gzoo.app.train
script option:
--config_path: specify the .yaml config file to read options from. Every run config option should be listed in this file (the default file is config/train.yaml), and every option in that file can be overridden on the fly at the command line.
For instance, if you are fine with the values in the yaml config file but just want to change the number of epochs, you can either change it in the config file or directly run:
poetry run python -m gzoo.app.train --compute.epochs=50
This will use all config values from config/train.yaml except the number of epochs, which will be set to 50.
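The override mechanism can be pictured as merging dotted command-line keys into a nested config. The sketch below is only an illustration of that idea, not the implementation in gzoo.infra.config: `apply_overrides` is a hypothetical function, and the nested dict stands in for the contents of the yaml file.

```python
import ast
import copy


def apply_overrides(config, args):
    """Return a copy of a nested dict with `--a.b=value` overrides applied."""
    result = copy.deepcopy(config)
    for arg in args:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError(f"expected --section.option=value, got {arg!r}")
        dotted_key, raw_value = arg[2:].split("=", 1)
        *sections, option = dotted_key.split(".")
        node = result
        for section in sections:
            node = node[section]  # KeyError if the section is unknown
        if option not in node:
            raise KeyError(f"unknown option: {dotted_key}")
        try:
            # Parse numbers, booleans and None; keep anything else as text.
            node[option] = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            node[option] = raw_value
    return result


defaults = {"compute": {"epochs": 90, "batch-size": 128}, "model": {"arch": "resnet18"}}
config = apply_overrides(defaults, ["--compute.epochs=50"])
print(config["compute"]["epochs"])  # 50
```

Rejecting unknown keys, as done here, catches typos such as --compute.epoch=50 instead of silently ignoring them.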
main run options:
--compute.seed: seed for deterministic training (default: None)
--compute.epochs: total number of epochs (default: 90)
--compute.batch-size: batch size (default: 128)
--compute.workers: number of data-loading threads (default: 8)
--model.arch: model architecture to be used (default: resnet18)
--model.pretrained: use pre-trained model (default: False)
--optimizer.lr: optimizer learning rate (default: 3.e-4 with Adam)
--optimizer.momentum: optimizer momentum (for SGD only, default: 0.9)
--optimizer.weight-decay: optimizer weights regularization (L2, default: 1.e-4)
poetry run python -m gzoo.app.predict
Configuration works the same way as for training; the default config is at config/predict.yaml.
A 1-image example is provided which you can run with:
poetry run python -m gzoo.app.predict --dataset.dir=example/
If you make changes in gzoo.infra.config, you should also update the related .yaml config files in config/ with:
make config