NLP research template for training language models using PyTorch + Lightning + Weights & Biases + HuggingFace. It's built to be customized but provides comprehensive, sensible default functionality.
If you are not doing NLP or want to use your own training code or template, the setup and environment tooling with Docker, mamba
, and conda-lock
in this template might still be interesting for you.
It's recommended to use mamba
to manage dependencies. mamba
is a drop-in replacement for conda
re-written in C++ to speed things up significantly (you can stick with conda
though). To provide reproducible environments, we use conda-lock
to generate lockfiles for each platform.
Installing mamba
On Unix-like platforms, run the snippet below. Otherwise, visit the mambaforge repo. Note this does not use the Anaconda installer, which reduces bloat.
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh
Installing conda-lock
The preferred method is to install conda-lock
using pipx install conda-lock
. For other options, visit the conda-lock repo. For basic usage, have a look at the commands below:
conda-lock install --name gpt5 conda-lock.yml # create environment with name gpt5 based on lockfile
conda-lock # create new lockfile based on environment.yml
conda-lock --update <package-name> # update specific packages in lockfile
Lockfiles are an easy way to exactly reproduce an environment.
After having installed mamba
and conda-lock
, you can create a mamba
environment named gpt5
from a lockfile with all necessary dependencies installed like this:
conda-lock install --name gpt5 conda-lock.yml
You can then activate your environment with
mamba activate gpt5
To generate new lockfiles after updating the environment.yml
file, simply run conda-lock -f environment.yml
.
Setup on ppc64le
If you're not using a PowerPC machine, do not worry about this.
Whenever you create an environment for a different processor architecture, some packages (especially pytorch
) need to be compiled specifically for that architecture. IBM PowerPC machines for example use a processor architecture called ppc64le
.
Setting up the environment ppc64le
is a bit tricky because the official channels do not provide packages compiled for ppc64le
. However, we can use the amazing Open-CE channel instead. A lockfile containing the relevant dependencies is already prepared in ppc64le.conda-lock.yml
and the environment again can be simply installed with:
conda-lock install --name gpt5-ppc64le ppc64le.conda-lock.yml
Dependencies for ppc64le
should go into the separate ppc64le.environment.yml
file. Use the following command to generate a new lockfile after updating the dependencies:
conda-lock --file ppc64le.environment.yml --lockfile ppc64le.conda-lock.yml
For fully reproducible environments and running on HPC clusters, we provide pre-built docker images at konstantinjdobler/nlp-research-template. We also provide a Dockerfile
that allows you to build new docker images with updated dependencies:
# first update `environment.yml` with your dependencies
# then this command will create a new conda-lock.yml file
conda-lock -f environment.yml
# this automatically uses your latest conda-lock.yml to create a reproducible docker image
docker build --tag <username>/<imagename>:<tag> --platform="linux/amd64" .
The specified username should be your personal dockerhub
username. This will make distribution and usage of your images easier with docker push/pull <your image>
.
We also provide shell commands and a convenience script to run all your training commands inside docker (recommended).
After all of this setup you are finally ready for some training. First of all, you need to create your data directory with a train.txt
and dev.txt
. Then you can start a training run in your environment with:
python train.py -n <run-name> -d /path/to/data --model roberta-base --offline
To see an overview over all options and their defaults, run python train.py --help
or have a look inside args.py
. We have disabled Weights & Biases syncing with the --offline
flag. If you want to log your results, enable W&B as described here and omit the --offline
flag.
Using GPUs for hardware acceleration
By default, train.py
tries to use a single CUDA GPU if available. If you want to train on multiple GPUs, increase the --num_devices
flag (this then uses DistributedDataParallel
under the hood). IMPORTANT: you should always select the GPUs that are visible to the script via the CUDA_VISIBLE_DEVICES
environment variable (e.g. CUDA_VISIBLE_DEVICES=0,2 python train.py ...
) or via the docker flags if training inside a container (recommended). To use different hardware accelerators, use the --accelerator
flag. You can use advanced parallel training strategies with --distributed_strategy
.
To conveniently run the training code inside a docker container, you can use the run-in-docker.sh script.
# execute the training inside your container
# -g 2 means only GPU 2 is visible to the script
# -g 0,2 would make the GPUs 0 and 2 visible
bash ./scripts/run-in-docker.sh -g 2 python train.py --num_devices 1 -n <run-name> -d /path/to/data/ --model roberta-base --offline
By default (no -g
flag), no GPUs are available inside the container. You probably want to adjust the run-in-docker.sh
script to add your own mounts for data and other things you want to load / save.
Docker + GPUs: You should always select specific GPUs to be visible inside the container. When using the run-in-docker.sh
script, use the -g
flag. When using docker natively, use e.g. --gpus='"device=0,7"'
(for the GPUs 0
and 7
) and adjust the --num_devices
flag according to your number of selected GPUs. Yes, the weird format of --gpus='"device=0,7"'
is important, otherwise the shell might not pass the flag correctly to nvidia-docker
(official Nvidia recommendation).
Single-line docker command
You can start a script inside a docker container in a single command:
docker run -it --user $(id -u):$(id -g) --ipc host -v "$(pwd)":/workspace -w /workspace --gpus='"device=7"' konstantinjdobler/nlp-research-template:latest python train.py --num_devices=1 ...
Since we have not mounted any cache directories (only the current working directory with $(pwd)
), nothing that is written to disk outside $(pwd)
is persistent in this example. You can add those with -v
or --mount
.
Using Docker with SLURM / pyxis
For security reasons, docker
might be disabled on your HPC cluster. You might be able to use the SLURM plugin pyxis
instead like this:
srun ... --container-image konstantinjdobler/nlp-research-template:latest python train.py ...
This uses enroot
under the hood to import your docker image and run your code inside the container. See the pyxis
documentation for more options, such as --container-mounts
or --container-writable
.
It might take a long time to start the container. You can prepare this by doing enroot import docker://konstantinjdobler/nlp-research-template:latest -o prepared-image.sqsh
and then modify the srun
:
srun ... --container-image /path/to/prepared-image.sqsh python train.py ...
If you want to run an interactive session with bash don't forget the --pty
flag.
Weights & Biases allows you to easily log metrics, training results, checkpoints, and hyperparameters. To enable Weights & Biases, enter your WANDB_ENTITY
and WANDB_PROJECT
in train.py and omit the --offline
flag for training.
Weights & Biases + Docker
When using docker we also have to get our WANDB_API_KEY
inside the container. You can find your personal API key at wandb.ai/authorize. Set WANDB_API_KEY
on your host machine and use the docker
flag --env WANDB_API_KEY
when starting your run. Or just use the run-in-docker.sh
script, which will try to parse the WANDB_API_KEY
from your ~/.netrc
file (or get it from the environment).
To save the exact configurations of experiments and save yourself some time typing out arguments in the command line, you can use .yml
style config files supplied via the --config_path
argument. You can also combine multiple configs. The order of importance is default args < config args (multiple configs are resolved in order) < command line args.
python train.py --config_path ./cfgs/example.yml ./cfgs/llama-from-scratch.yml --devices 8 -n my-training-run ...
If you want to connect to a remote host machine with GPUs for development, we recommend the VS Code Remote-SSH extension.
Ideally, you should also do your development inside the same docker container to reduce a mismatch between training and development. For this, use VS Code Dev Containers
. They allow you to develop in VS Code inside a docker container with full support for IntelliSense, type hints and more. The template already contains a .devcontainer
directory, where all the settings for it are stored - you can start right away!
VS Code Dev Container
example
After having installed the Remote-SSH-, and Dev Containers-Extension, you set up your Dev Container
in the following way:
- Establish the SSH-connection with the host by opening your VS Code command pallette and typing
Remote-SSH: Connect to Host
. Now you can connect to your host machine. - Open the folder that contains this template on the host machine.
- VS Code will automatically detect the
.devcontainer
directory and ask you to reopen the folder in a Dev Container. Alternatively, use the command pallette and typeDev Containers
. - Press
Reopen in Container
and wait for VS Code to set everything up. for the first time or when you changedevcontainer.json
, you will need to doRebuild and reopen in Container
.
There is a bit of setup: for a proper dev environment, you will need to configure mounts (cache directories, your datasets, ...) and environment variables like for a regular docker run command, have a look inside .devcontainer/devcontainer.json
.
conda-lock
is automatically installed for you but you have to add the --micromamba
flag inside the Dev Container (e.g. conda-lock --micromamba -f environment.yml
). Otherwise, conda-lock uses an anaconda installation, which takes over 8 hours to resolve the packages in the environments.
We automatically mount the ~/.gitconfig
and ~/.netrc
files for ease of use of Git and W&B, however these files have to exist on your host machine. They are created when executing git config --global user.email your.name@domain.com
and wandb login
, respectively.
If you want to use GPUs for development, you also need to specify the GPU you want to use in .devcontainer/devcontainer.json
. However, this is a bit cumbersome if you are often switching between GPUs. Alternatively, you edit your code in the Dev Container (without a GPU) but start all actual development runs of your script like you would for training with run-in-docker.sh
and select the GPU ad-hoc. The nice advantage of Dev Containers is that you are still using the exact same docker container for both.
Sometimes it's just quicker or unavoidable to create an environment via conda-lock install --name gpt5 conda-lock.yml
instead of using Docker. In most cases, this is fine since we are using lockfiles but there might be some tricky edge cases depending on the platform and OS. Just be careful to keep any local environments and your docker containers in sync. Docker containers also allow more advanced support for compiled CUDA kernels such as FlashAttention.
We use the ruff
linter and black
formatter. You should install their VS Code extensions and enable "Format on Save" inside VS Code.
Our project uses GitHub Actions for CI/CD to automate the building and pushing of our Docker images to Docker Hub. This ensures that our Docker images are always up-to-date with the latest dependencies specified in conda-lock.yml
.
To work with this CI/CD setup, you need to:
- Set the following secrets in your GitHub repository:
DOCKER_REGISTRY
: The Docker registry URL (if using Docker Hub, this is not needed).DOCKER_REGISTRY_TOKEN
: Your Docker Hub access token or password.
- Replace
konstantinjdobler
and mentions ofnlp-research-template
with your own Docker ID in the workflow file.github/workflows/docker.yml
If you do not want to automatically build and push images, just delete the workflow file.
To update the Docker image:
- Make necessary changes to the
Dockerfile
or update dependencies in theenvironment.yml
. - Generate a new
conda-lock.yml
by runningconda-lock -f environment.yml
. - Commit and push the changes to the
main
branch. - The GitHub Actions workflow will automatically build and push the new Docker image to Docker Hub.
The Docker images are tagged with the PyTorch and CUDA versions extracted from conda-lock.yml
, as well as a latest
tag for the most recent build. Use the specific tags if you need a particular version of PyTorch or CUDA, or use the latest
tag for the most recent build.