This repo contains the code used to pretrain and finetune LOCOST.
The scripts about state-space models are adapted from the official H3 repository.
Pre-trained models are available on the HuggingFace model hub.
Install both packages in the csrc/
folder:
cd csrc
cd fftconv
pip install ./
cd ../cauchy
pip install ./
We expect the datasets to be tokenized with the base LongT5 tokenizer. This formatting can be done with the script preprocess_data.py
.
These scripts rely on a .env
file, and is used through the python-dotenv package. Make sure to define here:
DATASET_PATH
, the base folder where are stored the dataset.TOKENIZER_PATH
, the path to the model tokenizer (we used the LongT5 tokenizer).CHECKPOINT_PATH
to save the models checkpoint during training.
The pretraining is ran with PytorchLightning and tracked with wandb
.
TRANSFORMERS_NO_ADVISORY_WARNINGS="true" python pretrain_script.py --dataset path/to/pretraining/dataset --config configs/pretraining/locost.yaml --wandb_name locost-pretraining