This is the official code for our paper "Revisiting Pretraining Objectives for Tabular Deep Learning" (paper)
Check out other projects on tabular Deep Learning: link.
Feel free to report issues and post questions/feedback/ideas.
You can view all the results and build your own tables with this notebook.
- Install conda (just to manage the env).
- Run the following commands:
export REPO_DIR=/path/to/the/code
cd $REPO_DIR
conda create -n tdl python=3.9.7
conda activate tdl
pip install torch==1.10.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
# if the following commands do not succeed, update conda
conda env config vars set PYTHONPATH=${PYTHONPATH}:${REPO_DIR}
conda env config vars set PROJECT_DIR=${REPO_DIR}
conda activate tdl
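If the environment was set up correctly, a quick check like the following should report the installed PyTorch version and CUDA availability (a minimal sketch; the exact version string depends on your setup):

```python
# Sanity check for the freshly created environment (run inside `conda activate tdl`).
import torch

print(torch.__version__)          # expected: 1.10.1+cu111
print(torch.cuda.is_available())  # should print True on a GPU machine
```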
Here we describe the necessary information for reproducing the experimental results.
We upload the datasets used in the paper with our train/val/test splits here. We do not impose any additional restrictions beyond the original dataset licenses; the data sources are listed in the paper appendix.
You can download and extract the datasets with the following commands:
conda activate tdl
cd $PROJECT_DIR
wget "https://www.dropbox.com/s/cj9ex11u6ri0tdy/tabular-pretrains-data.tar?dl=1" -O tabular-pretrains-data.tar
tar -xvf tabular-pretrains-data.tar
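After extraction, you can inspect a dataset split from Python. This is a minimal sketch; the directory and file names below (data/california, N_train.npy, y_train.npy) are assumptions about the archive layout, so adjust them to whatever the extracted archive actually contains:

```python
# Inspect one dataset split from the extracted archive.
# NOTE: the paths below are illustrative assumptions, not guaranteed by the archive.
from pathlib import Path
import numpy as np

dataset_dir = Path("data/california")            # hypothetical dataset directory
X_train = np.load(dataset_dir / "N_train.npy")   # numerical features of the train split
y_train = np.load(dataset_dir / "y_train.npy")   # targets of the train split
print(X_train.shape, y_train.shape)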
There are multiple scripts inside the bin directory for the various pretraining objectives, for finetuning from checkpoints (the same script is also used to train from scratch), and for the GBDT baselines.
Each pretraining script follows the same structure. It constructs a model from its config (MLPs, MLPs with numerical embeddings, ResNets, Transformers) and pretrains it, periodically calling the finetune script for early stopping (or finetuning only at the end if early_stop_type = "pretrain" is specified in the config).
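The following is a conceptual sketch of that shared structure, not the repository code; all function and parameter names here are illustrative. It pretrains on the chosen objective and periodically finetunes to obtain the validation metric used for early stopping:

```python
# Conceptual sketch of the pretraining loop with periodic finetuning-based early stopping.
# All names (pretrain_objective, finetune_fn, ...) are hypothetical placeholders.
def pretrain(model, pretrain_objective, finetune_fn, max_steps, eval_every, patience):
    best_score, best_state, bad_evals = float("-inf"), None, 0
    for step in range(max_steps):
        pretrain_objective.training_step(model)        # e.g. contrastive or mask prediction step
        if (step + 1) % eval_every == 0:
            score = finetune_fn(model)                 # finetune a copy, return validation metric
            if score > best_score:
                best_score, best_state, bad_evals = score, model.state_dict(), 0
            else:
                bad_evals += 1
                if bad_evals >= patience:              # stop pretraining on the finetuning metric
                    break
    return best_state
```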
There are two variants of each script: single-GPU and DDP multi-GPU (used for large datasets and for models with embeddings). They are identical except for the DDP-related modifications.
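For reference, these DDP-related modifications are the standard PyTorch ones; a minimal sketch (not the repository code) of what the multi-GPU variants add looks roughly like this:

```python
# Typical DDP additions: process group init, model wrapping, and a distributed sampler.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup_ddp(model, dataset, batch_size):
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE in the environment
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    sampler = DistributedSampler(dataset)  # each process sees its own shard of the data
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return model, loader
```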
- bin/finetune.py -- used to train models from scratch or to finetune pretrained checkpoints
- bin/contrastive.py -- contrastive objective
- bin/[rec|mask]_(supervised) -- self-prediction objective variations
To run target-aware mask prediction pretraining on the California Housing dataset, use the following code snippet. It copies the tuning config, then tunes and evaluates MLP-PLR with target-aware mask prediction pretraining and creates the ensemble.
conda activate tdl
cd $PROJECT_DIR
mkdir -p exp/draft
cp exp/mask-target/mlp-p-lr/california/3_tuning.toml exp/draft/example_tuning.toml
export CUDA_VISIBLE_DEVICES=0
python bin/tune.py exp/draft/example_tuning.toml
python bin/evaluate.py exp/draft/example_tuning 15
python bin/ensemble.py exp/draft/example_evaluation