This repository contains the code for the paper Compute Better Spent: Replacing Dense Layers with Structured Matrices by Shikai Qiu, Andres Potapczynski, Marc Finzi, Micah Goldblum, and Andrew Gordon Wilson.
Dense linear layers are the dominant computational bottleneck in foundation models. Identifying more efficient alternatives to dense matrices has enormous potential for building more compute-efficient models, as exemplified by the success of convolutional networks in the image domain. In this work, we systematically explore structured matrices as replacements for dense matrices. We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance, especially as models scale. Using insights from the Maximal Update Parameterization, we determine the optimal scaling for initialization and learning rates of these unconventional layers. Finally, we measure the scaling laws of different structures to compare how quickly their performance improves with compute. We propose a novel matrix family containing Monarch matrices, the Block Tensor-Train (BTT), which we show performs better than dense matrices for the same compute on multiple tasks. On CIFAR-10/100 with augmentation, BTT achieves exponentially lower training loss than dense when training MLPs and ViTs. BTT matches dense ViT-S/32 performance on ImageNet-1k with 3.8 times less compute and is more efficient than dense for training small GPT-2 language models.
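As a rough illustration of what a structured replacement for a dense layer looks like, below is a minimal sketch of a Monarch-style two-factor layer, a special case of the BTT family: the dense matrix is replaced by two batches of small blocks applied along the two axes of a reshaped input. This is illustrative only and is not the repo's BTT implementation, which is more general and applies the structure-aware initialization and learning-rate scaling described in the paper.

```python
import torch
import torch.nn as nn


class MonarchLikeLinear(nn.Module):
    """Illustrative Monarch-style structured layer (a special case of BTT).

    Maps d_in = m * n inputs to d_out = p * q outputs with two batches of small
    dense blocks, so the parameter and FLOP count is O(nmp + npq) rather than
    the O(d_in * d_out) of a dense layer.
    """

    def __init__(self, m, n, p, q):
        super().__init__()
        self.m, self.n, self.p, self.q = m, n, p, q
        # First factor: for each of the n input slices, an (m x p) block.
        self.W1 = nn.Parameter(torch.randn(n, m, p) / m ** 0.5)
        # Second factor: for each of the p intermediate channels, an (n x q) block.
        self.W2 = nn.Parameter(torch.randn(p, n, q) / n ** 0.5)

    def forward(self, x):  # x: (batch, m * n)
        B = x.shape[0]
        x = x.view(B, self.m, self.n)                  # (B, m, n)
        z = torch.einsum("bmn,nmp->bnp", x, self.W1)   # mix along the m axis
        y = torch.einsum("bnp,pnq->bpq", z, self.W2)   # mix along the n axis
        return y.reshape(B, self.p * self.q)           # (B, p * q)
```

With m = n = p = q = sqrt(d), such a layer costs roughly 2·d^1.5 multiply-accumulates per input instead of the d^2 of a dense d × d layer, which is the kind of compute saving the scaling-law comparisons below are measuring.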
Please cite this work as:
    @article{qiu2024compute,
        title={{Compute Better Spent: Replacing Dense Layers with Structured Matrices}},
        author={Shikai Qiu and Andres Potapczynski and Marc Finzi and Micah Goldblum and Andrew Gordon Wilson},
        journal={International Conference on Machine Learning (ICML)},
        year={2024}
    }
To set up the environments, simply run

    source setup.sh

This script installs two conda environments named `struct` and `gpt`. Use `struct` for the CIFAR-10/100 and ImageNet experiments, and `gpt` for the GPT-2 experiments. The `struct` environment has more complex dependencies and will take a while to install.
The basic GPT-2 training code and model definition are adapted from nanoGPT. The directories `timm` and `scaling_mlps` are forked from the pytorch-image-models and scaling_mlps repositories, respectively.
To prepare the CIFAR-10/100 datasets, convert them to FFCV's `.beton` format:

    conda activate struct
    python scaling_mlps/data_utils/dataset_to_beton.py --dataset_name cifar10 --mode train --res 32
    python scaling_mlps/data_utils/dataset_to_beton.py --dataset_name cifar10 --mode val --res 32
    python scaling_mlps/data_utils/dataset_to_beton.py --dataset_name cifar100 --mode train --res 32
    python scaling_mlps/data_utils/dataset_to_beton.py --dataset_name cifar100 --mode val --res 32
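The training scripts load these `.beton` files through FFCV. Purely for illustration, here is a minimal sketch of reading one back with FFCV's standard loader; the file path and the field names `image`/`label` are assumptions, and the repo's own data pipeline may differ:

```python
from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import SimpleRGBImageDecoder, IntDecoder
from ffcv.transforms import ToTensor, ToTorchImage, Squeeze

# Hypothetical path to one of the files produced by dataset_to_beton.py.
loader = Loader(
    "cifar10_train_32.beton",
    batch_size=128,
    num_workers=4,
    order=OrderOption.RANDOM,
    pipelines={
        # Field names assumed to be "image" and "label".
        "image": [SimpleRGBImageDecoder(), ToTensor(), ToTorchImage()],
        "label": [IntDecoder(), ToTensor(), Squeeze()],
    },
)

for images, labels in loader:
    pass  # feed batches to the model
```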
For the ImageNet experiments, download and extract the ImageNet-1k dataset from here.
To prepare the OpenWebText dataset for the GPT-2 experiments, run:

    conda activate gpt
    python prepare_owt.py
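Since the GPT-2 code follows nanoGPT, the prepared OpenWebText data is presumably stored as a flat binary array of GPT-2 BPE token ids. The exact file names written by `prepare_owt.py` are an assumption here; the sketch below only illustrates the standard nanoGPT-style way of sampling a next-token-prediction batch from such a file:

```python
import numpy as np
import torch

# Assumption: prepare_owt.py writes uint16 GPT-2 token ids to a flat binary
# file, as nanoGPT's data preparation does. The file name is illustrative.
data = np.memmap("train.bin", dtype=np.uint16, mode="r")


def get_batch(batch_size=8, block_size=1024):
    # Sample random offsets and slice out (input, target) pairs shifted by
    # one token, the standard next-token-prediction setup.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y
```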
To reproduce the figures on structure-aware learning rates, run the following scripts:

    sh experiments/check_dh.sh
    sh experiments/lr_landscape.sh

Then run `notebooks/struct_aware_lr.ipynb` to make the figures.
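The point of these experiments is that the optimal learning rate depends on the structure and shrinks with width. As a purely illustrative sketch (not the repo's actual parameterization), here is how one might assign per-layer Adam learning rates that scale as 1/fan_in for dense weight matrices, in the spirit of the Maximal Update Parameterization:

```python
import torch


def mup_style_param_groups(model, base_lr=1e-3, base_fan_in=64):
    """Illustrative only: scale each weight matrix's Adam learning rate by
    1 / fan_in relative to a base width, in the spirit of muP. The repo derives
    structure-specific rules; this sketch covers only the dense case."""
    groups = []
    for p in model.parameters():
        if p.dim() == 2:                       # 2D weights (nn.Linear stores (out, in))
            fan_in = p.shape[-1]
            lr = base_lr * base_fan_in / fan_in
        else:                                  # biases, norm parameters, etc.
            lr = base_lr
        groups.append({"params": [p], "lr": lr})
    return groups


# Example usage with a hypothetical `model`:
# optimizer = torch.optim.AdamW(mup_style_param_groups(model))
```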
To measure the scaling laws of MLPs and ViTs on CIFAR-10/100, run:

    sh experiments/mlp.sh
    sh experiments/vit.sh

Visualize the scaling laws by running `notebooks/scaling_laws.ipynb`.
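For reference, the scaling laws compared in the paper are power laws in compute; a generic way to fit loss ≈ a · C^(−b) to the (compute, loss) pairs these sweeps produce is a linear fit in log-log space. The numbers below are made up purely for illustration, and the notebook may do this differently:

```python
import numpy as np

# Hypothetical (compute, loss) measurements from a sweep over model sizes.
compute = np.array([1e15, 4e15, 1.6e16, 6.4e16])
loss = np.array([2.10, 1.85, 1.66, 1.52])

# Fit loss ~ a * compute**(-b) via linear regression in log-log space.
slope, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(log_a), -slope
print(f"loss ~ {a:.3g} * C^(-{b:.3g})")
```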
To reproduce the ImageNet experiments, run:

    sh experiments/imagenet.sh
To reproduce the GPT-2 language modeling experiments, run:

    sh experiments/gpt.sh