This repository contains the code for the paper Compute Better Spent: Replacing Dense Layers with Structured Matrices by Shikai Qiu, Andres Potapczynski, Marc Finzi, Micah Goldblum, and Andrew Gordon Wilson.
Dense linear layers are the dominant computational bottleneck in foundation models. Identifying more efficient alternatives to dense matrices has enormous potential for building more compute-efficient models, as exemplified by the success of convolutional networks in the image domain. In this work, we systematically explore structured matrices as replacements for dense matrices. We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance, especially as models scale. Using insights from the Maximal Update Parameterization, we determine the optimal scaling for initialization and learning rates of these unconventional layers. Finally, we measure the scaling laws of different structures to compare how quickly their performance improves with compute. We propose a novel matrix family containing Monarch matrices, the Block Tensor-Train (BTT), which we show performs better than dense matrices for the same compute on multiple tasks. On CIFAR-10/100 with augmentation, BTT achieves exponentially lower training loss than dense when training MLPs and ViTs. BTT matches dense ViT-S/32 performance on ImageNet-1k with 3.8 times less compute and is more efficient than dense for training small GPT-2 language models.
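As a rough illustration of what a structured replacement for a dense layer looks like, below is a minimal sketch of a Monarch-style two-factor layer, a special case of the BTT family: the dense matrix is replaced by two batches of small blocks applied along the two axes of a reshaped input. This is illustrative only and is not the repo's BTT implementation, which is more general and applies the structure-aware initialization and learning-rate scaling described in the paper.

```python
import torch
import torch.nn as nn


class MonarchLikeLinear(nn.Module):
    """Illustrative Monarch-style structured layer (a special case of BTT).

    Maps d_in = m * n inputs to d_out = p * q outputs with two batches of small
    dense blocks, so the parameter and FLOP count is O(nmp + npq) rather than
    the O(d_in * d_out) of a dense layer.
    """

    def __init__(self, m, n, p, q):
        super().__init__()
        self.m, self.n, self.p, self.q = m, n, p, q
        # First factor: for each of the n input slices, an (m x p) block.
        self.W1 = nn.Parameter(torch.randn(n, m, p) / m ** 0.5)
        # Second factor: for each of the p intermediate channels, an (n x q) block.
        self.W2 = nn.Parameter(torch.randn(p, n, q) / n ** 0.5)

    def forward(self, x):  # x: (batch, m * n)
        B = x.shape[0]
        x = x.view(B, self.m, self.n)                  # (B, m, n)
        z = torch.einsum("bmn,nmp->bnp", x, self.W1)   # mix along the m axis
        y = torch.einsum("bnp,pnq->bpq", z, self.W2)   # mix along the n axis
        return y.reshape(B, self.p * self.q)           # (B, p * q)
```

With m = n = p = q = sqrt(d), such a layer costs roughly 2·d^1.5 multiply-accumulates per input instead of the d^2 of a dense d × d layer, which is the kind of compute saving the scaling-law comparisons below are measuring.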
Please cite this work as:
    @article{qiu2024compute,
        title={{Compute Better Spent: Replacing Dense Layers with Structured Matrices}},
        author={Shikai Qiu and Andres Potapczynski and Marc Finzi and Micah Goldblum and Andrew Gordon Wilson},
        journal={International Conference on Machine Learning (ICML)},
        year={2024}
    }
To set up the environments, simply run

    source setup.sh

This script installs two conda environments named `struct` and `gpt`. Use `struct` for the CIFAR-10/100 and ImageNet experiments, and `gpt` for the GPT-2 experiments. The `struct` environment has more complex dependencies and will take a while to install.
The basic GPT-2 training code and model definition are adapted from nanoGPT. The directories `timm` and `scaling_mlps` are forked from the pytorch-image-models and scaling_mlps repositories, respectively.
To prepare the CIFAR-10/100 datasets, convert them to FFCV's `.beton` format:

    conda activate struct
    python scaling_mlps/data_utils/dataset_to_beton.py --dataset_name cifar10 --mode train --res 32
    python scaling_mlps/data_utils/dataset_to_beton.py --dataset_name cifar10 --mode val --res 32
    python scaling_mlps/data_utils/dataset_to_beton.py --dataset_name cifar100 --mode train --res 32
    python scaling_mlps/data_utils/dataset_to_beton.py --dataset_name cifar100 --mode val --res 32
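The training scripts load these `.beton` files through FFCV. Purely for illustration, here is a minimal sketch of reading one back with FFCV's standard loader; the file path and the field names `image`/`label` are assumptions, and the repo's own data pipeline may differ:

```python
from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import SimpleRGBImageDecoder, IntDecoder
from ffcv.transforms import ToTensor, ToTorchImage, Squeeze

# Hypothetical path to one of the files produced by dataset_to_beton.py.
loader = Loader(
    "cifar10_train_32.beton",
    batch_size=128,
    num_workers=4,
    order=OrderOption.RANDOM,
    pipelines={
        # Field names assumed to be "image" and "label".
        "image": [SimpleRGBImageDecoder(), ToTensor(), ToTorchImage()],
        "label": [IntDecoder(), ToTensor(), Squeeze()],
    },
)

for images, labels in loader:
    pass  # feed batches to the model
```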
For the ImageNet experiments, download and extract the ImageNet-1k dataset from here.
To prepare the OpenWebText dataset for the GPT-2 experiments, run:

    conda activate gpt
    python prepare_owt.py
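Since the GPT-2 code follows nanoGPT, the prepared OpenWebText data is presumably stored as a flat binary array of GPT-2 BPE token ids. The exact file names written by `prepare_owt.py` are an assumption here; the sketch below only illustrates the standard nanoGPT-style way of sampling a next-token-prediction batch from such a file:

```python
import numpy as np
import torch

# Assumption: prepare_owt.py writes uint16 GPT-2 token ids to a flat binary
# file, as nanoGPT's data preparation does. The file name is illustrative.
data = np.memmap("train.bin", dtype=np.uint16, mode="r")


def get_batch(batch_size=8, block_size=1024):
    # Sample random offsets and slice out (input, target) pairs shifted by
    # one token, the standard next-token-prediction setup.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y
```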
To reproduce the figures on structure-aware learning rates, run the following scripts:

    sh experiments/check_dh.sh
    sh experiments/lr_landscape.sh

Then run `notebooks/struct_aware_lr.ipynb` to make the figures.
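The point of these experiments is that the optimal learning rate depends on the structure and shrinks with width. As a purely illustrative sketch (not the repo's actual parameterization), here is how one might assign per-layer Adam learning rates that scale as 1/fan_in for dense weight matrices, in the spirit of the Maximal Update Parameterization:

```python
import torch


def mup_style_param_groups(model, base_lr=1e-3, base_fan_in=64):
    """Illustrative only: scale each weight matrix's Adam learning rate by
    1 / fan_in relative to a base width, in the spirit of muP. The repo derives
    structure-specific rules; this sketch covers only the dense case."""
    groups = []
    for p in model.parameters():
        if p.dim() == 2:                       # 2D weights (nn.Linear stores (out, in))
            fan_in = p.shape[-1]
            lr = base_lr * base_fan_in / fan_in
        else:                                  # biases, norm parameters, etc.
            lr = base_lr
        groups.append({"params": [p], "lr": lr})
    return groups


# Example usage with a hypothetical `model`:
# optimizer = torch.optim.AdamW(mup_style_param_groups(model))
```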
To measure the scaling laws of MLPs and ViTs on CIFAR-10/100, run:

    sh experiments/mlp.sh
    sh experiments/vit.sh

Visualize the scaling laws by running `notebooks/scaling_laws.ipynb`.
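For reference, the scaling laws compared in the paper are power laws in compute; a generic way to fit loss ≈ a · C^(−b) to the (compute, loss) pairs these sweeps produce is a linear fit in log-log space. The numbers below are made up purely for illustration, and the notebook may do this differently:

```python
import numpy as np

# Hypothetical (compute, loss) measurements from a sweep over model sizes.
compute = np.array([1e15, 4e15, 1.6e16, 6.4e16])
loss = np.array([2.10, 1.85, 1.66, 1.52])

# Fit loss ~ a * compute**(-b) via linear regression in log-log space.
slope, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(log_a), -slope
print(f"loss ~ {a:.3g} * C^(-{b:.3g})")
```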
To reproduce the ImageNet experiments, run:

    sh experiments/imagenet.sh
To reproduce the GPT-2 language modeling experiments, run:

    sh experiments/gpt.sh