This is the official implementation of the paper "Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning" on SWAG.
We also implement several representative token-reduction methods on SWAG for comparison: EViT, ToMe, ACT, ATS, Evo-ViT, and SMYRF. Results for all methods are reported in the tables below.
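The hourglass idea in brief: at a chosen layer l, the N image tokens are clustered into h × h cluster tokens, the remaining (most expensive) blocks run on this smaller set, and a token reconstruction layer maps features back to full resolution; l and h are the hyperparameters reported in the first table below. The following is a minimal, simplified sketch of the idea, where hard k-means assignments and a residual broadcast stand in for the paper's soft clustering and attention-style reconstruction:

```python
import torch
import torch.nn.functional as F

def cluster_tokens(x, grid, h, iters=5):
    # x: (N, C) token features with N = grid * grid (e.g. grid = 32 for ViT-L/16 at 512x512).
    # Pool the token map down to h x h centroids, then refine with a few k-means steps.
    centers = F.adaptive_avg_pool2d(x.T.reshape(1, -1, grid, grid), h)
    centers = centers.flatten(2).squeeze(0).T                 # (h*h, C)
    for _ in range(iters):
        assign = torch.cdist(x, centers).argmin(dim=1)        # nearest centroid per token
        centers = torch.zeros_like(centers).index_add_(0, assign, x)
        counts = torch.bincount(assign, minlength=h * h).clamp(min=1)
        centers = centers / counts.unsqueeze(1)
    return centers, assign

def reconstruct_tokens(x_orig, centers, assign):
    # Broadcast each cluster's (transformed) feature back to its original token
    # positions, with a residual connection to the pre-clustering features.
    return x_orig + centers[assign]

x = torch.randn(32 * 32, 1024)                      # 1024 tokens, ViT-L width
centers, assign = cluster_tokens(x, grid=32, h=23)
# ... run the remaining transformer blocks on `centers` (529 tokens instead of 1024) ...
out = reconstruct_tokens(x, centers, assign)        # back to (1024, 1024)
```

Because both steps are non-parametric, they can be inserted into a pre-trained ViT without any fine-tuning, which is the point of the paper.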
Method | Backbone | l | h | GFLOPs | FPS | Throughput (im/s) | Acc. |
---|---|---|---|---|---|---|---|
SWAG | ViT-L/16 | - | 32 | 364.8 | 12.7 | 14.7 | 88.1 |
SWAG + Ours | ViT-L/16 | 12 | 23 | 275.8 | 15.8 | 19.5 | 87.8 |
SWAG + Ours | ViT-L/16 | 6 | 23 | 228.5 | 19.4 | 24.3 | 87.7 |
SWAG + Ours | ViT-L/16 | 8 | 16 | 179.0 | 23.3 | 29.7 | 87.2 |
SWAG | ViT-H/14 | - | 37 | 1023.2 | 5.0 | 5.3 | 88.5 |
SWAG + Ours | ViT-H/14 | 10 | 26 | 652.4 | 7.4 | 8.4 | 88.4 |
SWAG + Ours | ViT-H/14 | 4 | 25 | 514.1 | 9.8 | 11.0 | 87.6 |
SWAG + Ours | ViT-H/14 | 8 | 18 | 422.7 | 11.2 | 13.1 | 87.4 |
Method | Backbone | keep rate | GFLOPs | FPS | Throughput (im/s) | Acc. |
---|---|---|---|---|---|---|
SWAG | ViT-L/16 | - | 364.8 | 12.7 | 14.7 | 88.1 |
SWAG + EViT | ViT-L/16 | 0.8 | 264.9 | 17.4 | 20.1 | 87.9 |
SWAG + EViT | ViT-L/16 | 0.7 | 227.1 | 19.7 | 23.5 | 87.8 |
SWAG + EViT | ViT-L/16 | 0.5 | 170.2 | 24.9 | 31.0 | 87.1 |
SWAG | ViT-H/14 | - | 1023.2 | 5.0 | 5.3 | 88.5 |
SWAG + EViT | ViT-H/14 | 0.7 | 632.3 | 8.0 | 8.6 | 88.4 |
SWAG + EViT | ViT-H/14 | 0.6 | 544.4 | 9.2 | 10.0 | 88.3 |
SWAG + EViT | ViT-H/14 | 0.5 | 472.0 | 10.3 | 11.5 | 88.0 |
Method | Backbone | r | GFLOPs | FPS | Throughput (im/s) | Acc. |
---|---|---|---|---|---|---|
SWAG | ViT-L/16 | - | 364.8 | 12.7 | 14.7 | 88.1 |
SWAG + ToMe | ViT-L/16 | 10 | 317.5 | 13.8 | 15.7 | 88.0 |
SWAG + ToMe | ViT-L/16 | 20 | 271.2 | 16.2 | 18.9 | 87.9 |
SWAG + ToMe | ViT-L/16 | 40 | 184.3 | 22.7 | 28.1 | 87.7 |
SWAG | ViT-H/14 | - | 1023.2 | 5.0 | 5.3 | 88.5 |
SWAG + ToMe | ViT-H/14 | 10 | 890.0 | 5.3 | 5.9 | 88.5 |
SWAG + ToMe | ViT-H/14 | 20 | 760.0 | 6.3 | 7.1 | 88.4 |
SWAG + ToMe | ViT-H/14 | 40 | 516.9 | 9.2 | 10.7 | 88.2 |
Method | Backbone | q_hashes | GFLOPs | FPS | Throughput (im/s) | Acc. |
---|---|---|---|---|---|---|
SWAG | ViT-L/16 | - | 364.8 | 12.7 | 14.7 | 88.1 |
SWAG + ACT | ViT-L/16 | 16 | 333.5 | 9.5 | 13.0 | 85.1 |
SWAG + ACT | ViT-L/16 | 20 | 341.8 | 9.2 | 12.4 | 87.3 |
SWAG + ACT | ViT-L/16 | 24 | 346.2 | 9.0 | 11.9 | 87.8 |
SWAG + ACT | ViT-L/16 | 32 | 352.5 | 8.8 | 11.3 | 88.0 |
SWAG | ViT-H/14 | - | 1023.2 | 5.0 | 5.3 | 88.5 |
SWAG + ACT | ViT-H/14 | 16 | 923.3 | 4.4 | 5.0 | 83.8 |
SWAG + ACT | ViT-H/14 | 20 | 943.5 | 4.3 | 4.8 | 87.5 |
SWAG + ACT | ViT-H/14 | 24 | 956.0 | 4.2 | 4.6 | 88.2 |
SWAG + ACT | ViT-H/14 | 32 | 978.5 | 4.1 | 4.4 | 88.5 |
Method | Backbone | ats location | GFLOPs | FPS | Throughput (im/s) | Acc. |
---|---|---|---|---|---|---|
SWAG | ViT-L/16 | - | 364.8 | 12.7 | 14.7 | 88.1 |
SWAG + ATS | ViT-L/16 | 20 | 323.3 | 14.0 | 16.2 | 84.0 |
SWAG + ATS | ViT-L/16 | 18 | 306.2 | 14.8 | 17.1 | 80.4 |
SWAG + ATS | ViT-L/16 | 16 | 292.1 | 15.5 | 17.9 | 76.0 |
SWAG | ViT-H/14 | - | 1023.2 | 5.0 | 5.3 | 88.5 |
SWAG + ATS | ViT-H/14 | 18 | 731.4 | 6.8 | 7.2 | 88.0 |
SWAG + ATS | ViT-H/14 | 16 | 693.4 | 7.3 | 7.7 | 87.5 |
SWAG + ATS | ViT-H/14 | 14 | 676.3 | 7.1 | 7.3 | 87.1 |
Method | Backbone | pruning location | pruning ratio | GFLOPs | FPS | Throughput (im/s) | Acc. |
---|---|---|---|---|---|---|---|
SWAG | ViT-L/16 | - | - | 364.8 | 12.7 | 14.7 | 88.1 |
SWAG + Evo-ViT | ViT-L/16 | 18 | 0.4 | 307.0 | 14.8 | 17.2 | 87.6 |
SWAG + Evo-ViT | ViT-L/16 | 8 | 0.6 | 259.1 | 17.7 | 19.7 | 86.3 |
SWAG + Evo-ViT | ViT-L/16 | 8 | 0.4 | 210.6 | 20.8 | 25.0 | 84.0 |
SWAG | ViT-H/14 | - | - | 1023.2 | 5.0 | 5.3 | 88.5 |
SWAG + Evo-ViT | ViT-H/14 | 16 | 0.6 | 799.8 | 6.0 | 6.6 | 88.1 |
SWAG + Evo-ViT | ViT-H/14 | 14 | 0.5 | 713.5 | 6.8 | 7.5 | 87.4 |
SWAG + Evo-ViT | ViT-H/14 | 14 | 0.3 | 602.0 | 8.1 | 8.9 | 85.6 |
Method | Backbone | location | rounds | q_cluster_size | k_cluster_size | GFLOPs | FPS | Throughput (im/s) | Acc. |
---|---|---|---|---|---|---|---|---|---|
SWAG | ViT-L/16 | - | - | - | - | 364.8 | 12.7 | 14.7 | 88.1 |
SWAG + SMYRF | ViT-L/16 | 12 | 32 | 64 | 64 | 392 | 6.01 | 6.66 | 86.5 |
SWAG + SMYRF | ViT-L/16 | 12 | 32 | 32 | 32 | 365.6 | 6.54 | 7.32 | 86.3 |
SWAG + SMYRF | ViT-L/16 | 12 | 32 | 16 | 16 | 352.4 | 6.64 | 7.48 | 86.0 |
SWAG | ViT-H/14 | - | - | - | - | 1023.2 | 5.0 | 5.3 | 88.5 |
SWAG + SMYRF | ViT-H/14 | 16 | 32 | 64 | 64 | 1063.8 | 2.77 | 2.87 | 87.3 |
SWAG + SMYRF | ViT-H/14 | 16 | 32 | 32 | 32 | 1005.3 | 2.84 | 3.01 | 87.4 |
SWAG + SMYRF | ViT-H/14 | 16 | 32 | 16 | 16 | 975.98 | 2.89 | 3.04 | 87.2 |
This code has been tested to work with Python 3.8, PyTorch 1.10.1 and torchvision 0.11.2.
Note that CUDA support is not required to run the code.
To set up PyTorch and torchvision, please follow PyTorch's getting started instructions. If you are using conda on a Linux machine, you can use the following setup:
```
conda create --name swag python=3.8
conda activate swag
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
```
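As a quick sanity check (not part of the original setup), you can verify the installation:

```python
import torch, torchvision

# this repo is tested with torch 1.10.1 / torchvision 0.11.2
print(torch.__version__, torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
```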
To use ToMe, you can install the ToMe library as follows:
```
pip install torch-scatter
cd models/ops/tome
python setup.py build develop
```
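For intuition, ToMe removes r tokens per block via bipartite soft matching: the tokens are split into two alternating sets, every token in one set is matched to its most similar token in the other, and the r strongest matches are merged by averaging. Below is a minimal sketch of the matching step (simplified from the real implementation, which matches on attention keys, tracks merged-token sizes, and protects the class token; it uses scatter_reduce_, which requires PyTorch >= 1.12):

```python
import torch
import torch.nn.functional as F

def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    # Bipartite soft matching sketch: merge r tokens, (B, N, C) -> (B, N - r, C).
    a, b = x[:, ::2], x[:, 1::2]                       # alternate split into two sets
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(-1, -2)
    best_val, best_idx = sim.max(dim=-1)               # best partner in b for each a token
    order = best_val.argsort(dim=-1, descending=True)  # most similar a tokens first
    merged, kept = order[:, :r], order[:, r:]
    B, _, C = x.shape
    dst = best_idx.gather(1, merged)                   # (B, r) destinations in b
    src = a.gather(1, merged.unsqueeze(-1).expand(-1, -1, C))
    out_b = b.clone()
    # average each merged a token into its matched b token (needs PyTorch >= 1.12)
    out_b.scatter_reduce_(1, dst.unsqueeze(-1).expand(-1, -1, C), src, reduce="mean")
    out_a = a.gather(1, kept.unsqueeze(-1).expand(-1, -1, C))
    return torch.cat([out_a, out_b], dim=1)

x = torch.randn(2, 1024, 1024)    # e.g. ViT-L/16 at 512x512: 32 * 32 = 1024 tokens
print(merge_tokens(x, 20).shape)  # torch.Size([2, 1004, 1024])
```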
To use ACT, you can install the ACT library as follows:
```
cd models/ops
python setup.py install
```
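For intuition, ACT reduces the query side of self-attention by grouping similar queries with locality-sensitive hashing and computing attention once per group prototype; the q_hashes column above controls the number of hash rounds (more hashes give finer groups and higher accuracy, as the table shows). A minimal sketch of the grouping step, with plain sign-bit random projections standing in for ACT's E2LSH:

```python
import torch

def hash_queries(q: torch.Tensor, q_hashes: int) -> torch.Tensor:
    # q: (N, C) attention queries -> integer group ids (N,).
    # Queries that land in the same bucket share one attention computation.
    proj = torch.randn(q.shape[-1], q_hashes)      # random hyperplanes
    bits = (q @ proj > 0).long()                   # (N, q_hashes) sign bits
    weights = 2 ** torch.arange(q_hashes)          # pack bits into a single id
    return (bits * weights).sum(dim=-1)

q = torch.randn(1024, 64)
groups = hash_queries(q, 16)
# each group is then represented by its mean query ("prototype"); attention is
# computed for the prototypes only and broadcast back to the grouped queries
```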
We also provide a script, imagenet_1k_eval.py, to evaluate the accuracy of our models on ImageNet-1K. It is a slightly modified version of the PyTorch ImageNet example that supports our models.
To evaluate our method, use a command like the one below, where -m selects the model, -r the input resolution, -b the batch size, and -l / -n the clustering location and number of clusters (e.g. -l 12 -n 529 corresponds to the l = 12, h = 23 row above, since 23 × 23 = 529):
```
python imagenet_1k_eval.py -m hourglass_vit_l16_in1k -r 512 -b 256 -l 12 -n 529 /path/to/dataset
```
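If you only need the SWAG baselines from the tables above, they can also be loaded directly from the upstream torch.hub entry point (this uses facebookresearch/SWAG, independent of this repo's evaluation script):

```python
import torch

# ViT-L/16 fine-tuned on ImageNet-1K at 512x512 (the SWAG baseline above)
model = torch.hub.load("facebookresearch/swag", model="vit_l16_in1k")
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 512, 512))
print(logits.shape)  # torch.Size([1, 1000])
```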
To compute the latency and computational cost of our methods, use the command below.
```
python get_flops_fps.py -m hourglass_vit_l16_in1k -r 512 -b 256 -l 12 -n 529
```
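If you want to reproduce this kind of measurement for your own model, a common recipe is fvcore for FLOPs plus a warmed-up, synchronized timing loop for throughput. A generic sketch (not the exact logic of get_flops_fps.py):

```python
import time
import torch
from fvcore.nn import FlopCountAnalysis

def benchmark(model, resolution=512, batch_size=256, warmup=10, iters=50):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.eval().to(device)
    x = torch.randn(batch_size, 3, resolution, resolution, device=device)

    # per-image FLOPs
    flops = FlopCountAnalysis(model, torch.randn(1, 3, resolution, resolution, device=device))
    print(f"{flops.total() / 1e9:.1f} GFLOPs")

    with torch.no_grad():
        for _ in range(warmup):
            model(x)                     # warm up kernels / caches
        if device == "cuda":
            torch.cuda.synchronize()     # don't time pending kernels
        start = time.time()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = time.time() - start
    print(f"{iters * batch_size / elapsed:.1f} im/s")
```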
If you find Hourglass-SWAG useful in your research, please give us a star and cite:
```
@article{singh2022revisiting,
  title={Revisiting Weakly Supervised Pre-Training of Visual Perception Models},
  author={Singh, Mannat and Gustafson, Laura and Adcock, Aaron and Reis, Vinicius de Freitas and Gedik, Bugra and Kosaraju, Raj Prateek and Mahajan, Dhruv and Girshick, Ross and Doll{\'a}r, Piotr and van der Maaten, Laurens},
  journal={arXiv preprint arXiv:2201.08371},
  year={2022}
}

@article{liang2022expediting,
  title={Expediting large-scale vision transformer for dense prediction without fine-tuning},
  author={Liang, Weicong and Yuan, Yuhui and Ding, Henghui and Luo, Xiao and Lin, Weihong and Jia, Ding and Zhang, Zheng and Zhang, Chao and Hu, Han},
  journal={arXiv preprint arXiv:2210.01035},
  year={2022}
}
```
Hourglass-SWAG is released under the CC-BY-NC 4.0 license. See LICENSE for additional details.