Adding the huge vision transformer from SWAG #5721

Merged 24 commits into main on Apr 5, 2022
Conversation

YosuaMichael (Contributor)

Part of #5708

However, we separate this PR from #5714 because the huge model might break the CI.

The purpose of this PR is just to check whether or not the CI is okay with the huge model.

facebook-github-bot commented Apr 1, 2022

💊 CI failures summary and remediations

As of commit e4062f4 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



YosuaMichael changed the base branch from main to add-swag-weight April 1, 2022 15:17
YosuaMichael self-assigned this Apr 1, 2022
Base automatically changed from add-swag-weight to main April 1, 2022 16:09
YosuaMichael and others added 2 commits April 4, 2022 10:36
Co-authored-by: Vasilis Vryniotis <datumbox@users.noreply.github.com>
datumbox (Contributor) commented Apr 4, 2022

As discussed offline, let's measure how much extra time this adds to our CI. We should also check that, after adding it, our CI remains stable across runs. We previously tried to add this model (see #5210), but it was then reverted (see #5259) due to the breakages at #5254. So if the tests pass now, we should make sure they are stable and won't break our CI again.

YosuaMichael (Contributor, Author) commented Apr 4, 2022

> As discussed offline, let's measure how much extra time this adds to our CI. We should also check that, after adding it, our CI remains stable across runs. We previously tried to add this model (see #5210), but it was then reverted (see #5259) due to the breakages at #5254. So if the tests pass now, we should make sure they are stable and won't break our CI again.

@datumbox I have rerun the tests 5 times and all passed, which indicates they are stable now.
I also took note of the additional time the vit_h_14 model introduces:

  • test_windows_cpu: add 70.95s
  • test_windows_gpu: add 27.27s
  • test_extended: add 10.86s

Note: the additional time added on other platforms is usually less than on the Windows platform.

In my opinion this addition is still reasonable; in fact, it is quite similar to the existing regnet_y_128gf model in terms of test time.
What do you think about this?

YosuaMichael (Contributor, Author)

This is the validation script and final output:

python -u ~/script/run_with_submitit.py --timeout 3000 --ngpus 1 --nodes 1 --partition train --model vit_h_14 --data-path="/datasets01_ontap/imagenet_full_size/061417" --test-only --batch-size=1 --weights="ViT_H_14_Weights.IMAGENET1K_SWAG_V1"
# Acc@1 88.552 Acc@5 98.694
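
For a quick local sanity check without the full ImageNet run, the weights can also be loaded directly through the torchvision API. A minimal sketch, not part of this PR (in released torchvision the enum ended up as IMAGENET1K_SWAG_E2E_V1, and these SWAG weights expect 518x518 inputs):

import torch
from torchvision.models import vit_h_14, ViT_H_14_Weights

# This PR used the name IMAGENET1K_SWAG_V1; released torchvision (>= 0.13)
# exposes the end-to-end SWAG weights as IMAGENET1K_SWAG_E2E_V1.
weights = ViT_H_14_Weights.IMAGENET1K_SWAG_E2E_V1
model = vit_h_14(weights=weights).eval()

# The weights carry their own preprocessing (resize/crop to 518x518,
# normalization), exposed via weights.transforms().
preprocess = weights.transforms()

with torch.no_grad():
    out = model(torch.rand(1, 3, 518, 518))  # dummy batch at the expected size
print(out.shape)  # torch.Size([1, 1000])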

datumbox (Contributor) commented Apr 4, 2022

@YosuaMichael Thanks for the analysis. Have you tried reducing the size of the input to speed up the execution?

We do this for multiple models and even have a list of models that get their input reduced (you will need to update the expected file):

slow_models = [
    ...
]
Nevertheless, because ViT must be initialized for a specific size, you might need to add a record to _model_params directly. Could you check it out and let us know how much you can reduce the time by passing a smaller input size?

"vit_h_14": {
"image_size": 56,
"input_shape": (1, 3, 56, 56),
},
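
For context, a minimal sketch of how a test harness might consume such per-model overrides; the helper run_forward_test and the default shape are illustrative assumptions, not torchvision's actual test code:

import torch
import torchvision.models as models

# Hypothetical per-model overrides, mirroring the _model_params idea above.
_model_params = {
    "vit_h_14": {"image_size": 56, "input_shape": (1, 3, 56, 56)},
}

def run_forward_test(model_name, default_shape=(1, 3, 224, 224)):
    kwargs = dict(_model_params.get(model_name, {}))
    input_shape = kwargs.pop("input_shape", default_shape)
    # ViT must be built for a specific image size, so the override goes to
    # the constructor, not just to the input tensor.
    model = getattr(models, model_name)(**kwargs).eval()
    with torch.no_grad():
        out = model(torch.rand(input_shape))
    return out.shape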
YosuaMichael (Contributor, Author) commented Apr 4, 2022

@datumbox this follows your suggestion to change the input image_size for the vit_h_14 test in order to speed it up. The image_size needs to be a multiple of the patch_size, which is 14, hence we use an image_size of 56.
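
As a quick illustration of why 56 works (a sketch, not the test code itself):

patch_size = 14   # ViT-H/14 tokenizes the image into 14x14 pixel patches
image_size = 56   # must be divisible by patch_size; 56 = 4 * 14
assert image_size % patch_size == 0, "image_size must be a multiple of patch_size"
num_patches = (image_size // patch_size) ** 2  # 16 patches, vs. 256 at the default 224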

YosuaMichael (Contributor, Author)

After reducing the image_size there is a speedup, although not a large one: around 1.5-2s for each GPU and CPU test of the model.

datumbox (Contributor)

Very interesting. Does this mean that the majority of the time is spent on the model initialization or on the JIT-script parsing?

YosuaMichael (Contributor, Author) commented Apr 5, 2022

@datumbox I did a bit of profiling locally and here are the results:

[2022-04-04 16:48:41.768663] Before building the model
[2022-04-04 16:48:45.000341] After building model
[2022-04-04 16:48:45.002153] After model.eval().to(device=dev)
[2022-04-04 16:48:45.207033] After doing model(x)
[2022-04-04 16:48:45.208375] After assert expected
[2022-04-04 16:48:45.208385] After assert shape num_classes
[2022-04-04 16:48:50.452526] After check_jit_scriptable
[2022-04-04 16:48:50.667378] After check_fx_compatible
[2022-04-04 16:48:51.256744] Finish

Seems like around 35% of the time is spent on building the model, and another ~55% on check_jit_scriptable.
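
For reference, a small sketch recomputing those shares from the timestamps above:

from datetime import datetime

t = lambda s: datetime.strptime(s, "%H:%M:%S.%f")
total = (t("16:48:51.256744") - t("16:48:41.768663")).total_seconds()  # ~9.49s
build = (t("16:48:45.000341") - t("16:48:41.768663")).total_seconds()  # ~3.23s
jit   = (t("16:48:50.452526") - t("16:48:45.208385")).total_seconds()  # ~5.24s
print(f"build: {build / total:.0%}, jit: {jit / total:.0%}")  # build: 34%, jit: 55%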

datumbox mentioned this pull request Apr 5, 2022
datumbox (Contributor) left a comment

LGTM, thanks!

YosuaMichael merged commit 63576c9 into main Apr 5, 2022
datumbox deleted the add-vit-swag-huge branch April 5, 2022 15:41
datumbox linked an issue Apr 10, 2022 that may be closed by this pull request
facebook-github-bot pushed a commit that referenced this pull request May 5, 2022
Summary:
* Add vit_b_16_swag

* Better handling idiom for image_size; edit test_extended_model to handle the case where the number of params differs from the default due to a different image size input

* Update the accuracy to the experiment results on the torchvision model

* Fix typo missing underscore

* raise exception instead of torch._assert, add back publication year (accidentally deleted)

* Add license information on meta and readme

* Improve wording and fix typo for pretrained model license in readme

* Add vit_l_16 weight

* Update README.rst

* Update the accuracy meta on the vit_l_16_swag model to the result from our experiment

* Add vit_h_14_swag model

* Add accuracy from experiments

* Add the vit_h_14 model to hubconf.py

* Add docs and expected pkl file for test

* Remove legacy compatibility for ViT_H_14 model

* Test vit_h_14 with smaller image_size to speedup the test

(Note: this ignores all push blocking failures!)

Reviewed By: jdsgomes, NicolasHug

Differential Revision: D36095649

fbshipit-source-id: 639dab0577088e18e1bcfa06fd1f01be20c3fd44

Co-authored-by: Vasilis Vryniotis <datumbox@users.noreply.github.com>
Successfully merging this pull request may close these issues.

Add the SWAG pre-trained weights in TorchVision