-
Notifications
You must be signed in to change notification settings - Fork 7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the S3D architecture to TorchVision #6412
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sophiazhi, the PR looks good overall. I have added a few comments, please let me know what you think.
It might be worth looking into the original TF implementation to confirm you implement S3D as expected. I know you don't do the Gated variant but this reference is still valid. As you can see the final It's also worth noting the are using a dropout layer at the end. See this idiom on how to pass the parameter to the network. Finally is there any reason we implement S3D and not it's gated variant S3D-G which is marginally more accurate? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minus reproducing the accuracy, I think we are almost there. Just minor comments left.
tldr; from my point of view this can be merged to unblock but ideally we would follow up and try to close the gap. This is a difficult one, since the gap is larger that we would normally accept (at least from recent contributions, we might find some cases where this was different). I see that @kylemin suggested some changes in the testing setup so might be worth trying them to see if we can close the gap (point 1 you already did as far as I know, but its not clear if you did the others). Another option would be to make it internal only to unblock until we can close the gap. |
class S3D_Weights(WeightsEnum): | ||
KINETICS400_V1 = Weights( | ||
url="https://download.pytorch.org/models/s3d-1bd8ae63.pth", | ||
transforms=partial( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the transforms function should be modified to be consistent with ours. Please refer to this link
@kylemin Thank you very much for the prompt reply. I have some additional questions to ensure we are doing the right thing on our side:
I understand that your method scales the input to [-1, 1]. It's not immediate obvious but that's also what we do. Is there anything other inconsistency you spotted that we should fix?
So instead of resizing the video to 256x256, you propose to do a 5-crop of 224x224 each. Correct?
Sounds good. Do you remember how many 64-frame clips you produce for each Video?
Sounds good, I'll make the change. Thanks for pointing this out. |
@datumbox I am not sure there is another inconsistency, but I believe that modifying the transforms function would improve the accuracy and mostly resolve the issue of performance discrepancy. Yes, I meant 5-crop of 224x224. I remember that this improved the inference accuracy a little bit. I think the rest (dividing a video into multiple segments, jpeg quality, bicubic interp) can wait and are not in high priority because they might not be the reason why there is a non-negligible performance discrepancy. I am unsure if I processed 250 frames at once or not, but I remember that I once tested using four 64-frame clips for each video. Sorry for this uncertainty, but I hope the problem is resolved by modifying the transforms function and scaling the input to [-1, 1]. |
@kylemin Thanks for all the info.
Unfortunately that's what we do already. I know it doesn't look like it immediately when you check the code but basically we:
I thus believe the transform we use is completely aligned with yours. Despite doing this from the beginning, we still face a -5% accuracy gap. I fully appreciate it's been years since you done the work and probably it's hard to remember all the details. I really appreciate the answers. I'll have a look on the rest of the recommendations and let you know what we find. |
I am going to revert the 5/10 crop update I did on transforms because this leads to incompatible batch & collate logic on the reference scripts. The issue is that the aforementioned augmentations make new videos "popup" and thus the length of labels and ids no longer match. It would require far more work to make the references work with this that I don't think it's worth it at this point. Since I plan to hard reset the branch, you can find the full changes here. Part of the reason I'm not making the rest of the required changes is that the original paper explicitly mentions they do single central crops for inference. Switching to BICUBIC interpolation marginally reduces the accuracy by ~0.1. Also digging into the parent repo of Kyle, I see that they are using CV2's default interpolation for resize which is bilinear. I think there might be some additional detail that we might be missing on the inference, possibly related on how the video was scored on the temporal dimension. But I can certainly confirm that our preprocessing is implemented exactly as described on the paper and is aligned with Kyle's repo. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checked the latest changes and LGTM!
thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry didn't get a chance to review yesterday. Had a question about the test input. The code looks good to me.
As for upstreaming to torchvision: we can use our local version S3D for the moment while the weights are getting finalized. Personally, I don't think you need to make this implementation internal as long as the discrepancy of 5% in accuracy is temporarily acceptable.
@@ -312,6 +312,9 @@ def _check_input_backprop(model, inputs): | |||
"mvit_v2_s": { | |||
"input_shape": (1, 3, 16, 224, 224), | |||
}, | |||
"s3d": { | |||
"input_shape": (1, 3, 16, 224, 224), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we looking for this particular input shape during test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The architecture doesn't support the default input size (which is smaller) so here we define the "smallest reasonable" input to make the test work.
@langong347 We have merged the PR and it should appear on the next nightly tomorrow. The weights we use here are identical to the ones you use on your local repo. This means that you don't have to wait for anything to adopt the version contributed by Sophia. |
@datumbox Could we test if loading images in BGR instead of RGB can recover the performance? I found this log of S3D experiments and I think that how each video is scored on the temporal dimension does not affect the performance that much. |
So I've found a script that was used to extract the frames using ffmpeg and to save them in jpeg format. I am not sure this helps at this point.. but anyway I'll leave it here. Thanks
|
@kylemin Thanks so much for digging into the logs and files to help out. I really appreciate the help!
I've tested (see #6461) that but it doesn't help, the accuracy is lower. I think you were scoring RGB frames because your do the conversion from BGR to RGB on your original repo in this line.
I think I understand why you say this now that I can see your preprocessing script at #6412 (comment). It's because after decoding you store the frames as JPEG files. We don't do that on our side, we just directly use the decoded picture. Another difference that I noticed is that you actually resize videos to 256 on the smallest dimension (with a hard max of So the accuracy gap remains strange to me. If you have any other hypothesis, I'm happy to explore it. Otherwise, thanks a lot for your help! |
Training the network from scratch yields a network with the following accuracy using single crops with 128 frames:
Unfortunately the above leads to severe overfitting on the Training set, as we see below:
I think we need to add some higher regularization. The paper is a bit unclear on the hyper-params used but on the TF code-base I see the default value for Dropout=0.2. I'll give that a try and if I still see overfitting I'll tune the weight decay as well. |
I ran with a test with
And the result I got is:
Seems like it still can't reach |
@YosuaMichael Can you try with |
Summary: * S3D initial commit * add model builder code and docstrings * change classifier submodule, populate weights enum * fix change of block args from List[List[int]] to ints * add VideoClassification to transforms * edit weights url for testing, add s3d to models.video init * norm_layer changes * norm_layer and args fix * Overwrite default dropout * Remove docs from internal submodules. * Fix tests * Adding documentation. * Link doc from main models.rst * Fix min_temporal_size * Adding crop/resize parameters in references script * Release weights. * Refactor dropout. * Adding the weights table in the doc Reviewed By: datumbox Differential Revision: D39013679 fbshipit-source-id: 140f7531dcecf65396518e8632f639b3a2a1cfad Co-authored-by: Vasilis Vryniotis <datumbox@users.noreply.github.com> Co-authored-by: Vasilis Vryniotis <vvryniotis@fb.com>
@datumbox using
Here are the full params:
|
@YosuaMichael My understanding form reading the documentation of ffprobe is that the default interpolation is bilinear not bicubic. Worth checking that as well to see if you get an improvement if you have the bandwidth. |
Hi @datumbox , I have tried with BILINEAR and here are what I got: with
with
I think BILINEAR and BICUBIC produce a relatively similar result in this case. |
Thanks @YosuaMichael. I've been training a new model the whole week but due to infra problems I don't have a new model. The accuracy you report is on par with the model that I have currently trained from scratch. I still have overfitting problems so I'm sure there is more to be done here to improve it. I'll report results here as I get them. |
Training a network with Dropout=0.2 yields the following results:
Which is similar to Dropout=0.0. This model is about +1 Acc@1 point higher than the one we deployed but still have roughly 4 points gap from the original paper. I'm still observing significant amount of overfitting as we don't provide currently offer an good way to do augmentations for Video (coming up). I could keep trying to tweak the configuration to avoid overfitting but this job requires lots of resources and currently want to dedicate them on the testing of Transforms. @YosuaMichael Any thoughts on whether we should deploy these weights instead of the ported ones? |
Hi @datumbox, overall I prefer to deploy the new weight as 1 point difference is quite significant. However I have no strong opinion on this since we may also update the weight later once we have more augmentation on video. |
Fixes #6402
Add S3D model for video classification.